XCP-ng
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Groups
    • Register
    • Login

    Weird kern.log errors

    Scheduled Pinned Locked Moved Solved Compute
    22 Posts 5 Posters 5.8k Views 4 Watching
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • olivierlambertO Offline
      olivierlambert Vates 🪐 Co-Founder CEO
      last edited by

      Segfault on tapdisk isn't common. I would start to run Memtest on that machine first 🙂

      D 1 Reply Last reply Reply Quote 0
      • D Offline
        dredknight @olivierlambert
        last edited by

        Thank you both 🙂

        We run memtest twice, but nothing was reported.

        Hardware is:

        • host: Dell PowerEdge C6525
        • CPU: 2 x EPYC 7532 32 Core
        • Memory: 16 x 16GB 3200 MHz DDR4
        • Disks: 2 x 512 GB SSDs in RAID 1
        • RAID: Dell Perc H345

        BIOS is v. 2.10.2
        XCP is version 8.2.1 latest update (yum updated 10 days ago when it was reinstalled). Couldnt find where to get the exact patch number.

        We have other servers with the same setup but just this one is having those issues.

        1 Reply Last reply Reply Quote 0
        • olivierlambertO Offline
          olivierlambert Vates 🪐 Co-Founder CEO
          last edited by

          Adding @fohdeesha in the loop.

          To me, that's maybe the issue: this hardware seems totally fine and if it works on other machines but not this one, I'm kind of feeling there's an issue due to this very machine;

          1 Reply Last reply Reply Quote 0
          • D Offline
            dredknight
            last edited by

            after some more debugging we found this issues reported by dmesg

            dmesg |grep WARN -A 5
            --
            [126340.468112] WARNING: CPU: 11 PID: 1311 at arch/x86/xen/multicalls.c:130 xen_mc_flush+0x1aa/0x1c0
            [126340.468113] Modules linked in: ebtable_nat arptable_filter arp_tables xt_set ip_set_hash_net ip_set nfnetlink tun rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 nfs lockd grace fscache ebt_arp ebt_ip ebtable_filter ebtables xt_physdev br_netfilter bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc bonding bridge 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c dm_multipath iptable_filter sr_mod cdrom sunrpc nls_iso8859_1 nls_cp437 vfat fat dm_mod uas usb_storage dcdbas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sg i2c_piix4 ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ip_tables x_tables hid_generic usbhid hid sd_mod ahci libahci
            [126340.468140]  i40e(O) xhci_pci libata megaraid_sas(O) xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
            [126340.468145] CPU: 11 PID: 1311 Comm: forkexecd Tainted: G        W  O      4.19.0+1 #1
            [126340.468146] Hardware name: Dell Inc. PowerEdge C6525/04DK47, BIOS 2.10.2 10/25/2022
            [126340.468147] RIP: e030:xen_mc_flush+0x1aa/0x1c0
            --
            [126340.468351] WARNING: CPU: 12 PID: 0 at arch/x86/xen/multicalls.c:130 xen_mc_flush+0x1aa/0x1c0
            [126340.468358] Modules linked in: ebtable_nat arptable_filter arp_tables xt_set ip_set_hash_net ip_set nfnetlink tun rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 nfs lockd grace fscache ebt_arp ebt_ip ebtable_filter ebtables xt_physdev br_netfilter bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc bonding bridge 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c dm_multipath iptable_filter sr_mod cdrom sunrpc nls_iso8859_1 nls_cp437 vfat fat dm_mod uas usb_storage dcdbas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sg i2c_piix4 ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ip_tables x_tables hid_generic usbhid hid sd_mod ahci libahci
            [126340.468423]  i40e(O) xhci_pci libata megaraid_sas(O) xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
            [126340.468438] CPU: 12 PID: 0 Comm: swapper/12 Tainted: G        W  O      4.19.0+1 #1
            [126340.468439] Hardware name: Dell Inc. PowerEdge C6525/04DK47, BIOS 2.10.2 10/25/2022
            [126340.468443] RIP: e030:xen_mc_flush+0x1aa/0x1c0
            1 Reply Last reply Reply Quote 0
            • olivierlambertO Offline
              olivierlambert Vates 🪐 Co-Founder CEO
              last edited by

              Interesting. Can you check to have all your BIOS/firmware to their latest versions?

              D 1 Reply Last reply Reply Quote 0
              • D Offline
                dredknight @olivierlambert
                last edited by

                @olivierlambert here are the versions we have:

                • BIOS 2.10.2
                  I- drac - 6.10.30.0
                • Ethernet - 30.6.24
                • RAID backplane - 4.36
                • RAID Slot - 51.16.0-4076

                Devices have latest firmware, updated 2 weeks ago as recommended by Dell.

                1 Reply Last reply Reply Quote 0
                • DanpD Offline
                  Danp Pro Support Team
                  last edited by

                  And your other servers are running these same BIOS versions?

                  D 1 Reply Last reply Reply Quote 0
                  • olivierlambertO Offline
                    olivierlambert Vates 🪐 Co-Founder CEO
                    last edited by

                    I don't remember all the versions in mind but what matters is to have all of them fully up to date 🙂 (so if it's the case, it's good!)

                    I'm adding @andSmv and @bleader in the loop so they can take a look at your log to see if it rings a bell.

                    1 Reply Last reply Reply Quote 0
                    • D Offline
                      dredknight @Danp
                      last edited by dredknight

                      @Danp yes all servers came in one batch and had the same hardware and software installed at the time.
                      By all, I mean 36 in total same hardware and same workload. So it is just 1/36 error rate which pretty much means hardware
                      issue for me, but yet there is no diagnostic report of it which is quite peculiar.

                      1 Reply Last reply Reply Quote 0
                      • olivierlambertO Offline
                        olivierlambert Vates 🪐 Co-Founder CEO
                        last edited by

                        And you can reproduce the issue only on this one? 🤔

                        D 1 Reply Last reply Reply Quote 0
                        • D Offline
                          dredknight @olivierlambert
                          last edited by dredknight

                          Yes only on this one it happens in the span of 2-5 days after complete reset.
                          From user perspective we know the following.

                          Server has 2 vms running on 80-100% CPU utilization (each server has 64 cores assigned, the server has 2 EPYC CPUs for a total of 128 cores).
                          After the issue occurs one of the VM becomes unresponsive and the other is kind of fine - you can login but no commands can be executed on it. For example you type "top" or "df -h" press enter and it stays like that indefinitely with no output.

                          One tip I got, though a bit far-fetched, is it can be a "cosmic ray" behaviour. I dont know about that but so far nothing else can be tracked.

                          1 Reply Last reply Reply Quote 0
                          • olivierlambertO Offline
                            olivierlambert Vates 🪐 Co-Founder CEO
                            last edited by olivierlambert

                            When do you say "host", are you talking about the physical host or a VM?

                            D 1 Reply Last reply Reply Quote 0
                            • D Offline
                              dredknight @olivierlambert
                              last edited by dredknight

                              @olivierlambert host = physical server. I fixed it in the last post.

                              1 Reply Last reply Reply Quote 0
                              • olivierlambertO Offline
                                olivierlambert Vates 🪐 Co-Founder CEO
                                last edited by

                                That's indeed a clue if it happens ONLY on this machine if you run the same VMs on others without triggering any problem, and if you have exact same versions of XCP-ng/BIOS/firmware and the same hardware 🤔

                                D 1 Reply Last reply Reply Quote 0
                                • D Offline
                                  dredknight @olivierlambert
                                  last edited by

                                  @olivierlambert said in Weird kern.log errors:

                                  That's indeed a clue if it happens ONLY on this machine if you run the same VMs on others without triggering any problem, and if you have exact same versions of XCP-ng/BIOS/firmware and the same hardware

                                  Reply

                                  All VMs spin VMs from the same image (Using CloudStack on top) and all workload pushed on them is the same.
                                  If one VM was the issue it would make sense the VM is at fault, but in the current case both VMs on that server get broken at the same time.

                                  All servers have 2 SSDs in RAID 1. In the last iteration we did not use RAID and placed 1 VM on different SSD disk just in case the issue can come from there. The problem still appeared in the same way as before.

                                  We are thinking of other ways to hold tests at the moment. Will keep you updated :).

                                  1 Reply Last reply Reply Quote 1
                                  • olivierlambertO Offline
                                    olivierlambert Vates 🪐 Co-Founder CEO
                                    last edited by

                                    Please do, I'm eager to know the root cause 🙂

                                    1 Reply Last reply Reply Quote 0
                                    • bleaderB Offline
                                      bleader Vates 🪐 XCP-ng Team
                                      last edited by

                                      It doesn't ring a bell as it is for me.

                                      What I see from the first log is the segfault on blktap and in xcp-rrdd-xenpm, likely that was while writing to a disk. In all cases, it is a xen_mc_flush() call.

                                      Given it happens on a single machine, I would venture it could be related to the disk controller, or disk itself, you could try to have a look at a dmidecode to see if the controllers are the same as on othe machines (sometimes there are small discrepencies between supposedly identical hardware), and check the drives with smartctl for any health issues. But especially as you were on raid1 originally, I doubt an issue with the drives themselves would lead to such an issue...

                                      1 Reply Last reply Reply Quote 1
                                      • andSmvA Offline
                                        andSmv Vates 🪐 XCP-ng Team Xen Guru
                                        last edited by

                                        Yeah, The HW problem seems to be a good guess.

                                        The track that we can follow here is xen_mc_flush kernel function which raises a warning when a multicall (hypercall wrapper) fails. The interesting thing here would be to take a look at XEN traces. You can type xl dmesg in dom0 to see if XEN tells something more (if it isn't happy on some reason)

                                        1 Reply Last reply Reply Quote 0
                                        • D Offline
                                          dredknight
                                          last edited by

                                          FYI - it was a CPU issue.
                                          We changed the CPUs between servers and it moved with them.

                                          Thanks for the tips everyone!

                                          1 Reply Last reply Reply Quote 0
                                          • olivierlambertO Offline
                                            olivierlambert Vates 🪐 Co-Founder CEO
                                            last edited by

                                            Ahh great news! CPU issues are really tricky 🙂

                                            1 Reply Last reply Reply Quote 0
                                            • olivierlambertO olivierlambert marked this topic as a question on
                                            • olivierlambertO olivierlambert has marked this topic as solved on
                                            • First post
                                              Last post