XCP-ng
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Groups
    • Register
    • Login

    host crash - guest_4.o#sh_page_fault__guest

    Scheduled Pinned Locked Moved Solved Development
    24 Posts 5 Posters 8.4k Views 4 Watching
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • borzelB Offline
      borzel XCP-ng Center Team
      last edited by borzel

      Hello,

      our new XCP-ng Host (member of a pool) did crash last night:

      • XCP-ng 7.6 (latest updates, no extra or testing packages)
      • Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
      • 256 GB ECC DDR4 RAM

      I have a complete crashdump on hand. It seems that a guest had a page fault (on pCPU9) and this crashed the whole host. I did read trough the crashdump, but have no clue 😕

      Highlight:

      (XEN) [342270.016835] ----[ Xen-4.7.6-6.2.1.xcp  x86_64  debug=n   Not tainted ]----
      (XEN) [342270.016837] CPU:    9
      (XEN) [342270.016840] RIP:    e008:[<ffff82d08022ca73>] guest_4.o#sh_page_fault__guest_4+0xb62/0x1fa0
      (XEN) [342270.016845] RFLAGS: 0000000000010282   CONTEXT: hypervisor (d13v0)
      
      (XEN) [342270.016993] Panic on CPU 9:
      (XEN) [342270.016994] FATAL PAGE FAULT
      (XEN) [342270.016995] [error_code=0000]
      

      This host was running some oracle linux and some ubuntu's.
      --> How can I see if the bad VM was a PV or HVM guest?

      Thanks for any help!

      xen.log and xen.pcpu9.stack.log from the crashdump:
      https://transfer.delta-barth.de/index.php/s/HZFYGEY6agwzB3J
      Password: xencrash

      Edit: Link removed, Problem is solved.

      1 Reply Last reply Reply Quote 0
      • R Offline
        r1 XCP-ng Team
        last edited by

        This is Xen crash. Let me see if I can look at the function from source. I found the srpm at https://updates.xcp-ng.org/7/dev/updates/Source/SPackages/ and it seems you are not fully updated.. the recent package should xen-hypervisor-4.7.6-6.3.1.xcp.

        1 Reply Last reply Reply Quote 0
        • R Offline
          r1 XCP-ng Team
          last edited by

          How many VMs were running there?

          PV VM list => # xe vm-list PV-bootloader=pygrub

          1 Reply Last reply Reply Quote 0
          • R Offline
            r1 XCP-ng Team
            last edited by r1

            Found that your package is missing a critical patch https://xenbits.xen.org/gitweb/?p=xen.git;a=patch;h=ab6d56c4cac1498f20e5cde99a6e8af5f45d2bb0 or XSA-280.

            There is a check in place to avoid this crash.

            diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
            index cb24a0fdef..1ec3e35d20 100644
            --- a/xen/arch/x86/mm/shadow/multi.c
            +++ b/xen/arch/x86/mm/shadow/multi.c
            @@ -3314,8 +3314,8 @@ static int sh_page_fault(struct vcpu *v,
             
                 /* Unshadow if we are writing to a toplevel pagetable that is
                  * flagged as a dying process, and that is not currently used. */
            -    if ( sh_mfn_is_a_page_table(gmfn)
            -         && (mfn_to_page(gmfn)->shadow_flags & SHF_pagetable_dying) )
            +    if ( sh_mfn_is_a_page_table(gmfn) && is_hvm_domain(d) &&
            +         mfn_to_page(gmfn)->pagetable_dying )
                 {
                     int used = 0;
                     struct vcpu *tmp;
            @@ -4256,9 +4256,9 @@ int sh_rm_write_access_from_sl1p(struct domain *d, mfn_t gmfn,
                 ASSERT(mfn_valid(smfn));
             
                 /* Remember if we've been told that this process is being torn down */
            -    if ( curr->domain == d )
            +    if ( curr->domain == d && is_hvm_domain(d) )
                     curr->arch.paging.shadow.pagetable_dying
            -            = !!(mfn_to_page(gmfn)->shadow_flags & SHF_pagetable_dying);
            +            = mfn_to_page(gmfn)->pagetable_dying;
             
                 sp = mfn_to_page(smfn);
             
            @@ -4575,10 +4575,10 @@ static void sh_pagetable_dying(struct vcpu *v, paddr_t gpa)
                         smfn = shadow_hash_lookup(d, mfn_x(gmfn), SH_type_l2_pae_shadow);
                     }
             
            -        if ( mfn_valid(smfn) )
            +        if ( mfn_valid(smfn) && is_hvm_domain(d) )
                     {
                         gmfn = _mfn(mfn_to_page(smfn)->v.sh.back);
            -            mfn_to_page(gmfn)->shadow_flags |= SHF_pagetable_dying;
            +            mfn_to_page(gmfn)->pagetable_dying = 1;
                         shadow_unhook_mappings(d, smfn, 1/* user pages only */);
                         flush = 1;
                     }
            @@ -4615,9 +4615,9 @@ static void sh_pagetable_dying(struct vcpu *v, paddr_t gpa)
                 smfn = shadow_hash_lookup(d, mfn_x(gmfn), SH_type_l4_64_shadow);
             #endif
             
            -    if ( mfn_valid(smfn) )
            +    if ( mfn_valid(smfn) && is_hvm_domain(d) )
                 {
            -        mfn_to_page(gmfn)->shadow_flags |= SHF_pagetable_dying;
            +        mfn_to_page(gmfn)->pagetable_dying = 1;
                     shadow_unhook_mappings(d, smfn, 1/* user pages only */);
                     /* Now flush the TLB: we removed toplevel mappings. */
                     flush_tlb_mask(d->domain_dirty_cpumask);
            

            Check your yum logs for last updates on hypervisor
            # cat /var/log/yum.log* | grep xen-hypervisor

            1 Reply Last reply Reply Quote 0
            • borzelB Offline
              borzel XCP-ng Center Team
              last edited by borzel

              It's a fresh installed XCP-ng 7.6 (last week), did a yum update after install. Weird .... I look and come back in a minute.

              yum list installed | grep xen-hyper
              xen-hypervisor.x86_64             4.7.6-6.3.1.xcp           @xcp-ng-updates
              

              hmm......🤔

              1 Reply Last reply Reply Quote 0
              • borzelB Offline
                borzel XCP-ng Center Team
                last edited by

                Ok, it's clear now!

                I installed all updates, but did not reboot this host 😞 I can see this on my uptime graph and the install time:

                cat /var/log/yum.log* | grep xen-hypervisor
                Feb 18 14:35:33 Installed: xen-hypervisor-4.7.6-6.2.1.xcp.x86_64
                Feb 24 02:34:04 Updated: xen-hypervisor-4.7.6-6.3.1.xcp.x86_64
                

                Host did not reboot after installation of the updates on Feb 24, Crash was in the night from Feb 26 to Feb 27:

                39ca869e-8c46-4e08-ac98-b022666ad196-grafik.png

                @ri many thanks to you!

                1 Reply Last reply Reply Quote 0
                • borzelB Offline
                  borzel XCP-ng Center Team
                  last edited by

                  Crash happened again, this time with Xen version: 4.7.6-6.3.1.xcp

                  Let me wake up fully 🛌 ... Details will follow....

                  1 Reply Last reply Reply Quote 0
                  • R Offline
                    r1 XCP-ng Team
                    last edited by

                    Woah! Something serious.. Eager.

                    borzelB 1 Reply Last reply Reply Quote 0
                    • borzelB Offline
                      borzel XCP-ng Center Team @r1
                      last edited by

                      @r1 hhm.....

                      1 Reply Last reply Reply Quote 0
                      • S Offline
                        Steve_Sibilia
                        last edited by

                        Exact same situation here.
                        2 different host both crashed, I suspect is one of the vm because I've moved them from the first one that crashed to second one.
                        Let me know if you found something to address this issue.
                        Many thanks.

                        1 Reply Last reply Reply Quote 0
                        • R Offline
                          r1 XCP-ng Team
                          last edited by

                          @Steve_Sibilia @borzel I think you need to update hosts. https://xcp-ng.org/forum/post/9343

                          1 Reply Last reply Reply Quote 0
                          • S Offline
                            Steve_Sibilia
                            last edited by

                            I already updated it.
                            [root@vivhv11 20190301-211627-CET]# rpm -q openvswitch
                            openvswitch-2.5.3-2.2.3.3.xcpng.x86_64
                            and it crashed afterward.

                            1 Reply Last reply Reply Quote 0
                            • R Offline
                              r1 XCP-ng Team
                              last edited by

                              Ok.. so its not related then.

                              Do you have other leads - such as crash logs or kern.log to guess?

                              1 Reply Last reply Reply Quote 0
                              • S Offline
                                Steve_Sibilia
                                last edited by

                                Yes,
                                here: https://cloud.sinapto.net/index.php/s/13nUsnunRKmdUBC
                                you can find crash log. For what I can understand it is related to an host which was pv with an older version of the tools (7.2). After last crash I've update tools to the latest version hoping that this will fix the issue.
                                All the crashes (4 so far, on 2 different xcp hosts) happened outside peak hours for our infrastructure (friday evening after business hours and saturday during lunch) so I don't think is load average related but I'm not certain about it.
                                Let me know if you need further informations.
                                Thanks
                                Steve

                                1 Reply Last reply Reply Quote 0
                                • R Offline
                                  r1 XCP-ng Team
                                  last edited by

                                  Guest tools should not be able to crash host - it will be treated as critical bug. Your stack is similar to that of reported earlier. And even though you have updated Xen version, the trigger could be something else...

                                  1 Reply Last reply Reply Quote 0
                                  • DanpD Offline
                                    Danp Pro Support Team
                                    last edited by

                                    FWIW, it appears that my fully patched 7.6 xcp host also rebooted the morning of 2/27

                                    System Booted: 2019-02-27 04:46

                                    I haven't dug into the logs yet.

                                    1 Reply Last reply Reply Quote 0
                                    • R Offline
                                      r1 XCP-ng Team
                                      last edited by

                                      @Steve_Sibilia @Danp @borzel can you share # xenpm get-cpuidle-states and BIOS power saving settings?

                                      S 1 Reply Last reply Reply Quote 0
                                      • DanpD Offline
                                        Danp Pro Support Team
                                        last edited by

                                        Here the cpuidlestates.txt. Unsure how to get the BIOS settings without rebooting.

                                        1 Reply Last reply Reply Quote 0
                                        • S Offline
                                          Steve_Sibilia @r1
                                          last edited by

                                          @r1 yes for sure.
                                          Here: https://cloud.sinapto.net/index.php/s/0X8GdLxnTrw6bmd the output pf get-cpu-idle-states.
                                          Power management is set up for maximun performance, power cap is disabled.
                                          The system is a Dell poweredge r640.
                                          Let me know if you need more details.
                                          Steve

                                          1 Reply Last reply Reply Quote 0
                                          • R Offline
                                            r1 XCP-ng Team
                                            last edited by

                                            @stormi

                                            While digging on multiple things I'm curious on one patch

                                            https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=9dc1e0cd81ee469d638d1962a92d9b4bd2972bfa

                                            It seems to have been applied on

                                            1. master
                                              https://xenbits.xen.org/gitweb/?p=xen.git;a=history;f=xen/arch/x86/mm/shadow/multi.c;h=7dc39d75651e0bd85fe9a4b5c6c586648d7a7ab2;hb=refs/heads/master

                                            2. stable-4.9 https://xenbits.xen.org/gitweb/?p=xen.git;a=history;f=xen/arch/x86/mm/shadow/multi.c;h=7dc39d75651e0bd85fe9a4b5c6c586648d7a7ab2;hb=refs/heads/stable-4.9

                                            But its not there on

                                            1. stable-4.8 https://xenbits.xen.org/gitweb/?p=xen.git;a=history;f=xen/arch/x86/mm/shadow/multi.c;h=7dc39d75651e0bd85fe9a4b5c6c586648d7a7ab2;hb=refs/heads/stable-4.8

                                            and

                                            1. stable-4.7 https://xenbits.xen.org/gitweb/?p=xen.git;a=history;f=xen/arch/x86/mm/shadow/multi.c;h=7dc39d75651e0bd85fe9a4b5c6c586648d7a7ab2;hb=refs/heads/stable-4.7

                                            And I think this could have relation of the host crash mentioned above. The function guest_walk_to_gfn is used in stable-4.7 but not at places mentioned by this specific patch.

                                            Will try to find more on this, my tomorrow.

                                            1 Reply Last reply Reply Quote 0
                                            • First post
                                              Last post