XCP-ng

    XCP-ng host restarts at random intervals

    Scheduled Pinned Locked Moved Compute
    24 Posts 7 Posters 5.3k Views 4 Watching
    • olivierlambert Vates 🪐 Co-Founder CEO

      Hmm, interesting. Have you taken a look at the Xen side of things in terms of logs?

      • christopher-petzel @olivierlambert

        I was wrong about the hypervisor, it is restarting. I confused myself and didn't make the connection.

        In /var/log/xen/hypervisor.log... I see a "Logfile Opened" entry with the timestamp of when the log rotates, then another "Logfile Opened" entry at the timestamp of the hypervisor restart, followed by the Xen log data from boot.
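A minimal sketch of that check, assuming only that the restart shows up as a line containing the phrase "Logfile Opened". The sample lines and their timestamp layout are hypothetical; the real format of hypervisor.log may differ.

```python
from typing import Iterable, List

MARKER = "Logfile Opened"  # phrase quoted in the post above


def find_log_openings(lines: Iterable[str]) -> List[str]:
    """Return every line that contains the 'Logfile Opened' marker."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


# Hypothetical sample lines, for illustration only.
sample = [
    "2023-01-01 03:10:01 Logfile Opened\n",
    "2023-01-01 09:42:17 Logfile Opened\n",
    "2023-01-01 09:42:17 (XEN) boot message\n",
]
openings = find_log_openings(sample)
for entry in openings:
    print(entry)
```

On a host, the same function could be given an open file handle for /var/log/xen/hypervisor.log. An opening that does not line up with the log rotation time would be a candidate restart.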

        So I guess I need to think about why the hypervisor is restarting. Now I'm questioning whether the hardware is restarting. I have not seen a hardware restart in the IPMI data, and the recovery time seemed too short for a hardware restart. However, a lack of evidence is not evidence in itself, so I think my next move will be to monitor the hardware in a way that lets me confirm or rule out a hardware restart.

        Thanks for your help, @olivierlambert. It may be a couple of months before this happens again, but I'll report back what I find once it happens.

        • christopher-petzel @christopher-petzel

          @olivierlambert I have been able to confirm this is a hardware reboot. Since I've been working on this issue for a year and the restarts were so rare, at some point I convinced myself that the hardware was not restarting, even though my monitoring and logging were telling me otherwise.

          Thanks for your help in guiding me to reconsider what I thought I already knew. Thankfully the restarts have become more frequent and I have had 3 reboots in 10 days. That frequency has allowed me to catch what was really happening.

          • olivierlambert Vates 🪐 Co-Founder CEO

            Ah "great" news then 🙂 Is there anything else we can do to help?

            • christopher-petzel @olivierlambert

              @olivierlambert Just tell people to stick with HP hardware 😄 This problem server has a SuperMicro system board, and it's the second board of the same model with which I've had a hardware problem. The other board stopped working completely, so it was a different failure mode. Once I retire this hardware, I will have no more SuperMicro boards in production.

              • olivierlambert Vates 🪐 Co-Founder CEO

                I'm not entirely surprised (we tell people to use Dell or HPE). Sometimes it's a bit of a lottery with Supermicro, but we also know hosting companies using SM at scale without problems…

                • christopher-petzel

                  Since I last posted on this topic, I've found that the random reboots only occur when there are Windows Server VMs on the host (tested with 2019 and 2022). The issue does not occur when running only Linux VMs.

                  My issue seems very similar to the problem described (and solved) in https://xcp-ng.org/forum/topic/6683/windows-server-2019-sporadic-reboot/7

                  The difference is that in my case, the host restarted and in the other post, the poster reports that the VMs are restarting. Since the poster also tested RAM and found no problems but was able to solve the issue by replacing a suspected DIMM, that information may be useful in the host reboot scenario that I experience.

                  FYI, I have not replaced the RAM yet and may not actually do it since the server in question is aging and will likely be replaced (with HP hardware) soon.

                  • olivierlambert Vates 🪐 Co-Founder CEO

                    Thanks for keeping us posted 🙂

                    • splastunov

                      Hello!

                      Do all VMs on this host belong to you, and do you know for certain what processes are running on them?

                      I had the same issue with a Dell R630.
                      The solution was to update to the latest BIOS.
                      I think that some clients ran some software that triggered a bug, and the host rebooted.

                      XCP-ng security updates did not help.
                      In my case, only the BIOS update fixed the sudden crashes.

                      So a workaround would be to move VMs one by one to another host and check whether that solves the problem.

                      • christopher-petzel @splastunov

                        @splastunov Yes, all VMs are for in-house use and all were built by me personally.

                        I have previously followed the same steps that you followed in your case. I updated the BIOS on the host server and moved VMs one by one.

                        Moving VMs one by one is how I eventually found that I only had the problem when a Windows Server VM was on the host. When I had this problem occur with a fresh Windows Server 2022 VM which had no applications installed, I started to suspect that it was related to Windows. I was then able to confirm that this only occurred with Windows VMs.

                        Thanks for the info. I think these are great steps toward finding the problem.

                        • olivierlambert Vates 🪐 Co-Founder CEO

                          The /var/crash folder might also be interesting (Dom0.log and Xen log, to see who is triggering the crash)

                          • christopher-petzel

                            I believe I have the definitive cause for this 'random host reboot' issue.

                            After 6 months of problem-free operation, I have experienced the host reboot issue again on this server. The host was running only Linux VMs, so the theory of Windows VMs on the host contributing to the reboot issue has proven false. As with each time before, there are no indications in any relevant log files that the host is going to reboot. I think at this point I can definitively say that the reboot is caused by a faulty SuperMicro motherboard.

                            I've learned my lesson: use HPE servers! This SuperMicro system will be melted down for scrap.

                            • olivierlambert Vates 🪐 Co-Founder CEO

                              Thanks for the feedback 🙂

                              Well, at least keep us posted if you have the same issue with other hardware, we'll be happy to help 🙂

                              • Chmura

                                Hi @olivierlambert
                                Now, I have the same problem on 4 servers. Machines reset every few hours!!! Please HELP.

                                The machines have been running stably since:

                                reboot system boot 4.19.0+1 Wed Dec 28 12:30 - 05:50 (217+16:19)
                                

                                Since then, the following patches have been installed, but the hosts were not restarted:

                                May 16 09:07:40 Updated: xen-libs-4.13.5-9.30.3.xcpng8.2.x86_64
                                May 16 09:07:41 Updated: guest-templates-json-1.9.6-1.2.xcpng8.2.noarch
                                May 16 09:07:41 Updated: xcp-ng-release-presets-8.2.1-6.x86_64
                                May 16 09:07:41 Updated: xen-hypervisor-4.13.5-9.30.3.xcpng8.2.x86_64
                                May 16 09:07:42 Updated: xen-dom0-libs-4.13.5-9.30.3.xcpng8.2.x86_64
                                May 16 09:07:43 Updated: xen-tools-4.13.5-9.30.3.xcpng8.2.x86_64
                                May 16 09:07:44 Updated: xen-dom0-tools-4.13.5-9.30.3.xcpng8.2.x86_64
                                May 16 09:07:48 Updated: xcp-ng-release-config-8.2.1-6.x86_64
                                May 16 09:07:49 Updated: xcp-ng-release-8.2.1-6.x86_64
                                May 16 09:07:49 Updated: guest-templates-json-data-other-1.9.6-1.2.xcpng8.2.noarch
                                May 16 09:07:50 Updated: guest-templates-json-data-linux-1.9.6-1.2.xcpng8.2.noarch
                                May 16 09:07:50 Updated: guest-templates-json-data-windows-1.9.6-1.2.xcpng8.2.noarch
                                May 16 09:07:51 Updated: sudo-1.8.23-10.el7_9.3.x86_64
                                May 16 09:08:01 Updated: linux-firmware-20190314-5.1.xcpng8.2.noarch
                                May 16 09:08:03 Updated: 2:microcode_ctl-2.1-26.xs23.1.xcpng8.2.x86_64
                                May 29 06:57:47 Updated: xen-libs-4.13.5-9.31.1.xcpng8.2.x86_64
                                May 29 06:57:48 Updated: xcp-ng-release-presets-8.2.1-9.x86_64
                                May 29 06:57:49 Updated: message-switch-1.23.2-4.1.xcpng8.2.x86_64
                                May 29 06:57:50 Updated: forkexecd-1.18.1-2.1.xcpng8.2.x86_64
                                May 29 06:57:50 Updated: xen-hypervisor-4.13.5-9.31.1.xcpng8.2.x86_64
                                May 29 06:57:51 Updated: xen-dom0-libs-4.13.5-9.31.1.xcpng8.2.x86_64
                                May 29 06:57:56 Updated: 2:qemu-4.2.1-4.6.3.1.xcpng8.2.x86_64
                                May 29 06:58:00 Updated: xen-tools-4.13.5-9.31.1.xcpng8.2.x86_64
                                May 29 06:58:01 Updated: xen-dom0-tools-4.13.5-9.31.1.xcpng8.2.x86_64
                                May 29 06:58:03 Updated: xenopsd-0.150.14-1.1.xcpng8.2.x86_64
                                May 29 06:58:03 Updated: xenopsd-cli-0.150.14-1.1.xcpng8.2.x86_64
                                May 29 06:58:05 Updated: xenopsd-xc-0.150.14-1.1.xcpng8.2.x86_64
                                May 29 06:58:06 Updated: gpumon-0.18.0-4.3.xcpng8.2.x86_64
                                May 29 06:58:06 Updated: xcp-rrdd-1.33.2-1.1.xcpng8.2.x86_64
                                May 29 06:58:08 Updated: rrdd-plugins-1.10.8-5.2.xcpng8.2.x86_64
                                May 29 06:58:09 Updated: xapi-tests-1.249.28-1.2.xcpng8.2.x86_64
                                May 29 06:58:13 Updated: xapi-core-1.249.28-1.2.xcpng8.2.x86_64
                                May 29 06:58:16 Updated: sm-2.30.8-2.1.xcpng8.2.x86_64
                                May 29 06:58:20 Updated: xcp-ng-release-config-8.2.1-9.x86_64
                                May 29 06:58:21 Updated: xcp-ng-release-8.2.1-9.x86_64
                                May 29 06:58:22 Updated: 2:microcode_ctl-2.1-26.xs25.1.xcpng8.2.x86_64
                                May 29 06:58:28 Updated: linux-firmware-20190314-7.1.xcpng8.2.noarch
                                May 29 06:58:33 Updated: xapi-xe-1.249.28-1.2.xcpng8.2.x86_64
                                May 29 06:58:34 Updated: varstored-guard-0.6.2-2.xcpng8.2.x86_64
                                May 29 06:58:35 Updated: xcp-networkd-0.56.2-2.xcpng8.2.x86_64
                                May 29 06:58:36 Updated: sm-rawhba-2.30.8-2.1.xcpng8.2.x86_64
                                Jul 28 10:10:40 Updated: xen-libs-4.13.5-9.34.1.xcpng8.2.x86_64
                                Jul 28 10:10:41 Updated: xen-hypervisor-4.13.5-9.34.1.xcpng8.2.x86_64
                                Jul 28 10:10:42 Updated: xen-dom0-libs-4.13.5-9.34.1.xcpng8.2.x86_64
                                Jul 28 10:10:42 Updated: xen-tools-4.13.5-9.34.1.xcpng8.2.x86_64
                                Jul 28 10:10:44 Updated: xen-dom0-tools-4.13.5-9.34.1.xcpng8.2.x86_64
                                Jul 28 10:10:54 Updated: linux-firmware-20190314-8.1.xcpng8.2.noarch
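A rough sketch of how entries like the ones above could be filtered against a boot time, to list packages updated since the host last started. The yum log lines carry no year, so the `year` argument and the cutoff date in the example are assumptions made for illustration.

```python
from datetime import datetime
from typing import List, Tuple


def parse_yum_line(line: str, year: int) -> Tuple[datetime, str]:
    """Split 'May 16 09:07:40 Updated: pkg' into (timestamp, package)."""
    stamp, _, package = line.strip().partition(" Updated: ")
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, package


def updated_since(lines: List[str], boot: datetime, year: int) -> List[str]:
    """Return packages whose update timestamp is later than `boot`, in log order."""
    result = []
    for line in lines:
        if " Updated: " not in line:
            continue  # skip anything that is not an update entry
        when, package = parse_yum_line(line, year)
        if when > boot:
            result.append(package)
    return result


# Three lines copied from the listing above.
log = [
    "May 16 09:07:41 Updated: xen-hypervisor-4.13.5-9.30.3.xcpng8.2.x86_64",
    "Jul 28 10:10:41 Updated: xen-hypervisor-4.13.5-9.34.1.xcpng8.2.x86_64",
    "Jul 28 10:10:54 Updated: linux-firmware-20190314-8.1.xcpng8.2.noarch",
]
# Assumed for illustration: the log year, and a cutoff between the two batches.
pending = updated_since(log, datetime(2023, 6, 1), year=2023)
print(pending)
```

Anything returned here was installed on disk but only takes effect after a reboot, which is why the first restart after a long uptime can activate several months of hypervisor, microcode and firmware changes at once.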
                                

                                Yesterday morning, between 5:30 and 5:50, I restarted all the servers (Zenbleed patch). Since then I have had random reboots on all 4 servers.

                                server1: 2x AMD EPYC 7282, ASUS Mainboard

                                reboot   system boot  4.19.0+1         Thu Aug  3 10:57 - 13:25 (1+02:27)   
                                reboot   system boot  4.19.0+1         Thu Aug  3 07:33 - 13:25 (1+05:51)
                                reboot   system boot  4.19.0+1         Thu Aug  3 05:57 - 13:25 (1+07:27)   
                                reboot   system boot  4.19.0+1         Thu Aug  3 05:36 - 13:25 (1+07:48)
                                

                                server2: 2x AMD EPYC 7282, ASUS Mainboard

                                reboot   system boot  4.19.0+1         Fri Aug  4 13:07 - 13:25  (00:18)    
                                reboot   system boot  4.19.0+1         Fri Aug  4 00:21 - 13:25  (13:04)    
                                reboot   system boot  4.19.0+1         Thu Aug  3 07:51 - 13:25 (1+05:34)
                                reboot   system boot  4.19.0+1         Thu Aug  3 05:55 - 13:25 (1+07:30)   
                                

                                Server3: 2x AMD EPYC 7282, Supermicro Mainboard

                                reboot   system boot  4.19.0+1         Fri Aug  4 13:07 - 13:14  (00:06)    
                                reboot   system boot  4.19.0+1         Fri Aug  4 00:21 - 13:14  (12:53)    
                                reboot   system boot  4.19.0+1         Thu Aug  3 07:51 - 13:14 (1+05:23)   
                                reboot   system boot  4.19.0+1         Thu Aug  3 05:55 - 13:14 (1+07:19)
                                

                                server4: 2x AMD EPYC 7282, Supermicro Mainboard

                                reboot   system boot  4.19.0+1         Fri Aug  4 00:33 - 13:26  (12:52)    
                                reboot   system boot  4.19.0+1         Thu Aug  3 05:46 - 13:26 (1+07:40)
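To make the reboot pattern easier to compare across hosts, here is a small sketch that turns listings like the ones above into the time between consecutive boots. The output shown in the post has no year, so `year=2023` is an assumption, and the regular expression is written only against the line layout visible here.

```python
import re
from datetime import datetime, timedelta
from typing import List

# Matches e.g. "reboot   system boot  4.19.0+1   Thu Aug  3 10:57 - 13:25 (1+02:27)"
BOOT_RE = re.compile(
    r"^reboot\s+system boot\s+\S+\s+\w{3}\s+(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2})"
)


def boot_times(last_output: str, year: int) -> List[datetime]:
    """Extract boot timestamps from the listing, oldest first."""
    times = []
    for line in last_output.splitlines():
        match = BOOT_RE.match(line.strip())
        if match:
            month, day, clock = match.groups()
            times.append(
                datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M")
            )
    return sorted(times)


def uptime_gaps(times: List[datetime]) -> List[timedelta]:
    """Time elapsed between each boot and the next one."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# The server1 listing from the post.
server1 = """\
reboot   system boot  4.19.0+1         Thu Aug  3 10:57 - 13:25 (1+02:27)
reboot   system boot  4.19.0+1         Thu Aug  3 07:33 - 13:25 (1+05:51)
reboot   system boot  4.19.0+1         Thu Aug  3 05:57 - 13:25 (1+07:27)
reboot   system boot  4.19.0+1         Thu Aug  3 05:36 - 13:25 (1+07:48)
"""
gaps = uptime_gaps(boot_times(server1, year=2023))
for gap in gaps:
    print(gap)
```

Running the same function over each host's listing gives a quick way to see whether the intervals are regular, which would hint at a scheduled trigger, or irregular.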
                                

                                What can I provide to help solve the problem?

                                Hardware issues ruled out, power supply also OK (2 power supplies, 2 independent outlets).

                                In /var/crash I have an old file:

                                ls -al /var/crash/
                                -rw-r--r--  1 root root 67108864 2022-12-28  .sacrificial-space-for-logs
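A minimal sketch of that check, assuming the placeholder file seen in the listing is the only entry to ignore. The demonstration runs against a temporary directory standing in for /var/crash, and the dump directory name is hypothetical.

```python
import os
import tempfile
from typing import List

PLACEHOLDER = ".sacrificial-space-for-logs"  # the only file in the listing above


def crash_entries(crash_dir: str) -> List[str]:
    """List entries in crash_dir other than the placeholder file."""
    return sorted(name for name in os.listdir(crash_dir) if name != PLACEHOLDER)


# Demonstration on a temporary directory standing in for /var/crash.
with tempfile.TemporaryDirectory() as demo:
    open(os.path.join(demo, PLACEHOLDER), "w").close()
    empty_case = crash_entries(demo)
    os.mkdir(os.path.join(demo, "hypothetical-dump"))  # hypothetical name
    dump_case = crash_entries(demo)

print(empty_case, dump_case)
```

An empty result, as in the listing above, means no crash dump was written for any of the reboots.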
                                

                                When one server restarted, I caught it, and it was a full machine restart, including POST/BIOS.

                                Please help

                                • Danp Pro Support Team @Chmura

                                  @Chmura There's a pending fix for a problem with the zenbleed patch. You may want to test it out to see if it resolves your rebooting issue. See here for more details.

                                  • Chmura @Chmura

                                    @Danp said in XCP-ng host restarts at random intervals:

                                    @Chmura There's a pending fix for a problem with the zenbleed patch. You may want to test it out to see if it resolves your rebooting issue. See here for more details.

                                    Thanks for the fast reply.

                                    For now, as a test, I downgraded this package on server3:

                                    yum downgrade linux-firmware-20190314-5.1.xcpng8.2.noarch
                                    

                                    And I will test stability.

                                    On server4 I downgraded all packages to their state as of 27.12.2022:

                                    xen-libs-4.13.4-9.28.1.xcpng8.2.x86_64
                                    message-switch-1.23.2-3.2.xcpng8.2.x86_64
                                    forkexecd-1.18.1-1.1.xcpng8.2.x86_64
                                    vhd-tool-0.43.0-4.1.xcpng8.2.x86_64
                                    1:xs-openssl-libs-1.1.1k-6.1.xcpng8.2.x86_64
                                    xen-hypervisor-4.13.4-9.28.1.xcpng8.2.x86_64
                                    xen-dom0-libs-4.13.4-9.28.1.xcpng8.2.x86_64
                                    2:qemu-4.2.1-4.6.2.1.xcpng8.2.x86_64
                                    xen-tools-4.13.4-9.28.1.xcpng8.2.x86_64
                                    edk2-20180522git4b8552d-1.4.6.xcpng8.2.x86_64
                                    xen-dom0-tools-4.13.4-9.28.1.xcpng8.2.x86_64
                                    xenopsd-0.150.12-1.2.xcpng8.2.x86_64
                                    xenopsd-xc-0.150.12-1.2.xcpng8.2.x86_64
                                    xenopsd-cli-0.150.12-1.2.xcpng8.2.x86_64
                                    xcp-rrdd-1.33.0-6.1.xcpng8.2.x86_64
                                    squeezed-0.27.0-5.xcpng8.2.x86_64
                                    rrdd-plugins-1.10.8-5.1.xcpng8.2.x86_64
                                    gpumon-0.18.0-4.2.xcpng8.2.x86_64
                                    xapi-tests-1.249.26-2.1.xcpng8.2.x86_64
                                    blktap-3.37.4-1.0.1.xcpng8.2.x86_64
                                    xapi-core-1.249.26-2.1.xcpng8.2.x86_64
                                    2:microcode_ctl-2.1-26.xs23.xcpng8.2.x86_64
                                    sm-rawhba-2.30.7-1.3.xcpng8.2.x86_64
                                    rrd2csv-1.2.5-7.1.xcpng8.2.x86_64
                                    kernel-4.19.19-7.0.15.1.xcpng8.2.x86_64
                                    xapi-xe-1.249.26-2.1.xcpng8.2.x86_64
                                    xcp-networkd-0.56.2-1.xcpng8.2.x86_64
                                    openvswitch-2.5.3-2.3.12.1.xcpng8.2.x86_64
                                    xapi-storage-script-0.34.1-2.1.xcpng8.2.x86_64
                                    varstored-guard-0.6.2-1.xcpng8.2.x86_64
                                    sm-2.30.7-1.3.xcpng8.2.x86_64
                                    sm-cli-0.23.0-7.xcpng8.2.x86_64
                                    xcp-ng-xapi-plugins-1.7.2-1.xcpng8.2.noarch
                                    linux-firmware-20190314-5.xcpng8.2.noarch
                                    xapi-nbd-1.11.0-3.2.xcpng8.2.x86_64
                                    xcp-ng-pv-tools-8.2.0-11.xcpng8.2.noarch
                                    

                                    Now I will evacuate all VMs from server2 to server3/4 and check the microcode package from the xcp-ng-testing repo.
                                    We'll see what comes out when I use yum update "xen-*" --enablerepo=xcp-ng-testing
                                    Funny weekend 🙂

                                    Edit: Server3 restarted at 9 PM ;(
                                    Server4 and the updated Server2 (xen-... 4.13.5-9.35.1.xcpng8.2) are still working.

                                    • olivierlambert Vates 🪐 Co-Founder CEO

                                      We released new patches last Friday; double-check that you are fully up to date, then reboot 🙂

                                      • tjkreidl Ambassador @christopher-petzel

                                        @christopher-petzel Sometimes this can happen if the host and VMs do not use the same NTP server(s) or are not syncing properly with them, and therefore fail to keep their clocks synchronized. I'd check to make sure all are in sync.

                                        • Toni

                                          I also have the reboot problem.
                                          It only occurs for me when a USB hard drive is connected.
                                          If there is no hard drive connected to the USB interfaces, the system runs stably for months or years.

                                          Today I had to read data into a VM via USB hard drive, so I connected one.
                                          I haven't had any problems during the data transfer so far. The reboots only happen when the USB hard drive is no longer in use but is still connected.

                                          Please check whether something is connected via USB on the systems that have the reboot problem.

                                          I've had this problem on other systems too. But it never bothered me because I don't normally have anything connected to the USB ports.
                                          I also had the reboot problem in earlier versions of XCP-ng and XenServer, and it also occurred on the HP servers that I used before I switched to Supermicro.
                                          I have been using XenServer since version 5.0.


                                          Mainboard: Supermicro H11SSL-i BIOS 2.4
                                          CPU: Epyc 7551P

                                          • olivierlambert Vates 🪐 Co-Founder CEO

                                            @Toni said in XCP-ng host restarts at random intervals:

                                            Please check whether something is connected via USB on the systems that have the reboot problem.

                                            Hi! It's a community forum here 🙂 So it's a bit more up to you to demonstrate the bug by investigating a bit more and digging through the logs. Otherwise, if you want an investigation of your setup, pro support is a better fit. If you want to know which logs to check, take a look at https://docs.xcp-ng.org/troubleshooting/
