XCP-ng

    XCP-ng host restarts at random intervals

    • olivierlambert (Vates 🪐 Co-Founder CEO)

      The /var/crash folder might also be interesting (check the Dom0.log and the Xen log to see what is triggering the crash).
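      A minimal sketch of that check, assuming each crash is saved in its own subdirectory of /var/crash and that the logs are named xen.log and dom0.log (matched case-insensitively, since the post writes Dom0.log). The function name list_crash_logs is hypothetical; files sitting directly in /var/crash are ignored.

```shell
#!/bin/sh
# Hypothetical helper: list the Xen and dom0 logs left behind by each crash.
# Assumption: one subdirectory per crash under the crash directory, holding
# files named xen.log / dom0.log (any letter case).
list_crash_logs() {
    crash_dir=${1:-/var/crash}
    # -mindepth 2 skips files directly in the crash directory itself;
    # -maxdepth 2 stays inside the per-crash subdirectories.
    find "$crash_dir" -mindepth 2 -maxdepth 2 -type f \
        \( -iname xen.log -o -iname dom0.log \) | sort
}
```

      Running list_crash_logs with no argument searches /var/crash; an empty result means no crash logs were captured, which itself hints that the hypervisor did not record a crash before the reset.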

      • christopher-petzel

        I believe I have the definitive cause for this 'random host reboot' issue.

        After 6 months of problem-free operation, the host reboot issue has occurred again on this server. The host was running only Linux VMs, so the theory that Windows VMs on the host contribute to the reboot issue has been proven false. As every time before, nothing in any relevant log file indicates that the host is about to reboot. At this point I think I can say definitively that the reboots are caused by a faulty SuperMicro motherboard.

        I've learned my lesson: use HPE servers! This SuperMicro system will be melted down for scrap.

        • olivierlambert (Vates 🪐 Co-Founder CEO)

          Thanks for the feedback 🙂

          Well, at least keep us posted if you have the same issue with other hardware; we'll be happy to help 🙂

          • Chmura

            Hi @olivierlambert
            I now have the same problem on 4 servers. The machines reset every few hours! Please help.

            The machines had been running stably since:

            reboot system boot 4.19.0+1 Wed Dec 28 12:30 - 05:50 (217+16:19)
            

            Since then, the following updates had been installed, but the hosts had not been rebooted:

            May 16 09:07:40 Updated: xen-libs-4.13.5-9.30.3.xcpng8.2.x86_64
            May 16 09:07:41 Updated: guest-templates-json-1.9.6-1.2.xcpng8.2.noarch
            May 16 09:07:41 Updated: xcp-ng-release-presets-8.2.1-6.x86_64
            May 16 09:07:41 Updated: xen-hypervisor-4.13.5-9.30.3.xcpng8.2.x86_64
            May 16 09:07:42 Updated: xen-dom0-libs-4.13.5-9.30.3.xcpng8.2.x86_64
            May 16 09:07:43 Updated: xen-tools-4.13.5-9.30.3.xcpng8.2.x86_64
            May 16 09:07:44 Updated: xen-dom0-tools-4.13.5-9.30.3.xcpng8.2.x86_64
            May 16 09:07:48 Updated: xcp-ng-release-config-8.2.1-6.x86_64
            May 16 09:07:49 Updated: xcp-ng-release-8.2.1-6.x86_64
            May 16 09:07:49 Updated: guest-templates-json-data-other-1.9.6-1.2.xcpng8.2.noarch
            May 16 09:07:50 Updated: guest-templates-json-data-linux-1.9.6-1.2.xcpng8.2.noarch
            May 16 09:07:50 Updated: guest-templates-json-data-windows-1.9.6-1.2.xcpng8.2.noarch
            May 16 09:07:51 Updated: sudo-1.8.23-10.el7_9.3.x86_64
            May 16 09:08:01 Updated: linux-firmware-20190314-5.1.xcpng8.2.noarch
            May 16 09:08:03 Updated: 2:microcode_ctl-2.1-26.xs23.1.xcpng8.2.x86_64
            May 29 06:57:47 Updated: xen-libs-4.13.5-9.31.1.xcpng8.2.x86_64
            May 29 06:57:48 Updated: xcp-ng-release-presets-8.2.1-9.x86_64
            May 29 06:57:49 Updated: message-switch-1.23.2-4.1.xcpng8.2.x86_64
            May 29 06:57:50 Updated: forkexecd-1.18.1-2.1.xcpng8.2.x86_64
            May 29 06:57:50 Updated: xen-hypervisor-4.13.5-9.31.1.xcpng8.2.x86_64
            May 29 06:57:51 Updated: xen-dom0-libs-4.13.5-9.31.1.xcpng8.2.x86_64
            May 29 06:57:56 Updated: 2:qemu-4.2.1-4.6.3.1.xcpng8.2.x86_64
            May 29 06:58:00 Updated: xen-tools-4.13.5-9.31.1.xcpng8.2.x86_64
            May 29 06:58:01 Updated: xen-dom0-tools-4.13.5-9.31.1.xcpng8.2.x86_64
            May 29 06:58:03 Updated: xenopsd-0.150.14-1.1.xcpng8.2.x86_64
            May 29 06:58:03 Updated: xenopsd-cli-0.150.14-1.1.xcpng8.2.x86_64
            May 29 06:58:05 Updated: xenopsd-xc-0.150.14-1.1.xcpng8.2.x86_64
            May 29 06:58:06 Updated: gpumon-0.18.0-4.3.xcpng8.2.x86_64
            May 29 06:58:06 Updated: xcp-rrdd-1.33.2-1.1.xcpng8.2.x86_64
            May 29 06:58:08 Updated: rrdd-plugins-1.10.8-5.2.xcpng8.2.x86_64
            May 29 06:58:09 Updated: xapi-tests-1.249.28-1.2.xcpng8.2.x86_64
            May 29 06:58:13 Updated: xapi-core-1.249.28-1.2.xcpng8.2.x86_64
            May 29 06:58:16 Updated: sm-2.30.8-2.1.xcpng8.2.x86_64
            May 29 06:58:20 Updated: xcp-ng-release-config-8.2.1-9.x86_64
            May 29 06:58:21 Updated: xcp-ng-release-8.2.1-9.x86_64
            May 29 06:58:22 Updated: 2:microcode_ctl-2.1-26.xs25.1.xcpng8.2.x86_64
            May 29 06:58:28 Updated: linux-firmware-20190314-7.1.xcpng8.2.noarch
            May 29 06:58:33 Updated: xapi-xe-1.249.28-1.2.xcpng8.2.x86_64
            May 29 06:58:34 Updated: varstored-guard-0.6.2-2.xcpng8.2.x86_64
            May 29 06:58:35 Updated: xcp-networkd-0.56.2-2.xcpng8.2.x86_64
            May 29 06:58:36 Updated: sm-rawhba-2.30.8-2.1.xcpng8.2.x86_64
            Jul 28 10:10:40 Updated: xen-libs-4.13.5-9.34.1.xcpng8.2.x86_64
            Jul 28 10:10:41 Updated: xen-hypervisor-4.13.5-9.34.1.xcpng8.2.x86_64
            Jul 28 10:10:42 Updated: xen-dom0-libs-4.13.5-9.34.1.xcpng8.2.x86_64
            Jul 28 10:10:42 Updated: xen-tools-4.13.5-9.34.1.xcpng8.2.x86_64
            Jul 28 10:10:44 Updated: xen-dom0-tools-4.13.5-9.34.1.xcpng8.2.x86_64
            Jul 28 10:10:54 Updated: linux-firmware-20190314-8.1.xcpng8.2.noarch
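            One way to narrow this down is to pick out which of these updates only take effect after a host reboot, since those became active only when the hosts were restarted. A minimal sketch that reads yum.log-style lines (like the ones above) on standard input; the function name is hypothetical, and the package pattern (Xen hypervisor, kernel, CPU microcode, firmware) is an illustrative assumption, not an official list.

```shell
#!/bin/sh
# Hypothetical helper: print updated/installed packages that normally need a host reboot.
# Input: yum.log-style lines such as
#   "May 16 09:07:41 Updated: xen-hypervisor-4.13.5-9.30.3.xcpng8.2.x86_64"
# Assumption: the package-name pattern below is illustrative, not exhaustive.
# Field 4 is the action, field 5 the package; the grep pattern allows an
# optional "N:" epoch prefix such as "2:microcode_ctl-...".
reboot_relevant_updates() {
    awk '$4 == "Updated:" || $4 == "Installed:" { print $5 }' |
        grep -E '^([0-9]+:)?(xen-hypervisor|kernel|microcode_ctl|linux-firmware)-'
}
```

            Usage on a host: reboot_relevant_updates < /var/log/yum.log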
            

            Yesterday morning, between 5:30 and 5:50, I restarted all the servers (for the Zenbleed patch). Since then, I have had random reboots on all 4 servers.

            server1: 2x AMD EPYC 7282, ASUS Mainboard

            reboot   system boot  4.19.0+1         Thu Aug  3 10:57 - 13:25 (1+02:27)   
            reboot   system boot  4.19.0+1         Thu Aug  3 07:33 - 13:25 (1+05:51)
            reboot   system boot  4.19.0+1         Thu Aug  3 05:57 - 13:25 (1+07:27)   
            reboot   system boot  4.19.0+1         Thu Aug  3 05:36 - 13:25 (1+07:48)
            

            server2: 2x AMD EPYC 7282, ASUS Mainboard

            reboot   system boot  4.19.0+1         Fri Aug  4 13:07 - 13:25  (00:18)    
            reboot   system boot  4.19.0+1         Fri Aug  4 00:21 - 13:25  (13:04)    
            reboot   system boot  4.19.0+1         Thu Aug  3 07:51 - 13:25 (1+05:34)
            reboot   system boot  4.19.0+1         Thu Aug  3 05:55 - 13:25 (1+07:30)   
            

            server3: 2x AMD EPYC 7282, Supermicro Mainboard

            reboot   system boot  4.19.0+1         Fri Aug  4 13:07 - 13:14  (00:06)    
            reboot   system boot  4.19.0+1         Fri Aug  4 00:21 - 13:14  (12:53)    
            reboot   system boot  4.19.0+1         Thu Aug  3 07:51 - 13:14 (1+05:23)   
            reboot   system boot  4.19.0+1         Thu Aug  3 05:55 - 13:14 (1+07:19)
            

            server4: 2x AMD EPYC 7282, Supermicro Mainboard

            reboot   system boot  4.19.0+1         Fri Aug  4 00:33 - 13:26  (12:52)    
            reboot   system boot  4.19.0+1         Thu Aug  3 05:46 - 13:26 (1+07:40)
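            To compare the hosts at a glance, boot lines like these can be counted per day. A minimal sketch, assuming the default column layout of last's output as pasted above (reboot, system, boot, kernel, weekday, month, day, time); the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count boots per calendar day from `last reboot` output.
# Assumption: columns are "reboot system boot <kernel> <weekday> <month> <day> <time> ...",
# and lines arrive newest-first, so entries for the same day are adjacent
# (uniq -c only merges adjacent duplicates).
boots_per_day() {
    awk '$1 == "reboot" && $2 == "system" { print $6, $7 }' | uniq -c
}
```

            Usage on a host: last reboot | boots_per_day. For server1's four lines above, this reports 4 boots on Aug 3.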
            

            What information can I provide to help solve the problem?

            Hardware issues have been ruled out, and the power supply is also OK (2 power supplies on 2 independent outlets).

            In /var/crash I only have an old file:

            ls -al /var/crash/
            -rw-r--r--  1 root root 67108864 2022-12-28  .sacrificial-space-for-logs
            

            When one server restarted, I caught it happening: it was a full machine restart, going through the BIOS POST.

            Please help

            • Danp (Pro Support Team), replying to @Chmura

              @Chmura There's a pending fix for a problem with the zenbleed patch. You may want to test it out to see if it resolves your rebooting issue. See here for more details.

              • Chmura, replying to @Chmura

                @Danp said in XCP-ng host restarts at random intervals:

                @Chmura There's a pending fix for a problem with the zenbleed patch. You may want to test it out to see if it resolves your rebooting issue. See here for more details.

                Thanks for the fast reply.

                As a test, on server3 I downgraded this package:

                yum downgrade linux-firmware-20190314-5.1.xcpng8.2.noarch
                

                Now I will test its stability.

                On server4, I downgraded all packages to their 27.12.2022 state:

                xen-libs-4.13.4-9.28.1.xcpng8.2.x86_64
                message-switch-1.23.2-3.2.xcpng8.2.x86_64
                forkexecd-1.18.1-1.1.xcpng8.2.x86_64
                vhd-tool-0.43.0-4.1.xcpng8.2.x86_64
                1:xs-openssl-libs-1.1.1k-6.1.xcpng8.2.x86_64
                xen-hypervisor-4.13.4-9.28.1.xcpng8.2.x86_64
                xen-dom0-libs-4.13.4-9.28.1.xcpng8.2.x86_64
                2:qemu-4.2.1-4.6.2.1.xcpng8.2.x86_64
                xen-tools-4.13.4-9.28.1.xcpng8.2.x86_64
                edk2-20180522git4b8552d-1.4.6.xcpng8.2.x86_64
                xen-dom0-tools-4.13.4-9.28.1.xcpng8.2.x86_64
                xenopsd-0.150.12-1.2.xcpng8.2.x86_64
                xenopsd-xc-0.150.12-1.2.xcpng8.2.x86_64
                xenopsd-cli-0.150.12-1.2.xcpng8.2.x86_64
                xcp-rrdd-1.33.0-6.1.xcpng8.2.x86_64
                squeezed-0.27.0-5.xcpng8.2.x86_64
                rrdd-plugins-1.10.8-5.1.xcpng8.2.x86_64
                gpumon-0.18.0-4.2.xcpng8.2.x86_64
                xapi-tests-1.249.26-2.1.xcpng8.2.x86_64
                blktap-3.37.4-1.0.1.xcpng8.2.x86_64
                xapi-core-1.249.26-2.1.xcpng8.2.x86_64
                2:microcode_ctl-2.1-26.xs23.xcpng8.2.x86_64
                sm-rawhba-2.30.7-1.3.xcpng8.2.x86_64
                rrd2csv-1.2.5-7.1.xcpng8.2.x86_64
                kernel-4.19.19-7.0.15.1.xcpng8.2.x86_64
                xapi-xe-1.249.26-2.1.xcpng8.2.x86_64
                xcp-networkd-0.56.2-1.xcpng8.2.x86_64
                openvswitch-2.5.3-2.3.12.1.xcpng8.2.x86_64
                xapi-storage-script-0.34.1-2.1.xcpng8.2.x86_64
                varstored-guard-0.6.2-1.xcpng8.2.x86_64
                sm-2.30.7-1.3.xcpng8.2.x86_64
                sm-cli-0.23.0-7.xcpng8.2.x86_64
                xcp-ng-xapi-plugins-1.7.2-1.xcpng8.2.noarch
                linux-firmware-20190314-5.xcpng8.2.noarch
                xapi-nbd-1.11.0-3.2.xcpng8.2.x86_64
                xcp-ng-pv-tools-8.2.0-11.xcpng8.2.noarch
                

                Now I will evacuate all VMs from server2 to server3/4 and test the microcode package from the xcp-ng-testing repo.
                We'll see what happens when I run yum update "xen-*" --enablerepo=xcp-ng-testing
                Fun weekend 🙂

                Edit: server3 restarted at 9 PM ;(
                server4 and the updated server2 (xen-... 4.13.5-9.35.1.xcpng8.2) are still running.

                • olivierlambert (Vates 🪐 Co-Founder CEO)

                  We released new patches last Friday; double-check that you are fully up to date, then reboot 🙂

                  • tjkreidl (Ambassador), replying to @christopher-petzel

                    @christopher-petzel Sometimes this can happen if the host and VMs do not use the same NTP server(s), or are not syncing properly with them, and therefore fail to keep their clocks properly synchronized. I'd check to make sure they are all in sync.
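                    On the host and on any Linux VM running systemd, timedatectl status gives a quick answer. A minimal sketch that reads that output on standard input and succeeds only when the clock reports as synchronized; the function name is hypothetical, and it assumes the two labels used by different systemd versions ("NTP synchronized:" on older releases, "System clock synchronized:" on newer ones).

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if `timedatectl status` output says the clock is synced.
# Assumption: older systemd prints "NTP synchronized: yes",
# newer systemd prints "System clock synchronized: yes".
clock_is_synced() {
    grep -Eq '^[[:space:]]*(NTP|System clock) synchronized:[[:space:]]*yes'
}
```

                    Usage: timedatectl status | clock_is_synced && echo synced || echo "NOT synced". This only checks that each machine is synchronized; confirming that they all use the same NTP server(s) still means comparing their time-daemon configurations.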

                    • Toni

                      I also have the reboot problem.
                      For me, it only occurs when a USB hard drive is connected.
                      If no hard drive is connected to the USB ports, the system runs stably for months or years.

                      Today I had to load data into a VM from a USB hard drive, so I connected one.
                      So far I haven't had any problems during the data transfer; the problem only appears when the USB hard drive is still connected but no longer in use.

                      Please check whether something is connected via USB on the systems that have the reboot problem.
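                      A quick way to see what is attached, without relying on lsusb being installed, is to read the product names from sysfs. A minimal sketch, assuming the standard Linux sysfs layout where USB device directories under /sys/bus/usb/devices carry a product file; the function name is hypothetical, and root hubs (host controllers) are listed alongside real devices.

```shell
#!/bin/sh
# Hypothetical helper: print "<sysfs device>: <product name>" for each USB device.
# Assumption: standard Linux sysfs, where device directories under
# /sys/bus/usb/devices expose a "product" file (interface directories do not).
list_usb_products() {
    usb_root=${1:-/sys/bus/usb/devices}
    for product in "$usb_root"/*/product; do
        # If nothing matches, the glob stays literal; skip it.
        [ -f "$product" ] || continue
        dev_dir=${product%/product}
        printf '%s: %s\n' "${dev_dir##*/}" "$(cat "$product")"
    done
}
```

                      Running list_usb_products on the host (dom0) with no argument lists everything currently plugged in, so it can be compared against the times of the reboots.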

                      I've had this problem on other systems too, but it never bothered me because I don't normally have anything connected to the USB ports.
                      I also had the reboot problem with earlier versions of XCP-ng or XenServer, and it also occurred on the HP servers I used before I switched to Supermicro.
                      I have been using XenServer since version 5.0.


                      Mainboard: Supermicro H11SSL-i BIOS 2.4
                      CPU: Epyc 7551P

                      • olivierlambert (Vates 🪐 Co-Founder CEO)

                        @Toni said in XCP-ng host restarts at random intervals:

                        Please check whether something is connected via USB on the systems that have the reboot problem.

                        Hi! This is a community forum 🙂 so it's more up to you to demonstrate the bug by investigating further and digging into the logs. If you want someone to investigate your setup, pro support is a better fit. To see which logs to check, take a look at https://docs.xcp-ng.org/troubleshooting/
