    Rolling Pool Update - host took too long to restart

      dsiminiuk

      Rolling pool updates fail because the master is taking too long to restart.
      What is considered too long?
      Perhaps this should be a setting in the server config to override a default.
      In this case, my servers take about 20 minutes to reboot.
      Is there a config item I can adjust?

      pool.rollingUpdate
      {
        "pool": "3cfffa75-69ea-7792-a320-92a7cb33f6f8"
      }
      {
        "message": "Host b725c95c-17af-41ae-a9c5-deeb1b7bfc50 took too long to restart",
        "name": "Error",
        "stack": "Error: Host b725c95c-17af-41ae-a9c5-deeb1b7bfc50 took too long to restart
          at Xapi.rollingPoolReboot (file:///opt/xo/xo-builds/xen-orchestra-202404111938/packages/xo-server/src/xapi/mixins/pool.mjs:127:9)
          at Xapi.rollingPoolUpdate (file:///opt/xo/xo-builds/xen-orchestra-202404111938/packages/xo-server/src/xapi/mixins/patching.mjs:506:5)
          at XenServers.rollingPoolUpdate (file:///opt/xo/xo-builds/xen-orchestra-202404111938/packages/xo-server/src/xo-mixins/xen-servers.mjs:689:5)
          at Xo.rollingUpdate (file:///opt/xo/xo-builds/xen-orchestra-202404111938/packages/xo-server/src/api/pool.mjs:231:3)
          at Api.#callApiMethod (file:///opt/xo/xo-builds/xen-orchestra-202404111938/packages/xo-server/src/xo-mixins/api.mjs:366:20)"
      }
      
        Andrew Top contributor @dsiminiuk

        @dsiminiuk I think it's about 15 minutes... I don't see it being adjustable in XO.

        I thought my HP G8 servers were slow to boot at 10 minutes....

        What would you recommend as a timeout? 30 minutes?

          olivierlambert Vates 🪐 Co-Founder CEO

          20 minutes to reboot? Wow 😬 Do you know what is taking so much time? We can change the default obviously.

            DustinB @olivierlambert

            @olivierlambert said in Rolling Pool Update - host took too long to restart:

            20 minutes to reboot? Wow 😬 Do you know what is taking so much time? We can change the default obviously.

            There is likely a faulty disk involved here that simply isn't known about yet.

            I would look further at the host as a whole before making changes that would impact everyone else.

              planedrop Top contributor

              That is a really long reboot time. I'd investigate why it's taking so long: any even remotely modern server should take about 5 minutes, even with a pretty long POST. Even my super old Supermicro storage server from around 2015 boots in less than 5 minutes.

                dsiminiuk @DustinB

                @DustinB Not a faulty disk. It appears to be memory testing at boot time, and the same thing happens at other times after init.

                The cluster is a pair of HPE ProLiant DL580 Gen9 servers, each with 2TB of RAM.

                Yes, I could turn off memory checking during startup, but I'd rather not.

                Danny

                  olivierlambert Vates 🪐 Co-Founder CEO

                  Ping @pdonias: what value do we have right now? How about raising it even higher?

                    pdonias Vates 🪐 XO Team @olivierlambert

                    @olivierlambert By default, it's 20 minutes. And it's already configurable through xo-server's config by adding:

                    [xapiOptions]
                    restartHostTimeout = '40 minutes'
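
For readers wondering what "took too long" means mechanically, here is a minimal sketch of a restart wait bounded by a timeout. It is written in Python rather than xo-server's JavaScript and is not xo-server's actual implementation. The names `parse_duration` and `wait_for_host` are hypothetical, and the duration parsing only covers simple forms such as '40 minutes'; xo-server's own parser may accept more.

```python
import time
from typing import Callable

# Assumed unit table for this sketch only; not taken from xo-server.
_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def parse_duration(text: str) -> int:
    """Convert a string such as '40 minutes' into a number of seconds."""
    value, unit = text.split()
    # Accept both singular and plural unit names ('minute', 'minutes').
    return int(value) * _UNIT_SECONDS[unit.rstrip("s")]


def wait_for_host(
    is_up: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll is_up() until it returns True or timeout_s seconds have elapsed."""
    deadline = clock() + timeout_s
    while not is_up():
        if clock() >= deadline:
            # Mirrors the kind of error reported in the first post.
            raise TimeoutError("host took too long to restart")
        sleep(poll_s)
```

With a 20-minute limit and a host that needs slightly more than 20 minutes to come back, a loop like this raises just before the host returns, which would explain why raising the limit (for example to `parse_duration('40 minutes')`, i.e. 2400 seconds) lets the update proceed.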
                    
                      Tristis Oris Top contributor

                      I've hit that issue too. Sometimes a server restart takes longer than usual, so the rolling update is canceled by the timeout.
                      Why does it take so long? I don't know. Maybe some startup checks. I can't restart production just to look for a difference.

                      Is the timeout really required?

                        nikade Top contributor

                        Our Dell R630s with 512 GB of RAM also take a while to reboot, so yes, being able to adjust the value is great.

                          nikade Top contributor @Tristis Oris

                          @Tristis-Oris said in Rolling Pool Update - host took too long to restart:

                          I've hit that issue too. Sometimes a server restart takes longer than usual, so the rolling update is canceled by the timeout.
                          Why does it take so long? I don't know. Maybe some startup checks. I can't restart production just to look for a difference.

                          Is the timeout really required?

                          If they have ECC memory, the server will check the memory, collect diagnostics and so on at boot. That is pretty common on enterprise servers.

                            olivierlambert Vates 🪐 Co-Founder CEO

                            Thanks @pdonias, I forgot about this 🙂 I didn't check the doc: have we documented that too?

                              pdonias Vates 🪐 XO Team @olivierlambert

                              @olivierlambert It doesn't look like we did. It's documented in the config file but we can add it to the RPU doc too if necessary.

                                olivierlambert Vates 🪐 Co-Founder CEO

                                Let's do that then, this will reduce a potential thread or two in here 🙂

                                  dsiminiuk @olivierlambert

                                  @olivierlambert I've made the needed adjustment in the build script to override the default. Now I wait for another set of patches to test it.
                                  Thanks all.

                                    Tristis Oris Top contributor

                                    I just installed the latest updates, and the rolling update was again canceled by the timeout. Since that never happened before, I think it began after some updates about 2-3 months ago.

                                      olivierlambert Vates 🪐 Co-Founder CEO

                                      It's hard to give an answer because we are not inside your infrastructure. How long did your host take to reboot in the end?

                                        Tristis Oris Top contributor @olivierlambert

                                        @olivierlambert According to monitoring, it took 10 minutes. >.<
                                        Maybe some disabled VMs were started after the reboot, so there was not enough memory for the rolling update.

                                        But the previous time, the reboot really did take very long.

                                        I see a lack of logs here. Nothing tells me that the rolling update was canceled.

                                          olivierlambert Vates 🪐 Co-Founder CEO

                                          We are introducing an XO task to monitor the RPU process. That will make it easier to track the whole process 🙂

                                            Tristis Oris Top contributor

                                            Next pool: almost empty, enough memory for the rolling update, and the reboot takes 5 minutes.
                                            The 2nd host was not updated.

                                            [Attached screenshot: 012f7325-f92b-403a-bd4b-9c665f2ac7fc-изображение.png]
