XCP-ng

    XO server loses pool and hosts momentarily, timeout error

      felibb

      Same issue with the latest commit. Hunting for a commit that may or may not work is a wild goose chase, and I don't really have the time for it, especially since I agree it is hard to tell where the fault lies. XCP-ng could easily be the culprit here; I hope I didn't imply that XO has to be at fault. I just didn't see any errors in /var/log/xensource.log, but maybe I wasn't looking in the right place. I was mostly hoping for some debugging hints I hadn't thought of myself.
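One way to look for relevant entries in /var/log/xensource.log is to filter it for lines that mention errors or timeouts around the moment XO loses the pool. The sketch below does that in Python. The keyword list and the helper name `suspicious_lines` are assumptions made for illustration; nothing here comes from XAPI's documentation, and the log format itself is not parsed.

```python
import re
from typing import Iterator

# Assumed keywords; adjust them to whatever the failure actually logs.
KEYWORDS = re.compile(r"error|timeout|exception", re.IGNORECASE)


def suspicious_lines(path: str) -> Iterator[str]:
    """Yield every line of the file at `path` that matches KEYWORDS."""
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if KEYWORDS.search(line):
                yield line.rstrip("\n")


# Example (run on the host, as a user allowed to read the log):
# for hit in suspicious_lines("/var/log/xensource.log"):
#     print(hit)
```

Comparing the timestamps of the matching lines with the time of the timeout shown in XO is probably more useful than reading the matches in isolation.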

        olivierlambert Vates 🪐 Co-Founder CEO

        Sadly, since we can't reproduce it, it would be very helpful if you had time to try a few other commits and see whether the behavior changes. We have some ideas about what could cause this, so trying a commit from before we switched to "undici" as the HTTP library could be helpful. @julien-f might provide some commits to test 🙂

          felibb

          Okay, tried a few at random and narrowed it down to this:

          • 0ccfd4b / 2024.03.14: has timeouts
          • 18dea2f / 2024.02.08: does not have timeouts

          These two have about 100 commits between them. Any suggestions on how to narrow it down further?

            olivierlambert Vates 🪐 Co-Founder CEO

            git bisect between those 2 commits could be your friend 🙂 @julien-f explained it here: https://xcp-ng.org/forum/post/58981
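For readers who have not used it: `git bisect` performs a binary search over the commit range. After `git bisect start`, `git bisect bad <commit>` and `git bisect good <commit>`, git checks out a commit near the middle, you test it and mark it good or bad, and the range halves each time. The Python sketch below only simulates that search to show why roughly 100 commits need about 7 tests. The commit names and the position of the culprit are made up for illustration and are not real xen-orchestra history.

```python
from __future__ import annotations

from typing import Callable, Sequence


def bisect_first_bad(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> tuple[str, int]:
    """Binary-search for the first bad commit.

    `commits` is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. Returns (first bad commit, tests performed).
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad], tests


# Hypothetical history: a good endpoint, 100 commits in between, a bad endpoint.
commits = [f"c{i:03d}" for i in range(102)]
culprit = 57  # made-up position of the commit that introduced the timeouts

first_bad, tests = bisect_first_bad(commits, lambda c: int(c[1:]) >= culprit)
print(first_bad, tests)
```

Each test here stands for one build-and-observe cycle of XO, which is the expensive part; the search itself is trivial.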

              felibb

              @olivierlambert thanks for the tip. Looks like bfb8d3b29e4f9531dda368f6624652479682b69d is the culprit, and the comment mentions "http-request-plus → undici" which seems like what you referred to above. Some earlier commits had weird glitches like not displaying any VMs / any storage, but they did not time out.

                Andrew Top contributor @felibb

                @felibb There were some issues with undici that were resolved in a later commit 0794a63 (early April). It might be worth trying after that fix too.

                  olivierlambert Vates 🪐 Co-Founder CEO

                  Thanks @felibb for the feedback, this will indeed help @julien-f track it down. It's weird that we can't reproduce it here, but at least we now know it comes from undici.

                  The main question is why it happens to you and a few other people, but not to everyone else.

                  1. Can you try to use XOA in latest release channel in the same environment and see if you also have the issue?
                  2. Is your XO far away from the pool in terms of network latency?
                  3. Your OS is Debian 11, IDK if that could cause the problem (XOA is on Debian 12).

                  At the very least, let's start with 1: that should help us determine whether it's related to your environment alone or to something in XO's code interacting with your environment.
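To put a number on question 2, one rough approach is to time TCP connections from the XO machine to a pool host. The sketch below is a minimal example under stated assumptions: the function name `median_connect_ms` is made up, port 443 is assumed to be where the host's API is reachable from XO, and a TCP connect time is only an approximation of the latency XO actually experiences.

```python
import socket
import time
from statistics import median


def median_connect_ms(
    host: str, port: int = 443, samples: int = 5, timeout: float = 3.0
) -> float:
    """Return the median time, in milliseconds, to open a TCP connection."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # Open and immediately close a connection; only the handshake is timed.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


# Example, with a placeholder address to replace with a real pool host:
# print(median_connect_ms("192.0.2.10"))
```

Running it from the XO VM against each host in the pool, and again from the XOA VM, would show whether the two paths differ noticeably.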

                    julien-f Vates 🪐 Co-Founder XO Team @felibb

                    @felibb We've been unable to reproduce it so far; I'm waiting for someone else's confirmation before attempting to fix it on master.

                    If you can, please test the xen-api-blocking branch and let me know if that helps.

                      Tristis Oris Top contributor

                      This may sound like weird advice, but I had the same problem when an XO CR copy started up and caused an IP conflict with the main XO.

                        olivierlambert Vates 🪐 Co-Founder CEO

                        That's not weird, an IP conflict could also explain this issue.

                          Tristis Oris Top contributor @olivierlambert

                          @olivierlambert of course it could. I'd just advise checking it, just in case.

                            olivierlambert Vates 🪐 Co-Founder CEO

                            And it's good advice 🙂

                              felibb @Andrew

                              @Andrew said in XO server loses pool and hosts momentarily, timeout error:

                              some issues with undici that were resolved in a later commit 0794a63

                              Tried 79c9ef0 (1 day older than 0794a63), seeing timeouts.

                              @olivierlambert said in XO server loses pool and hosts momentarily, timeout error:

                              1. Can you try to use XOA in latest release channel in the same environment and see if you also have the issue?

                              Unsure I understand what you are referring to, can you please clarify?

                              2. Is your XO far away from the pool in terms of network latency?

                              I would expect the latency to be quite low: the XOA VM lives on the same pool and has an IP in the same subnet as the 10Gx2 bond interface on each host. However, this is not the same 1G network as the one marked with the blue "Management" bubble in the Host network tab; the two are different subnets. Can this have an effect?
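Whether an address sits in a given subnet can be checked quickly with Python's standard `ipaddress` module. The sketch below uses entirely hypothetical addressing (the thread does not give the real subnets) and a made-up helper name, `same_subnet`. If XO connects to the hosts on their management addresses while its only interface is in a different subnet, that traffic has to be routed rather than delivered directly, which is one thing worth ruling out.

```python
import ipaddress


def same_subnet(address: str, network: str) -> bool:
    """Return True if `address` belongs to `network` (given in CIDR notation)."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(network, strict=False)


# Hypothetical values, for illustration only.
xo_ip = "10.0.20.15"               # XO's interface on the 10G bond network
bond_net = "10.0.20.0/24"          # subnet of the 10Gx2 bond
management_net = "192.168.1.0/24"  # subnet of the 1G management network

print(same_subnet(xo_ip, bond_net))
print(same_subnet(xo_ip, management_net))
```

With these made-up values the first check succeeds and the second does not, which mirrors the situation described in the post.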

                              3. Your OS is Debian 11, IDK if that could cause the problem (XOA is on Debian 12).

                              dist-upgrade is fast and easy, I can definitely try that.

                              @julien-f said in XO server loses pool and hosts momentarily, timeout error:

                              If you can, please test the xen-api-blocking branch and let me know if that helps.

                              ce15ef6 deployed, seeing timeouts.

                                olivierlambert Vates 🪐 Co-Founder CEO

                                @felibb I'm talking about using our pre-baked/turnkey virtual appliance, which you can easily deploy from https://vates.tech/deploy

                                1. Register
                                2. Update and select "latest" release channel
                                3. Test

                                This will allow us to check whether it's your setup or XO.

                                  felibb @olivierlambert

                                  @olivierlambert right, XO vs. XOA, gotcha. XOA seems to work fine, no timeouts for about 1/2hr. I did select "Management" LAN for it.

                                  I think the next step for me would be to upgrade my old XO to bookworm + latest commit in master. Then I probably can try a fresh VM with bookworm + XO latest commit in master + interface in mgmt LAN.

                                    olivierlambert Vates 🪐 Co-Founder CEO

                                    Okay so XOA works fine on both stable & latest channels, fully up to date right? Double checking to be 100% sure 🙂

                                      felibb @olivierlambert

                                      @olivierlambert both channels seem to work fine, yes.

                                        olivierlambert Vates 🪐 Co-Founder CEO

                                        Okay so it's clearly something related to your source installation and/or an interaction with your setup 🙂 Thanks for the feedback!

                                          felibb @felibb

                                          @felibb said in XO server loses pool and hosts momentarily, timeout error:

                                          upgrade my old XO to bookworm + latest commit in master

                                          Welp, that didn't help much, still seeing timeouts. Also neither XO nor XOA show the VM's own IP in the GUI anymore. dist-upgrade renamed interface from eth0 to etX0, and I had to edit /etc/network/interfaces to get the network back up, and I can connect, but GUI still says "No IP record". Management agent 8.0.50-1 detected, in case it matters.

                                          Fresh VM setup to be tested another day.

                                            felibb @felibb

                                            (Replying to my previous post, a bit off-topic for the thread, but having installed https://github.com/xenserver/xe-guest-utilities/releases/tag/v8.4.0 manually, I see the IP in GUI now, but XOA says "Management agent 8.3.60-1 detected")
