XO server loses pool and hosts momentarily, timeout error
-
XO server: 2 vCPU, 4GiB RAM
OS: Debian 11 / 5.10.0-28-amd64 #1 SMP Debian 5.10.209-2 (2024-01-31) x86_64 GNU/Linux
Node.js version: v18.20.2
Yarn version: 1.22.19
XO version: https://github.com/vatesfr/xen-orchestra/commit/771b04acc4480cf138a0c476968d7c613bb8147d
XCP-NG server version: 8.2.1
Environment: 3 hosts, HA, shared storage

The problem is that the pool, hosts, and VMs (all inventory except one manually added server) seem to disappear from the web UI every 3-6 min, only to reappear automagically after exactly 1 min.
There were no network changes that could explain timeouts. Everything was working fine until last week. In fact, I think it started after the host patch/update to 8.2.1 (I don't recall from which version), the only significant change I made, but there are no errors in the server logs.
xo-server logs this when it loses the pool (but nothing when the pool reappears):

May 8 17:06:32 xo-ce xo-server[328]: _watchEvents TimeoutError: operation timed out
May 8 17:06:32 xo-ce xo-server[328]:     at Promise.timeout (/opt/xo/xo-builds/xen-orchestra-202405070909/node_modules/promise-toolbox/timeout.js:11:16)
May 8 17:06:32 xo-ce xo-server[328]:     at Xapi.apply (file:///opt/xo/xo-builds/xen-orchestra-202405070909/packages/xen-api/index.mjs:773:37)
May 8 17:06:32 xo-ce xo-server[328]:     at Xapi._call (/opt/xo/xo-builds/xen-orchestra-202405070909/node_modules/limit-concurrency-decorator/src/index.js:85:24)
May 8 17:06:32 xo-ce xo-server[328]:     at Xapi._watchEvents (file:///opt/xo/xo-builds/xen-orchestra-202405070909/packages/xen-api/index.mjs:1198:31) {
May 8 17:06:32 xo-ce xo-server[328]:   call: {
May 8 17:06:32 xo-ce xo-server[328]:     method: 'event.from',
May 8 17:06:32 xo-ce xo-server[328]:     params: [ [Array], '00000000000063727552,00000000000063699698', 60.1 ]
May 8 17:06:32 xo-ce xo-server[328]:   }
May 8 17:06:32 xo-ce xo-server[328]: }
May 8 17:09:32 xo-ce xo-server[328]: _watchEvents TimeoutError: operation timed out
May 8 17:09:32 xo-ce xo-server[328]:     at Promise.timeout (/opt/xo/xo-builds/xen-orchestra-202405070909/node_modules/promise-toolbox/timeout.js:11:16)
May 8 17:09:32 xo-ce xo-server[328]:     at Xapi.apply (file:///opt/xo/xo-builds/xen-orchestra-202405070909/packages/xen-api/index.mjs:773:37)
May 8 17:09:32 xo-ce xo-server[328]:     at Xapi._call (/opt/xo/xo-builds/xen-orchestra-202405070909/node_modules/limit-concurrency-decorator/src/index.js:85:24)
May 8 17:09:32 xo-ce xo-server[328]:     at Xapi._watchEvents (file:///opt/xo/xo-builds/xen-orchestra-202405070909/packages/xen-api/index.mjs:1198:31) {
May 8 17:09:32 xo-ce xo-server[328]:   call: {
May 8 17:09:32 xo-ce xo-server[328]:     method: 'event.from',
May 8 17:09:32 xo-ce xo-server[328]:     params: [ [Array], '00000000000063727963,00000000000063699698', 60.1 ]
May 8 17:09:32 xo-ce xo-server[328]:   }
May 8 17:09:32 xo-ce xo-server[328]: }
Some xcp-ng forum posts from 2023 mentioned downgrading to Node.js v18 as a solution to a similar timeout issue, but I am already on v18. I would be grateful for any hints and can share more info.
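If it helps with debugging, the cadence can be quantified by diffing the timestamps of the TimeoutError lines. The sketch below uses a made-up sample log; in practice you would pipe in the real output of `journalctl -u xo-server` (or your syslog file):

```shell
#!/bin/sh
# Print the gap, in minutes, between consecutive _watchEvents
# TimeoutError occurrences. The sample log is a stand-in for real
# xo-server output (e.g. `journalctl -u xo-server --since today`).
cat > /tmp/xo-sample.log <<'EOF'
May 8 17:06:32 xo-ce xo-server[328]: _watchEvents TimeoutError: operation timed out
May 8 17:09:32 xo-ce xo-server[328]: _watchEvents TimeoutError: operation timed out
May 8 17:14:02 xo-ce xo-server[328]: _watchEvents TimeoutError: operation timed out
EOF

grep '_watchEvents TimeoutError' /tmp/xo-sample.log |
  awk '{
    split($3, t, ":")                      # field 3 is HH:MM:SS
    secs = t[1] * 3600 + t[2] * 60 + t[3]
    if (prev) printf "gap: %.1f min\n", (secs - prev) / 60
    prev = secs
  }'
# prints:
#   gap: 3.0 min
#   gap: 4.5 min
```

A regular gap pattern (vs. random drops) would point at something periodic, e.g. the event long poll itself rather than flaky networking.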
-
Hi,
First, as stated in our doc, be sure to test with the latest commit (which is not your case right now).
If it's still an issue, try testing with an older commit, e.g. from 1 month ago, and repeat going back over the last few months. If you still have the issue, it might not be related to XO but to XCP-ng? Hard to tell, but XO is the easiest side to check.
-
Same issue with the latest commit. Hunting for a commit that may or may not work is a wild goose chase, and I don't really have the time for it, especially since I agree it is hard to tell, and XCP-ng can easily be the culprit here; I hope I didn't imply that XO has to be at fault. I just didn't see any errors in /var/log/xensource.log, but maybe I wasn't looking in the right place. I was more hoping for some debugging hints I didn't think of myself.
-
Sadly, since we can't reproduce it, it would be very helpful if you had time to try a few other commits and see whether the behavior changes. We have some potential ideas about the cause, so trying a commit from before we swapped to "undici" as the HTTP lib could be helpful. @julien-f might provide some commits to test.
-
git bisect between those 2 commits could be your friend. @julien-f explained it here: https://xcp-ng.org/forum/post/58981
-
@olivierlambert thanks for the tip. Looks like bfb8d3b29e4f9531dda368f6624652479682b69d is the culprit, and its commit message mentions "http-request-plus → undici", which seems to be what you referred to above. Some earlier commits had weird glitches like not displaying any VMs / any storage, but they did not time out.
-
@felibb There were some issues with undici that were resolved in a later commit 0794a63 (early April). It might be worth trying after that fix too.
-
Thanks @felibb for the feedback, this will indeed be helpful for @julien-f to track it down. It's weird that we can't reproduce it here, but at least we now know it comes from undici.
The main question is why it happens to you and a few other people, but not to everyone else.
- Can you try to use XOA in the "latest" release channel in the same environment and see if you also have the issue?
- Is your XO far away from the pool in terms of network latency?
- Your OS is Debian 11, IDK if that could cause the problem (XOA is on Debian 12).
At least, let's start with 1: that should help us determine whether it's related to your environment OR to something in XO's code interacting with your environment.
-
@felibb We've been unable to reproduce it so far; I'm waiting for someone else's confirmation before attempting to fix it on master.
If you can, please test the xen-api-blocking branch and let me know if that helps.
-
Weird advice, but I got the same problem when an XO CR copy started and caused an IP conflict with the main XO.
-
That's not weird, an IP conflict could also explain this issue.
-
@olivierlambert of course it could. I just advise checking it, just in case.
-
And it's good advice
-
@Andrew said in XO server loses pool and hosts momentarily, timeout error:
some issues with undici that were resolved in a later commit 0794a63
Tried 79c9ef0 (1 day older than 0794a63), seeing timeouts.
@olivierlambert said in XO server loses pool and hosts momentarily, timeout error:
- Can you try to use XOA in the "latest" release channel in the same environment and see if you also have the issue?
I'm not sure I understand what you are referring to; can you please clarify?
- Is your XO far away from the pool in terms of network latency?
I would expect the latency to be quite low: the XO VM lives on the same pool and has an IP in the same subnet as the 10G x2 bonded interface on each host. This is, however, not the same 1G network as the one marked with the "Management" blue bubble in the host's Network tab; these two are different subnets. Can this have an effect?
- Your OS is Debian 11, IDK if that could cause the problem (XOA is on Debian 12).
dist-upgrade is fast and easy, I can definitely try that.
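For reference, the core of a bullseye to bookworm dist-upgrade is swapping the release codename in the apt sources (plus the usual update/full-upgrade/reboot cycle). A sketch on a scratch copy rather than the live /etc/apt/sources.list; the mirror URLs here are the stock Debian defaults, not taken from my actual system:

```shell
#!/bin/sh
# Demonstrate the sources.list edit on a scratch copy, NOT the real file.
set -e
cat > /tmp/sources.list <<'EOF'
deb http://deb.debian.org/debian bullseye main contrib
deb http://security.debian.org/debian-security bullseye-security main
deb http://deb.debian.org/debian bullseye-updates main
EOF

sed -i 's/bullseye/bookworm/g' /tmp/sources.list
cat /tmp/sources.list

# The real procedure (as root, after a snapshot/backup) would then be:
#   sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list
#   apt update && apt full-upgrade
#   reboot
# Note: Debian 12 also moved firmware into a new "non-free-firmware"
# component, worth adding if you rely on non-free firmware packages.
```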
@julien-f said in XO server loses pool and hosts momentarily, timeout error:
If you can, please test the xen-api-blocking branch and let me know if that helps.
ce15ef6 deployed, seeing timeouts.
-
@felibb I'm talking about using our pre-baked/turnkey virtual appliance, which you can easily deploy from https://vates.tech/deploy
- Register
- Update and select "latest" release channel
- Test
This will allow us to check whether it's your setup or XO.
-
@olivierlambert right, XO vs. XOA, gotcha. XOA seems to work fine, no timeouts for about half an hour. I did select the "Management" LAN for it.
I think the next step for me would be to upgrade my old XO to bookworm + the latest commit in master. Then I can probably try a fresh VM with bookworm + the latest XO commit in master + an interface in the mgmt LAN.
-
Okay, so XOA works fine on both the stable & latest channels, fully up to date, right? Double-checking to be 100% sure.
-
@olivierlambert both channels seem to work fine, yes.
-
Okay, so it's clearly something related to your source installation and/or an interaction with your setup. Thanks for the feedback!