Rolling Pool Update - host took too long to restart
-
@DustinB It's not a faulty disk. It appears to be running memory tests at boot time, and doing the same thing again at other points after init.
The cluster is a pair of HPE ProLiant DL580 Gen9 servers, each with 2TB of RAM.
Yes, I could turn off memory checking during startup, but I'd rather not.
Danny
-
Ping @pdonias: what is the current value? Could we raise it to something even longer?
-
@olivierlambert By default, it's 20 minutes. And it's already configurable through xo-server's config by adding:
[xapiOptions]
restartHostTimeout = '40 minutes'
-
I got that issue too. Sometimes a server restart takes longer than usual, so the rolling update is canceled by the timeout.
Why does it take so long? I don't know; maybe some startup checks. I can't restart production hosts just to see what's different. Is the timeout really required?
-
Our Dell R630s with 512GB of RAM also take a while to reboot, so yes, being able to adjust the value is great.
-
@Tristis-Oris said in Rolling Pool Update - host took too long to restart:
I got that issue too. Sometimes a server restart takes longer than usual, so the rolling update is canceled by the timeout.
Why does it take so long? I don't know; maybe some startup checks. I can't restart production hosts just to see what's different. Is the timeout really required?
If they have ECC memory, the server will check it, collect diagnostics and so on at boot; that's pretty common on enterprise servers.
-
Thanks @pdonias, I forgot about this and didn't check the docs. Have we documented it there too?
-
@olivierlambert It doesn't look like we did. It's documented in the config file, but we can add it to the RPU doc too if necessary.
-
Let's do that then; it should save us a thread or two in here.
-
@olivierlambert I've made the needed adjustment in the build script to override the default. Now I'm waiting for the next set of patches to test it.
Thanks all.
-
I just installed the latest updates, and the rolling update was again canceled by the timeout. Since that never happened before, I think it started after some updates about 2-3 months ago.
-
It's hard to give an answer because we can't see inside your infrastructure. How long did your host take to reboot in the end?
-
@olivierlambert According to monitoring, it takes 10 minutes. >.<
Maaaybe some disabled VMs were started after the reboot, so there wasn't enough memory for the rolling update. But the previous time, the reboot really did take very long.
I also see a lack of logs here: nothing tells me that the rolling update was canceled.
-
We are introducing an XO task to monitor the RPU process. That will make it easier to track the whole process.
-
Next pool: almost empty, enough memory for the rolling update, and the reboot takes 5 minutes.
The 2nd host was not updated.
-
@olivierlambert said in Rolling Pool Update - host took too long to restart:
We are introducing an XO task to monitor the RPU process. That will make it easier to track the whole process.
This should also be displayed in the pool overview, to make sure other admins don't start other tasks by mistake.
In a perfect world there would be an indicator icon next to the pool name, and also a warning in the general pool overview with some kind of notification, like "Pool upgrade in progress - Wait for it to complete before starting new tasks" or similar, in a very red/orange/other highly visible color that you can't miss.
-
Good idea, likely for XO 6 (because these are XO tasks, not XAPI tasks, and I don't think we planned for the task to have an impact on XAPI objects). But that's a nice idea. Adding @pdonias to the loop.
-
@olivierlambert I finally had a chance to apply patches to the two ProLiant servers with the 20-minute boot time, and everything worked as expected.
-
Yay!! Thanks for the feedback @dsiminiuk !
-
This still seems to be an issue for me.
Running XO CE 9321b. The last few updates on 2 different clusters have resulted in failed rolling updates, the most recent one today.
The odd part is that the server is back up in about 8 minutes and I can connect to it via XO, yet the system never reconnects to complete the upgrade process. As a result, I have to manually apply patches, move VMs and restore HA.
Our xo-server/config.toml file has been updated with `restartHostTimeout = '60 minutes'` to try to overcome this, but to no avail.
{
  "id": "0m1s8k827",
  "properties": { "poolId": "62d8471c-e515-0d7a-d77f-5ac38a945507", "poolName": "TESTING-POOL-01", "progress": 33, "name": "Rolling pool update", "userId": "61ea8f96-4e67-468f-a0cf-9d9711482a42" },
  "start": 1727895825871,
  "status": "failure",
  "updatedAt": 1727897834288,
  "tasks": [
    {
      "id": "go7y3ilbo9u",
      "properties": { "name": "Listing missing patches", "total": 3, "progress": 100 },
      "start": 1727895825873,
      "status": "success",
      "tasks": [
        { "id": "wesndzbzb0r", "properties": { "name": "Listing missing patches for host dd36b6f8-ffff-4310-aa29-66f312c83930", "hostId": "dd36b6f8-ffff-4310-aa29-66f312c83930", "hostName": "TESTING-03" }, "start": 1727895825873, "status": "success", "end": 1727895825874 },
        { "id": "cib36mtqu1k", "properties": { "name": "Listing missing patches for host 6fb0707f-2655-4ef2-a0a2-13abe70e3077", "hostId": "6fb0707f-2655-4ef2-a0a2-13abe70e3077", "hostName": "TESTING-02" }, "start": 1727895825873, "status": "success", "end": 1727895825874 },
        { "id": "9zk1dabnwuj", "properties": { "name": "Listing missing patches for host 1f4b8cd7-e9da-414e-8558-8059a3165b98", "hostId": "1f4b8cd7-e9da-414e-8558-8059a3165b98", "hostName": "TESTING-01" }, "start": 1727895825874, "status": "success", "end": 1727895825874 }
      ],
      "end": 1727895825874
    },
    {
      "id": "ayiur0vq5bn",
      "properties": { "name": "Updating and rebooting" },
      "start": 1727895825874,
      "status": "failure",
      "tasks": [
        {
          "id": "7l55urnrsdj",
          "properties": { "name": "Restarting hosts", "progress": 22 },
          "start": 1727895835047,
          "status": "failure",
          "tasks": [
            {
              "id": "y7ba4451qg",
              "properties": { "name": "Restarting host 1f4b8cd7-e9da-414e-8558-8059a3165b98", "hostId": "1f4b8cd7-e9da-414e-8558-8059a3165b98", "hostName": "TESTING-01" },
              "start": 1727895835047,
              "status": "failure",
              "tasks": [
                { "id": "qtt61zspfwl", "properties": { "name": "Evacuate", "hostId": "1f4b8cd7-e9da-414e-8558-8059a3165b98", "hostName": "TESTING-01" }, "start": 1727895835285, "status": "success", "end": 1727896526200 },
                { "id": "3i41dez1f3q", "properties": { "name": "Installing patches", "hostId": "1f4b8cd7-e9da-414e-8558-8059a3165b98", "hostName": "TESTING-01" }, "start": 1727896526200, "status": "success", "end": 1727896634080 },
                { "id": "jmhv7qbyyid", "properties": { "name": "Restart", "hostId": "1f4b8cd7-e9da-414e-8558-8059a3165b98", "hostName": "TESTING-01" }, "start": 1727896634110, "status": "success", "end": 1727896634248 },
                {
                  "id": "ky40100any9",
                  "properties": { "name": "Waiting for host to be up", "hostId": "1f4b8cd7-e9da-414e-8558-8059a3165b98", "hostName": "TESTING-01" },
                  "start": 1727896634248,
                  "status": "failure",
                  "end": 1727897834282,
                  "result": { "message": "Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart", "name": "Error", "stack": "Error: Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:152:17\n at /opt/xen-orchestra/@vates/task/index.js:54:40\n at AsyncLocalStorage.run (node:async_hooks:346:14)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:41)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:31)\n at Function.run (/opt/xen-orchestra/@vates/task/index.js:54:27)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:142:24\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:112:11\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at Xapi.rollingPoolReboot (file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:102:5)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/patching.mjs:524:7\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at XenServers.rollingPoolUpdate (file:///opt/xen-orchestra/packages/xo-server/src/xo-mixins/xen-servers.mjs:703:5)\n at Xo.rollingUpdate (file:///opt/xen-orchestra/packages/xo-server/src/api/pool.mjs:243:3)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)" }
                }
              ],
              "end": 1727897834282,
              "result": { "message": "Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart", "name": "Error", "stack": "Error: Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:152:17\n at /opt/xen-orchestra/@vates/task/index.js:54:40\n at AsyncLocalStorage.run (node:async_hooks:346:14)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:41)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:31)\n at Function.run (/opt/xen-orchestra/@vates/task/index.js:54:27)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:142:24\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:112:11\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at Xapi.rollingPoolReboot (file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:102:5)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/patching.mjs:524:7\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at XenServers.rollingPoolUpdate (file:///opt/xen-orchestra/packages/xo-server/src/xo-mixins/xen-servers.mjs:703:5)\n at Xo.rollingUpdate (file:///opt/xen-orchestra/packages/xo-server/src/api/pool.mjs:243:3)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)" }
            }
          ],
          "end": 1727897834282,
          "result": { "message": "Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart", "name": "Error", "stack": "Error: Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:152:17\n at /opt/xen-orchestra/@vates/task/index.js:54:40\n at AsyncLocalStorage.run (node:async_hooks:346:14)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:41)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:31)\n at Function.run (/opt/xen-orchestra/@vates/task/index.js:54:27)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:142:24\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:112:11\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at Xapi.rollingPoolReboot (file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:102:5)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/patching.mjs:524:7\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at XenServers.rollingPoolUpdate (file:///opt/xen-orchestra/packages/xo-server/src/xo-mixins/xen-servers.mjs:703:5)\n at Xo.rollingUpdate (file:///opt/xen-orchestra/packages/xo-server/src/api/pool.mjs:243:3)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)" }
        }
      ],
      "end": 1727897834288,
      "result": { "message": "Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart", "name": "Error", "stack": "Error: Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:152:17\n at /opt/xen-orchestra/@vates/task/index.js:54:40\n at AsyncLocalStorage.run (node:async_hooks:346:14)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:41)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:31)\n at Function.run (/opt/xen-orchestra/@vates/task/index.js:54:27)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:142:24\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:112:11\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at Xapi.rollingPoolReboot (file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:102:5)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/patching.mjs:524:7\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at XenServers.rollingPoolUpdate (file:///opt/xen-orchestra/packages/xo-server/src/xo-mixins/xen-servers.mjs:703:5)\n at Xo.rollingUpdate (file:///opt/xen-orchestra/packages/xo-server/src/api/pool.mjs:243:3)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)" }
    }
  ],
  "end": 1727897834288,
  "result": { "message": "Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart", "name": "Error", "stack": "Error: Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:152:17\n at /opt/xen-orchestra/@vates/task/index.js:54:40\n at AsyncLocalStorage.run (node:async_hooks:346:14)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:41)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:31)\n at Function.run (/opt/xen-orchestra/@vates/task/index.js:54:27)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:142:24\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:112:11\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at Xapi.rollingPoolReboot (file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:102:5)\n at file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/patching.mjs:524:7\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)\n at XenServers.rollingPoolUpdate (file:///opt/xen-orchestra/packages/xo-server/src/xo-mixins/xen-servers.mjs:703:5)\n at Xo.rollingUpdate (file:///opt/xen-orchestra/packages/xo-server/src/api/pool.mjs:243:3)\n at Task.runInside (/opt/xen-orchestra/@vates/task/index.js:169:22)\n at Task.run (/opt/xen-orchestra/@vates/task/index.js:153:20)" }
}