v8.2.1 rolling pool update getting stuck

Greg_E

I'm doing the latest updates on my production pool, and found something that is really just an annoyance more than a problem.

3 host pool with Intel Silver v2 and 128GB of RAM in each host.

Click on the pool, go to updates, hit the rolling pool update button. VMs migrate out as normal, update gets transferred to Master as normal. After this step, the Master took over 30 minutes to reboot.

Upon reboot, VMs did not migrate back. The rolling pool update task did not pick up where it left off. I manually migrated a single VM back to Master to see if things were really still working, and it seemed OK.

Went to lunch to see if it would sort itself out. Nope.

Came back from lunch, went to the third host and entered maintenance mode to migrate all the VMs. When that was done I told it to install updates, and when that was done I told it to reboot. I've been waiting about 20 minutes so far and am guessing it will be a full 30 before it reboots.

I say this is an annoyance more than a problem because we need to be on XCP-ng 8.3 in another two months, so I'm not sure this is worth a fix as long as everything functions once I have all three hosts updated.

olivierlambert

Hi,

I'm not sure I follow. Are you sure the RPU was interrupted in the first place? If it was, do you know why? (Check the logs.)

Greg_E

@olivierlambert

I let it sit for an hour while I was at lunch. If it didn't pick up by then, it wasn't going to restart. I haven't had time to look through the logs, but I'd guess there was a timer that expired since it was taking so long to reboot the Master host.

All three of the hosts took about 30 minutes before reboot.

And then the load balancing was all wrong. When I finished the updates I did a rolling pool reboot. It refused to put any VMs on host 2 during this process, and once the pool reboot finished it spent about 30 minutes migrating VMs for what appeared to be no reason, twice ending up with all VMs on host 1 or host 3. After it finally stopped migrating, I manually migrated a few VMs over to host 2 and left it alone for the night. Just got back in and will be checking things this morning.

olivierlambert

Do you have any SR that depends on a VM (like an ISO SR on an NFS share hosted inside a VM)? That freezes NFS and makes the host take half an hour to restart.
The logs should tell you why the RPU failed, if it failed.
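
For anyone who wants to check for that SR-inside-a-VM situation, here is a rough Python sketch of the idea. It does not talk to XAPI or Xen Orchestra. It assumes you have already collected, by whatever means, the server address of each SR and the IP addresses of each VM in the pool. The `StorageRepo` and `Vm` classes and the `srs_hosted_in_vms` function are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StorageRepo:
    name: str
    server: str  # address of the remote share; empty string for local storage


@dataclass
class Vm:
    name: str
    addresses: List[str]  # IP addresses reported for this VM


def srs_hosted_in_vms(srs: List[StorageRepo], vms: List[Vm]) -> List[Tuple[str, str]]:
    """Return (sr_name, vm_name) pairs where an SR's server address belongs to a pool VM."""
    # Map every known VM address to the VM that owns it.
    owner = {}
    for vm in vms:
        for addr in vm.addresses:
            owner[addr] = vm.name
    # An SR is suspect when its server address is one of those VM addresses.
    return [(sr.name, owner[sr.server]) for sr in srs if sr.server and sr.server in owner]


if __name__ == "__main__":
    # Made-up inventory, purely for illustration.
    srs = [
        StorageRepo("ISO library", "192.168.1.50"),
        StorageRepo("Main NFS", "192.168.1.10"),
        StorageRepo("Local storage", ""),
    ]
    vms = [
        Vm("fileserver", ["192.168.1.50"]),
        Vm("web", ["192.168.1.60"]),
    ]
    for sr_name, vm_name in srs_hosted_in_vms(srs, vms):
        print(f"SR '{sr_name}' is served from VM '{vm_name}'")
```

Any SR this flags is a candidate for the slow reboot: once the VM serving the share has been migrated or shut down, the host can be left waiting on an NFS mount that no longer answers. The comparison is on literal address strings, so an SR configured by hostname would need resolving first.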