XCP simply hangs
-
Hey folks,
I am at a loss here. For some years I was happily running XCP on my Intel NUC 7 with 32 GB of RAM and an SSD. For some time now, every once in a while, it has simply stopped working. So I went all out in analysis mode.
Here are my findings:
- I can see no entries in any of the logs (/var/log/*). The log just stops, and the next entries are the boot-up messages.
- There are no entries in the BIOS event log (no thermal shutdown or the like).
So I moved all the VMs (16 GB used, nearly no CPU in use) to a full-sized rack server (running XCP, too). So I am kind of up and running, but that hardware can't stay here forever.
So I debugged the little NUC as best as I could:
- I swapped the RAM for two new modules.
- I ran memtest for a day without an issue.
- I installed AlmaLinux 9 and let cpuburn run for half a day without an issue.
- I let fio really work the SSD (read-only, though).
I was unable to break the system, so it's not memory or cooling.
So I installed a clean XCP 8.2 and let it update to current.
- I ran a single VM (1 CPU, 8 GB RAM) and let memtest run in it for 10 minutes. No issues.
- I cloned the VM and let it run in parallel. 10 minutes, no issues.
- Same for VM 3.
I was able to let this run for hours. Now came VM 4. I reduced its memory size to leave room for the dom0 memory, and with ~5 GB of RAM I let it run, too.
It crashed almost immediately, after 2 minutes.
I restarted the host and ran all four VMs again, started in parallel: dead within a few moments.
At this point I wiped the host clean again and installed a fresh 8.0 (a downgrade from 8.2.1) and, after patching, let it run again. Same results.
Currently there is a single VM using ALL the RAM (minus the dom0 memory) and all the CPUs. No issue found yet (17 minutes in).
So to sum up:
- It's probably not a memory issue (tested & swapped).
- It's probably not a CPU or heat issue.
- Stress tests outside of XCP pass, even in similar OS environments.
- Allocating all CPUs and memory to a single memtest VM seems to work.
- Splitting the above across 4 VMs makes it crash.
- No logs, just hanging.
Oh, I updated the BIOS to version 88 (that's from March-ish this year).
Could anyone help me debug this further, or does anyone have an idea?
-
"Hangs" means no keyboard input works, right? The screen is frozen?
-
@olivierlambert Thanks for replying. It is totally dead. I can see the xsconsole, but it's dead. No extra lines or printouts. I even tried switching to tty2 (3?) with the system messages, which remained empty.
-
So the keyboard input is working since you can change to other TTYs?
-
No, it hangs.
Once it stops, nothing works. To be able to see the other TTYs you have to reboot the host, go to the desired TTY, then run the tests again. If it crashes, you're stuck there.
-
This issue is closed but unresolved.
I moved to different hardware.
-
Having a serial console might help in those situations. @stormi do we have any guide for advanced debugging?
-
@olivierlambert I don't think so
-
It would be interesting to get some pointers (like serial debug and so on). I wonder if there are already some resources from the Xen Project somewhere.
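
On the capture side, serial debugging could look roughly like the sketch below. It runs on a second machine wired to the host's serial port and timestamps every line it receives, so the last messages before a hang survive on another box. This is only an illustration, not an official XCP-ng or Xen tool. It assumes the host is already sending its Xen and dom0 console output to a serial port (set up through boot options not shown here) and that the receiving port is already configured for the right baud rate. The names `log_console` and `capture`, and the example paths, are made up for this sketch.

```python
from datetime import datetime, timezone
from typing import BinaryIO, Callable, TextIO


def log_console(
    stream: BinaryIO,
    out: TextIO,
    now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
) -> int:
    """Copy lines from a console stream to `out`, each prefixed with a UTC timestamp.

    Returns the number of lines written. Stops when the stream reports end of file.
    """
    count = 0
    for raw in iter(stream.readline, b""):
        # Console output can contain stray bytes; replace them instead of aborting.
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        out.write(f"{now().isoformat()} {text}\n")
        # Flush after every line so the log on disk is current when the host hangs.
        out.flush()
        count += 1
    return count


def capture(device: str, logfile: str) -> int:
    """Read from a serial device (hypothetical path) and append to a log file."""
    with open(device, "rb", buffering=0) as port, \
            open(logfile, "a", encoding="utf-8") as log:
        return log_console(port, log)
```

Something like `capture("/dev/ttyUSB0", "nuc-hang.log")` would then be left running while the four-VM test is repeated; both paths are placeholders. The timestamp of the last captured line gives a rough time for the hang, even if the hypervisor prints nothing useful before it stops.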