Best posts made by nvs
-
RE: PCIe USB card (and PCIe bridge) disappear after host reboot
-
RE: PCI Passthrough with both GPU and USB
Hi,
I ran into exactly the same issue: GPU and USB PCIe cards would crash the VM when passed through together. An earlier reply already mentioned that updating would fix the issue, and I just wanted to confirm that it works for me as well. After running the following commands on my XCP-ng 8.2 host:
    yum update
    yum upgrade
everything works nicely now! Thanks!
Latest posts made by nvs
-
RE: XCP-NG server crashes/reboots unexpectedly
Regarding memory test: Just running the normal mem test from Grub should do, I guess?
-
RE: XCP-NG server crashes/reboots unexpectedly
@stormi Thanks. I've gone through kern.log.1/2/3 etc. and I can see when the server seems to have rebooted and come back up again, but there doesn't seem to be anything logged just before it goes down.
-
RE: XCP-NG server crashes/reboots unexpectedly
@olivierlambert I looked through the xensource.log.1/2/3 etc files.
What sticks out to me is that there is a gap here:
xensource.log.2's last four lines:
Nov 19 16:43:27 xcp-ng xapi: [debug||3064510 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:SR.scan D:*** created by task D:***
Nov 19 16:43:27 xcp-ng xapi: [ info||3064512 /var/lib/xcp/xapi||taskhelper] task SR.scan R:*** (uuid:***) created (trackid=***) by task D:***
Nov 19 16:43:27 xcp-ng xapi: [debug||3064512 /var/lib/xcp/xapi|SR.scan R:***|message_forwarding] SR.scan: SR = '*** (20TB HDD)'
Nov 19 16:51:30 xcp-ng xapi: [debug||3069218 /var/lib/xcp/xapi|session.slave_local_login_with_password D:***|xapi_session] Add session to local storage
Then the next four lines are in a new xensource.log.1 file, but notice the roughly 49-minute gap before them:
xensource.log.1's first four lines:
Nov 19 17:40:08 xcp-ng xenopsd-xc: [debug||5 ||xenops_server] Received an event on managed VM ***
Nov 19 17:40:08 xcp-ng xcp-rrdd: [ info||9 ||rrdd_main] memfree has changed to 4191660 in domain 9
Nov 19 17:40:08 xcp-ng xenopsd-xc: [debug||5 |queue|xenops_server] Queue.push ["VM_check_state","***"] onto ***:[ ]
Nov 19 17:40:08 xcp-ng xenopsd-xc: [debug||40 ||xenops_server] Queue.pop returned ["VM_check_state","***"]
I've redacted some UUIDs with ***, probably wasn't needed but just in case.
From the earlier graphs I expected to see no log entries here (assuming the machine was off), but it seems it was actually running most of the time. The 49-minute gap above seems to be the only longer gap I can spot at first sight in the last day's logs. Strange, because in XOA the graph looks as if the host was down for about 23h. Any thoughts?
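Rather than spotting gaps by eye, the same check can be scripted. The following is only a minimal sketch, assuming each log line starts with a syslog-style timestamp such as "Nov 19 16:43:27" and that month names are in English. The function name `find_gaps`, the 10-minute threshold and the `year` parameter are made up for illustration (the log lines carry no year, so one has to be supplied).

```python
from datetime import datetime, timedelta

def find_gaps(lines, min_gap=timedelta(minutes=10), year=2000):
    """Return (previous_ts, next_ts, gap) for consecutive entries more than min_gap apart.

    `year` is an assumption: syslog-style lines do not include it. The default
    is a leap year so that "Feb 29" still parses.
    """
    gaps = []
    prev = None
    for line in lines:
        try:
            # The timestamp prefix ("Nov 19 16:43:27") is the first 15 characters.
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev is not None and ts - prev > min_gap:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps
```

To use it on real files, the lines of the rotated logs would have to be read oldest file first (here xensource.log.2 before xensource.log.1) and passed in as one sequence, so that a gap spanning two files is also caught.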
-
RE: XCP-NG server crashes/reboots unexpectedly
@olivierlambert Hi, do you have some pointers on exactly which files to check for those?
I've looked at:
- /var/log/xensource.log, but that log seems to have started earlier today. I don't see entries back to when the reboots happened.
- Regarding IPMI logs: This machine is a Ryzen 9 5950X on an ASUS Prime X570 Pro motherboard. Unfortunately it doesn't have IPMI.
I will look into doing a memtest as you suggested.
-
RE: XCP-NG server crashes/reboots unexpectedly
I took a look at the performance graphs in XOA and the two reboots can clearly be seen. What looks interesting to me is that the server seems to have stayed offline for quite a while (the periods with no data points in the graph) and only then came back up. Also, after the 2nd reboot there seems to be a high load average, even though only one VM is on auto-start (Xen Orchestra) and no other VMs had been started yet.
Can anyone make something of this? It seems weird to me that whatever triggers a reboot of the system would not bring it straight back up, but instead takes a varying amount of time until XCP-NG+XOA is back up, according to the graphs. Based on this, it seems that after the 1st reboot it was down for ~3h, and after the 2nd reboot for ~23h.
Also note:
- This server was running stable in this exact configuration for almost 2 years now.
- I have two other pretty much identical servers that do not have this issue (same rack, same power source)
-
XCP-NG server crashes/reboots unexpectedly
Hi,
In the last 1.5 weeks my server seems to have rebooted itself at least twice. I noticed this because my VMs weren't running anymore. It looked like a fresh reboot of the server. I want to figure out the reason and fix the issue. I started looking at logs, namely /var/log/kern.log and the /var/crash folder, but both the file and the folder are completely empty.
Since kern.log and the crash folder are both empty, my conclusion is that it probably wasn't an XCP-NG crash that caused the reboots.
So it's likely power related (PSU or motherboard). Are there any other logs I should check, or do you have any other comments on my conclusions above?
Thanks!
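When the current log file is empty, older entries may still sit in rotated copies next to it (kern.log.1, kern.log.2 and so on). As a rough sketch, the helper below lists every file matching a pattern with its size and modification time, so empty or freshly rotated files stand out. The name `summarize_logs` is hypothetical, and the paths in the usage note are simply the ones mentioned in this thread.

```python
from datetime import datetime
from pathlib import Path

def summarize_logs(log_dir, pattern):
    """List files in log_dir matching pattern as (name, size_bytes, modified), oldest first."""
    rows = []
    for path in Path(log_dir).glob(pattern):
        if path.is_file():
            st = path.stat()
            rows.append((path.name, st.st_size, datetime.fromtimestamp(st.st_mtime)))
    # Sort by modification time so the rotation order is easy to read.
    return sorted(rows, key=lambda row: row[2])
```

For example, `summarize_logs("/var/log", "kern.log*")` would show whether only the current kern.log is empty, and the modification times give a first hint of when each rotation happened.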
-
RE: Questions about backup features
@CJ said in Questions about backup features:
I have one schedule that runs daily to perform a delta backup and then a second schedule with force full backup enabled that only runs once a week. Make sure you remove the weekly full backup day from the nightly schedule or you'll have both backups happening.
@CJ just curious: why would you do a full backup once a week? In principle your remote delta backup will already stay "up-to-date" with just the delta backups, right? Or am I missing something?
-
RE: PCIe USB card (and PCIe bridge) disappear after host reboot
Yeah.. this definitely was a nightmare, I am taking a few days off after this
-
RE: PCIe USB card (and PCIe bridge) disappear after host reboot
After another full day of troubleshooting it looks like I found the issue..
I installed Ubuntu Server and tested the plugged-in USB cards that were detected, to figure out which one was dropping out. It turns out that if that card is in any of the PCIe slots, it causes the issues seen. If it's not installed in the server, no cards disappear.
I've taken an identical, known-working PCIe USB card from my 2nd machine and replaced the faulty one with it. Everything seems to be working fine again. Quite interesting how a faulty card resulted in this rollercoaster of symptoms.. at least some nice lessons learned for the future
-
RE: PCIe USB card (and PCIe bridge) disappear after host reboot
Tried some more things but nothing resolved the issue:
-
Changed RAM speed from DDR4-3200 to AUTO -> Same issue
-
Put in a different GPU (removed the Nvidia K2200 GPU), but it still breaks, e.g. when going from 0 plugged-in SATA devices to plugging in the 1st SATA HDD -> Same issue
-
Reseated CPU and checked for any bent pins (looked all OK) and re-pasted it -> Same issue
-
Tried using a different output on the K2200 GPU (output 2 (DP) instead of the usual output 3 (DP)) -> Same issue
-
Tried without any GPU at all (no onboard GPU either, as this CPU doesn't have integrated graphics) -> Same issue
-
Took out the PCIe USB cards one by one (with no GPU installed at all while testing, a 10gig card in the top PCIe slot for a change, and 1x HDD attached via SATA):
^Every time I remove one and boot, it shows the correct number of PCIe USB cards the first time. Then after a reboot there is always one PCIe USB card fewer. That number then also seems to stay the same across reboots. However, when only one PCIe USB card is left, that card stays recognized and does not disappear after a reboot!
-
Reset BIOS settings (still using the latest BIOS version) by removing the battery and shorting the RTC reset pins. Left the BIOS at untouched defaults and booted into XCP-NG -> Same issue
-
Removed all RAM modules and installed just one RAM stick -> Same issue
-
Downgraded BIOS to version 4408 and left at BIOS defaults -> Same issue
It looks like the system likes eating the PCIe USB cards. I will try ASUS customer support tomorrow but I am not expecting much from that..
Could this be an IRQ conflict? What still baffles me is that the issue isn't resolved if the machine is shut off for, say, 30 secs, but is resolved after it has been off for 10 minutes. It would then usually boot up with all cards recognized again.. In the back of my mind I am imagining some hardware failure that depends on something capacitively charged, which could explain such time-delayed behaviour.. Any thoughts/other ideas?
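To keep track of which card vanishes after each reboot, one option is to save the plain `lspci` output before and after and compare the two snapshots. The sketch below assumes the default `lspci` line layout (PCI address, a space, then the description). The function names `parse_lspci` and `diff_devices` are made up for illustration.

```python
def parse_lspci(text):
    """Map PCI address -> description from default `lspci` output."""
    devices = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Default layout: "<bus>:<device>.<function> <class>: <vendor and model>"
        address, _, description = line.partition(" ")
        devices[address] = description
    return devices

def diff_devices(before, after):
    """Return (missing, added): devices only in `before`, and only in `after`."""
    missing = {a: d for a, d in before.items() if a not in after}
    added = {a: d for a, d in after.items() if a not in before}
    return missing, added
```

One caveat: when a bridge disappears, the bus numbers of the devices behind it or after it can shift between boots, so a device may show up as both missing and added under a different address. Comparing the descriptions as well helps to tell those cases apart.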
-