Server Locks Up Periodically with ASRock X570D4I-2T AMD Ryzen 9 3900X and Intel X550-AT2
-
I’ve been experiencing periodic server lockups with my setup, and I’m at a loss for what’s causing the issue. Here are the details of my configuration:
Hardware:
Server: OnLogic MK150B-40
CPU: AMD Ryzen 9 3900X
Motherboard: ASRock Rack X570D4I-2T
Ethernet: 2 x RJ45 10GLAN (Intel X550-AT2)
Software:
XCP-NG version: 8.3 (latest)
Symptoms:
The server becomes completely unresponsive after running for a few hours to a few days.
The console is entirely frozen (no keyboard input works).
The server cannot be accessed via its assigned IP address.
Fans keep running, and the power LED stays on.
I ran memtest on the RAM several times, and also replaced the RAM with new modules. I did get several memtest failures initially, which seemed strange as the RAM modules were all brand new.
I didn't see the motherboard or the NIC flagged as problematic on the XCP-NG hardware compatibility list, other than a note that the Intel X550 series doesn't advertise 2.5Gb or 5Gb, but I'm using 1G and 10G anyway.
I checked the logs (/var/log/messages, /var/log/xensource.log) for anything obvious before the crashes but couldn’t identify any clear issue.
Are additional or alternate drivers recommended?
Should I try using an alternate kernel?
Are there additional debugging steps or logs I should check to diagnose what’s causing the periodic lockups?
Thanks in advance for your help!
-
@R2rho Do you have a way to view the console output when this happens? Curious if you had a display attached, you may see the remnants of the crash.
And I presume you can't get ping responses from it right? The other thought is maybe it's lost network connectivity but isn't actually a fully locked up host.
There is also info here on the log files you can check.
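One way to tell "lost network" apart from "fully locked up" from another machine is to probe a TCP port instead of relying on ping alone. This is only a minimal sketch: it assumes SSH is listening on port 22 on the host, and the function name `tcp_reachable` and the address in the usage comment are made up for illustration.

```python
import socket


def tcp_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 (SSH) is an assumption; use whichever service the host
    is known to expose.
    """
    try:
        # create_connection resolves the address and connects in one step
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks
        return False


# Example usage (192.0.2.10 is a placeholder documentation address):
# print(tcp_reachable("192.0.2.10"))
```

If the host answers neither ping nor a TCP probe while the local console is also frozen, that is consistent with a full lockup rather than just a dead NIC, though it doesn't prove it.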
-
@planedrop No I don't have the ability to see the console output, the server itself becomes completely unresponsive even to keyboard input via USB, so I can't see anything or access any logs immediately after the crash. I do have a display attached, but I didn't have the shell open, just the main console. Once I reboot it then I can navigate back to it and see the logs. I couldn't see anything particularly interesting in the logs the last time I looked, but I also don't know what to look for, so I'll go back and get the log output tomorrow and provide them here. I can also leave the shell open on the display with tail -f /var/log/xensource.log and see if I can capture what happens right at the freeze.
-
Assuming the BIOS and everything else is updated to the latest version possible, the first thing I do with these kinds of symptoms is disable all power management and/or C-states in the BIOS.
Some combinations of OS and hardware just don't work properly.
If nothing else, it's an easy, non-intrusive test to do.
Update: I see that your motherboard has an IPMI interface. If the issue happens again after you've disabled power management/C-states, you could use the IPMI's remote functionality to hopefully get some more info from the sensors and such.
-
@probain @planedrop Here is the log file, on December 6th it looks like it just froze around line 12190. This morning on Dec 9th I hard rebooted it and copied these log files. Before I hard rebooted it, even though the server was still on and had the XCP-NG console up on the display, it was not responding to any keyboard input and I couldn't find it on the network. I pulled the log file from /var/log/xensource.log and uploaded here. It was a bit long so I trimmed out the end of the Dec 9 portion of the logs to fit the file size upload limit. I'll see if this server has some power management settings that can be disabled. Appreciate your help, I have no idea what to look for in these logs.
Dec 6 10:50:31 xcp-ng-host xapi: [ info||3551 /var/lib/xcp/xapi|session.logout D:633480054271|xapi_session] Session.destroy trackid=11fe74c1adf8a4ab19fff8c5956e6896
Dec 6 10:50:31 xcp-ng-host xapi: [debug||3552 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:session.slave_login D:1607dbc2d558 created by task D:38c5fbdd7624
Dec 6 10:50:31 xcp-ng-host xapi: [ info||3552 /var/lib/xcp/xapi|session.slave_login D:fc3d92b09f17|xapi_session] Session.create trackid=f33f6493a5cd650abf4a35e26ffc323d pool=true uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Dec 6 10:50:31 xcp-ng-host xapi: [debug||3553 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:pool.get_all D:f13a7669aea0 created by task D:fc3d92b09f17
Dec 6 10:50:31 xcp-ng-host xapi: [debug||3554 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:event.from D:253d3c855775 created by task D:38c5fbdd7624
Dec 6 10:50:54 xcp-ng-host xapi: [debug||355 |xapi events D:9f1127ca7f99|dummytaskhelper] task timeboxed_rpc D:4e49110cce3d created by task D:9f1127ca7f99
Dec 6 10:50:54 xcp-ng-host xapi: [debug||3555 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:event.from D:2e3c7d2e3b04 created by task D:9f1127ca7f99
Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||memory] squeezed version 24.19.2 starting
Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||squeezed] Parsing [http]
Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||squeezed] use-switch = true (true if the message switch is to be enabled)
Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||squeezed] switch-path = /var/run/message-switch/sock (Unix domain socket path on localhost where the message switch is listening)
Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||squeezed] search-path = (Search path for resources)
Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||squeezed] pidfile = /var/run/squeezed.pid (Filename to write process PID)
Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||squeezed] log = syslog:squeezed (Where to write log messages)
Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||squeezed] daemon = false (True if we are to daemonise)
Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||squeezed] disable-logging-for = http (A space-separated list of debug modules to suppress logging from)
Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||squeezed] loglevel = debug (Log level)
[xensource.txt](/forum/assets/uploads/files/1733758359408-xensource.txt)
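For anyone who doesn't know what to look for in a log like this: the jump from Dec 6 10:50:54 straight to the Dec 9 boot messages is the freeze itself, and gaps like that can be found automatically. Below is a rough sketch, not an official XCP-ng tool. The function names are made up, the 10-minute threshold is arbitrary, and the year is an assumption because syslog timestamps don't carry one (so a log spanning New Year would need extra handling).

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year=2024):
    """Parse a leading 'Dec 6 10:50:31' stamp; the year is assumed, not logged."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        stamp = f"{year} {parts[0]} {parts[1]} {parts[2]}"
        return datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
    except ValueError:
        # Line does not start with a timestamp (e.g. a wrapped message)
        return None


def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs where the log went silent too long."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_syslog_time(line)
        if ts is None:
            continue
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


# Two lines taken from the excerpt above
sample = [
    "Dec 6 10:50:54 xcp-ng-host xapi: [debug||3555 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:event.from",
    "Dec 9 07:53:26 xcp-ng-host squeezed: [debug||0 ||memory] squeezed version 24.19.2 starting",
]
for last_seen, next_seen in find_gaps(sample):
    print("silent from", last_seen, "to", next_seen)
```

The last timestamp before each gap is the best available estimate of when the host stopped, and the few entries just before it are the ones worth reading closely.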
-
@planedrop @probain I tried checking for power management and/or C-states in BIOS and I didn't see any settings related to those. I looked in CPU Configuration, Chipset Configuration, and didn't see anything there.
The only settings available under BIOS > Advanced > ACPI Configuration are:
PCIE Devices Power On [Disabled]
RTC Alarm Power On [By OS]
I don't see Active-State Power Management or C-States.
-
I restarted the server and watched the log files up until the crash; they are attached here. This time something definitely seems to be up: there was a run of null entries in the log file right when the crash happened:
Dec 9 12:45:16 xcp-ng-host xapi: [debug||3483 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:event.from D:66f38c9020de created by task D:9e902ea2f4f9
Dec 9 12:45:24 xcp-ng-host xapi: [debug||3484 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:session.logout D:89b6b89b97b4 created by task D:b2576741520e
Dec 9 12:45:24 xcp-ng-host xapi: [ info||3484 /var/lib/xcp/xapi|session.logout D:31f3c633c030|xapi_session] Session.destroy trackid=40fcb26a14999de91feb67ecb9771bc4
Dec 9 12:45:24 xcp-ng-host xapi: [debug||3485 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:session.slave_login D:5d434bb6da87 created by task D:b2576741520e
Dec 9 12:45:24 xcp-ng-host xapi: [ info||3485 /var/lib/xcp/xapi|session.slave_login D:91377f94f6db|xapi_session] Session.create trackid=9c3c9fb8e8cd899990ec90cc939c4a0c pool=true uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Dec 9 12:45:24 xcp-ng-host xapi: [debug||3486 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:pool.get_all D:d89558a6c493 created by task D:91377f94f6db
Dec 9 12:45:24 xcp-ng-host xapi: [debug||3487 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:event.from D:9018b4d47aa2 created by task D:b2576741520e
Dec 9 12:45:42 xcp-ng-host xapi: [debug||3490 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:session.logout D:b3c50aed0bdd created by task D:001a2b86b7e7
Dec 9 12:45:42 xcp-ng-host xapi: [ info||3490 /var/lib/xcp/xapi|session.logout D:182495298773|xapi_session] Session.destroy trackid=f7523433dad5baa1f212e9bf56450726
Dec 9 12:45:42 xcp-ng-host xapi: [debug||356 |watching networks for NBD-related changes D:001a2b86b7e7|network_event_loop] Not updating the firewall, because the set of interfaces to use for NBD did not change: []
Dec 9 12:45:47 xcp-ng-host xapi: [debug||3491 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:session.slave_login D:654bf5b32b3b created by task D:001a2b86b7e7
Dec 9 12:45:47 xcp-ng-host xapi: [ info||3491 /var/lib/xcp/xapi|session.slave_login D:966d08cb98ae|xapi_session] Session.create trackid=860c6ab7ca617a23222174cf41168464 pool=true uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Dec 9 12:45:47 xcp-ng-host xapi: [debug||3492 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:pool.get_all D:8dc884754841 created by task D:966d08cb98ae
Dec 9 12:45:47 xcp-ng-host xapi: [debug||3493 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:event.from D:a92ebd9d4e50 created by task D:001a2b86b7e7
<null><null><null><null><null><null><null><null><null><null><null><null><null><null><null><null>
The line of NULLs doesn't seem to want to show up here, so here's a screenshot of what the log file looks like in my VS Code IDE. I've also attached the log file here again.
Here is the log file trimmed to the relevant sections; you can see the NULLs on line 9135.
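Since the NULs don't paste well, a small script can report where they sit in the file instead. This is a minimal sketch with a made-up function name, `find_nul_runs`; it reads the log as raw bytes and lists every run of NUL bytes with its line number and length. As far as I know, a run of NULs at the point of a hard reset is often a side effect of the unclean shutdown (data not yet flushed to disk) rather than something a service actually logged, so treat it as a marker of where the freeze happened, not as a cause.

```python
import re

# One or more consecutive NUL bytes
NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data):
    """Return (line_number, run_length) for every run of NUL bytes in data."""
    runs = []
    for match in NUL_RUN.finditer(data):
        # Line numbers are 1-based: count newlines before the run starts
        line_no = data.count(b"\n", 0, match.start()) + 1
        runs.append((line_no, len(match.group())))
    return runs


# In-memory demo; for a real file, read it in binary mode, for example
# open("xensource_12_09.txt", "rb").read()
demo = b"first entry\nsecond entry\n" + b"\x00" * 16 + b"\nfirst entry after reboot\n"
for line_no, length in find_nul_runs(demo):
    print(f"line {line_no}: {length} NUL bytes")
```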
xensource_12_09.txt
-
Well, unfortunately I got nothin... Extremely weird indeed
-
Yeah wish I had a better response here but this is indeed odd.
Do you by chance have a PCIe ethernet card you can swap in for connectivity (and just not use the X550 ports), to test whether the X550 is causing the crashes?
It's a longshot though if I'm honest.
-
IMHO, the memtest failures point to a hardware issue, but which component? In general, I remove or disable devices one by one until it runs without any errors.
-
@olivierlambert Yeah @R2rho I am with this, it's strange to see memtest errors at all.
May be another component causing the failures though, and not the RAM itself. Possibly the board or the mem controller on the CPU.
You don't by chance have another AM4 CPU you can swap in do you?
-
Yeah defective CPU can do this, or bent pins on the motherboard too.
-
@olivierlambert Yup, I've had exactly that a few times, usually on used boards.
@R2rho if possible, however annoying, I would also take the CPU out and check for pins on the motherboard being bent with a flashlight.
-
Thank you guys for the feedback. Strangely enough, I have two of these exact same servers, as I was attempting to configure them as a pool. I installed XCP-NG on them separately and am having the exact same issue on both servers. They just lock up and stop responding. It could be a hardware issue, especially since I did see the memtest failures, but it seems weird if it's happening on both. I initially thought it was a RAM incompatibility issue because I added RAM to these after they arrived and then saw all of these issues. But I've since removed the additional RAM and gone back to what they had originally, and I'm still having the issues.
I'm probably not going to remove the CPU because I will most likely return these, but I am going to install Ubuntu and see if they continue to be problematic. If Ubuntu doesn't have any issues, then I think there's some underlying incompatibility with this ASRock Rack board that needs further diagnosing and evaluation. Either way I'll probably go with something else.
-
@planedrop @olivierlambert @probain So I installed Ubuntu 22.04 on these last night and came back to the same frozen lockup I was having with XCP-NG. It looks like I somehow received two identical servers from OnLogic that were both faulty to some degree, so this is definitely not an issue with XCP-NG. Thank you for your help. I will be processing a return on these servers and going with a different product altogether.
-
@R2rho
Faulty gear always sucks. But who would've guessed that two separate systems would produce the same problems? That is highly unlikely, but never impossible.
Good luck with the RMA.
-
@R2rho Yeah that is really surprising.
I suppose it could be some kind of wider hardware incompatibility or something, but still crazy either way.
Glad you got that somewhat sorted out though.
-
Thanks a lot for the feedback. Shit happens. We usually take hardware for granted, and we shouldn't.
-
@R2rho We have built dozens of ASRock Rack mainboard- and barebone-based systems over the past few years, starting with the X470D4U, which worked really great. Since the X570D4, it started to get messy. The B650D4U is also affected. We had random periodic reboots and freezes, mostly after some weeks or months of uptime.
Interestingly we have identical systems which have an uptime of over a year. I would say, about 60% of the systems were affected.
BIOS version and attached hardware did not really matter.
I once contacted ASRock support, but they did not know of a general problem; instead they suggested checking other components (which we also did).
We went the RMA way, and some of the replacement mainboards we got back from RMA were also faulty.
But: the most recent mainboard returning from RMA seems to work... so maybe you're lucky.
-
@dave That's pretty brutal honestly. I'm thinking about just calling it a day and moving away from ASRock servers entirely. I'm looking to set XCP-NG up on some IoT/Edge servers in short-depth racks in a factory environment, so I really liked the form factor of these from OnLogic, but I've had the worst experience, and seeing your feedback definitely makes me want to go a different direction. I'm looking at some short-depth servers from SuperMicro geared specifically for IoT/Edge that I think will work out much better.