XCP-ng

    Server Locks Up Periodically with ASRock X570D4I-2T AMD Ryzen 9 3900X and Intel X550-AT2

      R2rho

      @probain @planedrop

      I restarted the server and watched the log files up until the crash; they are attached here. This time there definitely seems to be something up: there was a bunch of null entries in the log files right when the crash happened:

      Dec  9 12:45:16 xcp-ng-host xapi: [debug||3483 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:event.from D:66f38c9020de created by task D:9e902ea2f4f9
      Dec  9 12:45:24 xcp-ng-host xapi: [debug||3484 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:session.logout D:89b6b89b97b4 created by task D:b2576741520e
      Dec  9 12:45:24 xcp-ng-host xapi: [ info||3484 /var/lib/xcp/xapi|session.logout D:31f3c633c030|xapi_session] Session.destroy trackid=40fcb26a14999de91feb67ecb9771bc4
      Dec  9 12:45:24 xcp-ng-host xapi: [debug||3485 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:session.slave_login D:5d434bb6da87 created by task D:b2576741520e
      Dec  9 12:45:24 xcp-ng-host xapi: [ info||3485 /var/lib/xcp/xapi|session.slave_login D:91377f94f6db|xapi_session] Session.create trackid=9c3c9fb8e8cd899990ec90cc939c4a0c pool=true uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
      Dec  9 12:45:24 xcp-ng-host xapi: [debug||3486 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:pool.get_all D:d89558a6c493 created by task D:91377f94f6db
      Dec  9 12:45:24 xcp-ng-host xapi: [debug||3487 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:event.from D:9018b4d47aa2 created by task D:b2576741520e
      Dec  9 12:45:42 xcp-ng-host xapi: [debug||3490 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:session.logout D:b3c50aed0bdd created by task D:001a2b86b7e7
      Dec  9 12:45:42 xcp-ng-host xapi: [ info||3490 /var/lib/xcp/xapi|session.logout D:182495298773|xapi_session] Session.destroy trackid=f7523433dad5baa1f212e9bf56450726
      Dec  9 12:45:42 xcp-ng-host xapi: [debug||356 |watching networks for NBD-related changes D:001a2b86b7e7|network_event_loop] Not updating the firewall, because the set of interfaces to use for NBD did not change: []
      Dec  9 12:45:47 xcp-ng-host xapi: [debug||3491 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:session.slave_login D:654bf5b32b3b created by task D:001a2b86b7e7
      Dec  9 12:45:47 xcp-ng-host xapi: [ info||3491 /var/lib/xcp/xapi|session.slave_login D:966d08cb98ae|xapi_session] Session.create trackid=860c6ab7ca617a23222174cf41168464 pool=true uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
      Dec  9 12:45:47 xcp-ng-host xapi: [debug||3492 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:pool.get_all D:8dc884754841 created by task D:966d08cb98ae
      Dec  9 12:45:47 xcp-ng-host xapi: [debug||3493 /var/lib/xcp/xapi|post_root|dummytaskhelper] task dispatch:event.from D:a92ebd9d4e50 created by task D:001a2b86b7e7
      <null><null><null><null><null><null><null><null><null><null><null><null><null><null><null><null>
      

      The line of nulls doesn't seem to want to show up here, so here's a screenshot of what the log file looks like in VS Code. I've also attached the log file here again.

      26bda7ba-5cea-462c-b3d4-df1c99f963de-image.png

      Here is the log file trimmed to the relevant sections. You can see the run of nulls on line 9135.
      xensource_12_09.txt
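
For anyone who wants to check their own logs for the same pattern, here is a minimal sketch (not taken from the attached files) that reports where runs of NUL bytes occur. The function names `find_nul_runs` and `scan_log` are made up for this example. It reads the file as raw bytes, because some viewers and forum software silently drop NUL characters.

```python
import re
from pathlib import Path

# One or more consecutive NUL (0x00) bytes.
NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data: bytes, min_len: int = 1):
    """Return (line_number, byte_offset, run_length) for each run of NUL bytes.

    line_number is 1-based and counts newline bytes that occur before the run.
    Runs shorter than min_len are ignored.
    """
    runs = []
    for match in NUL_RUN.finditer(data):
        length = match.end() - match.start()
        if length >= min_len:
            line_no = data.count(b"\n", 0, match.start()) + 1
            runs.append((line_no, match.start(), length))
    return runs


def scan_log(path: str, min_len: int = 1):
    """Read a log file as bytes and report its NUL runs (hypothetical helper)."""
    return find_nul_runs(Path(path).read_bytes(), min_len)
```

Calling `scan_log("xensource_12_09.txt")` on the attachment should list a run near line 9135, assuming the file still contains the raw NUL bytes. A run like this at the moment of a hard lockup often just means the last writes never made it to disk, so it tends to be a symptom of the crash rather than a clue to its cause.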

        probain

        Well, unfortunately I got nothin... Extremely weird indeed

          planedrop Top contributor

          Yeah wish I had a better response here but this is indeed odd.

          Do you by chance have a PCIe Ethernet card you can swap in for connectivity (and just not use the X550 ports), to test whether the X550 is causing the crashes?

          It's a long shot though, if I'm honest.

            olivierlambert Vates 🪐 Co-Founder CEO

            IMHO, memtest failures point to a hardware issue, but which component? In general, I remove or disable devices one by one until it runs without any errors.

              planedrop Top contributor @olivierlambert

              @olivierlambert Yeah. @R2rho I agree with this; it's strange to see memtest errors at all.

              It may be another component causing the failures, though, and not the RAM itself: possibly the board or the memory controller on the CPU.

              You don't by chance have another AM4 CPU you can swap in, do you?

                olivierlambert Vates 🪐 Co-Founder CEO

                Yeah, a defective CPU can do this, or bent pins on the motherboard.

                  planedrop Top contributor @olivierlambert

                  @olivierlambert Yup, I've had exactly that a few times, usually on used boards.

                  @R2rho if possible, however annoying, I would also take the CPU out and use a flashlight to check for bent pins on the motherboard.

                    R2rho

                    Thank you guys for the feedback. Strangely enough, I have two of these exact same servers, as I was attempting to configure them as a pool. I installed XCP-ng on them separately and am having the exact same issue on both servers: they just lock up and stop responding. It could be a hardware issue, especially since I did see the memtest failures, but it seems weird if it's happening on both. I initially thought it was a RAM incompatibility issue, because I added RAM to these after they arrived and then saw all of these issues. I've since removed the additional RAM and gone back to what they had originally, but I'm still having the issues.

                    I'm probably not going to remove the CPU because I will most likely return these, but I am going to install Ubuntu and see if they continue to be problematic. If that doesn't have any issues, then I think there's some underlying incompatibility with this ASRock Rack board that needs further diagnosis. Either way, I'll probably go with something else.

                      R2rho

                      @planedrop @olivierlambert @probain So I installed Ubuntu 22.04 on these last night and came back to the same frozen lockup I was having with XCP-ng. It looks like I somehow received two identical servers from OnLogic that were both faulty to some degree, so this is definitely not an issue with XCP-ng. Thank you for your help. I will be processing a return on these servers and going with a different product altogether.

                        probain @R2rho

                        @R2rho
                        Faulty gear always sucks. But who would've guessed that two separate systems would produce the same problems? That is highly unlikely, but never impossible.

                        Good luck with the RMA

                          planedrop Top contributor @R2rho

                          @R2rho Yeah that is really surprising.

                          I suppose it could be some kind of wider hardware incompatibility or something, but still crazy either way.

                          Glad you got that somewhat sorted out though.

                            olivierlambert Vates 🪐 Co-Founder CEO

                            Thanks a lot for the feedback. Shit happens. We usually take hardware for granted, but it isn't a given 😞

                              dave @R2rho

                              @R2rho We have built dozens of ASRock Rack mainboard- and barebone-based systems over the past few years, starting with the X470D4U, which worked really well. With the X570D4 it started to get messy. The B650D4U is also affected. We had random periodic reboots and freezes, mostly after some weeks or months of uptime.

                              Interestingly, we have identical systems with an uptime of over a year. I would say about 60% of the systems were affected.

                              BIOS version and attached hardware did not really matter.

                              I once contacted ASRock support, but they did not know of a general problem. Instead, they suggested checking other components (which we also did).

                              We went the RMA route, and even some of the replacement mainboards we got back were faulty.

                              But: the most recent mainboard returning from RMA seems to work... so maybe you're lucky 🙂

                                R2rho @dave

                                @dave That's pretty brutal, honestly. I'm thinking about just calling it a day and moving away from ASRock servers entirely. I'm looking to set XCP-ng up on some IoT/edge servers in short-depth racks in a factory environment, so I really liked the form factor of these from OnLogic. But I've had the worst experience, and seeing your feedback definitely makes me want to go a different direction. I'm looking at some short-depth servers from Supermicro geared specifically for IoT/edge that I think will work out much better.

                                  dave @R2rho

                                  @R2rho Yeah, there are Supermicro AM5 systems that can handle a decent amount of load, for example ones based on the H13SAE-MF, such as:

                                  https://www.supermicro.com/de/products/system/mainstream/1u/as-1015a-mt
                                  (with less depth)

                                  They seem to be stable, but we have a small issue with onboard graphics at the moment:

                                  https://xcp-ng.org/forum/topic/9976/black-screen-after-install-on-supermicro-h13sae-mf-with-ryzen-9950x/3?_=1734419502978
