XCP-ng
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Groups
    • Register
    • Login

    Help: Clean shutdown of Host, now no network or VMs are detected

    Scheduled Pinned Locked Moved Solved Compute
    31 Posts 6 Posters 5.8k Views 5 Watching
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • C Offline
      CodeMercenary @Danp
      last edited by

      Danp I do see quite a few errors in the xensource log file and some of them were from before the reboot. From what I can tell the errors prior to the reboot are issues with writing backup files to a remote store, not related to the OS functionality.

      These are the errors I see in the logs from the day this started. I removed the duplications since a lot of these were repeated, some identically, some with minor variations like IDs or GUIDs.

      Jul  1 16:05:35 host xenopsd-xc: [error||8 ||xenops] sending QMP command '{"execute":"query-migratable"}' to domain 19: (Sys_error "Connection reset by peer") (File "xc/device_common.ml", line 557, characters 35-42)
      Jul  1 16:05:39 host xenopsd-xc: [error||18 |Parallel:task=42294.atoms=9.(VBD.epoch_end vm=c6536cfd-6808-0644-fd06-c65b532cd8c1)|xenops] Failed to read /vm/c6536cfd-6808-0644-fd06-c65b532cd8c1/domains: has this domain already been cleaned up?
      Jul  1 16:05:39 host xenopsd-xc: [error||36 ||xenops] Failed to read /vm/c6536cfd-6808-0644-fd06-c65b532cd8c1/domains: has this domain already been cleaned up?
      Jul  1 16:06:51 host xapi: [error||4432483 :::80||backtrace] session.login_with_password D:<deleted> failed with exception Server_error(HOST_IS_SLAVE, [ aaa.bbb.ccc.ddd ])
      Jul  1 16:06:51 host xapi: [error||4432483 :::80||backtrace] Raised Server_error(HOST_IS_SLAVE, [ aaa.bbb.ccc.ddd ])
      Jul  1 16:06:51 host xapi: [error||4432483 :::80||backtrace] 1/8 xapi Raised at file ocaml/xapi/xapi_session.ml, line 517
      
      Jul  1 22:01:40 host xapi: [error||839 ||backtrace] Raised Server_error(SESSION_INVALID, [ OpaqueRef:35f18bea-f3b4-4536-b80b-71e09d35d713 ])
      Jul  1 22:01:40 host xapi: [error||839 ||backtrace] 1/4 xapi Raised at file ocaml/xapi-client/client.ml, line 7
      Jul  1 22:01:40 host xapi: [error||839 ||xenops] events_from_xapi: missing from the cache: [ 9b7a8f05-aefb-4ec5-9e44-410502d90bc2 ]
      
      Jul  1 22:09:06 host xapi: [error||4535668 :::80||backtrace] host.prepare_for_poweroff R:a7a27b85c8b5 failed with exception Server_error(VDI_IN_USE, [ OpaqueRef:4ef285f4-11de-4650-915e-b8972e1c2889; OpaqueRef:68153955-508d-4d87-bb22-69993c8e61b5; OpaqueRef:e117a42d-cbd7-436d-b615-64da01fd366f; OpaqueRef:3e0d09b4-3b4b-4be9-96a0-31582323867a; OpaqueRef:968e725d-05cc-4f0f-9646-bc0319d53eba ])
      Jul  1 22:09:06 host xapi: [error||4535668 :::80||backtrace] Raised Server_error(VDI_IN_USE, [ OpaqueRef:4ef285f4-11de-4650-915e-b8972e1c2889; OpaqueRef:68153955-508d-4d87-bb22-69993c8e61b5; OpaqueRef:e117a42d-cbd7-436d-b615-64da01fd366f; OpaqueRef:3e0d09b4-3b4b-4be9-96a0-31582323867a; OpaqueRef:968e725d-05cc-4f0f-9646-bc0319d53eba ])
      Jul  1 22:09:06 host xapi: [error||4535668 :::80||backtrace] 1/6 xapi Raised at file ocaml/xapi/rbac.ml, line 205
      Jul  1 22:18:50 host xenopsd-xc: [error||11 |events|task_server] Task 0 failed; Xenops_interface.Xenopsd_error([S(Does_not_exist);[S(VM);S(9b7a8f05-aefb-4ec5-9e44-410502d90bc2/config)]])
      
      Jul  1 22:18:51 host xapi: [error||8 /var/lib/xcp/xapi|session.login_with_password D:a9c08f7ce764|master_connection] Caught Master_connection.Goto_handler
      Jul  1 22:18:51 host xapi: [error||8 /var/lib/xcp/xapi|session.login_with_password D:a9c08f7ce764|master_connection] Connection to master died. I will continue to retry indefinitely (supressing future logging of this message).
      
      Jul  1 22:22:50 host xapi: [error||19 /var/lib/xcp/xapi||cli] Converting exception HOST_HAS_NO_MANAGEMENT_IP: [  ] into a CLI response
      Jul  1 22:22:50 host xapi: [error||19 /var/lib/xcp/xapi||backtrace] Raised Server_error(HOST_HAS_NO_MANAGEMENT_IP, [  ])
      Jul  1 22:22:50 host xapi: [error||19 /var/lib/xcp/xapi||backtrace] 1/1 xapi Raised at file (Thread 19 has no backtrace table. Was with_backtraces called?, line 0
      
      Jul  2 02:58:16 host xapi: [error||0 ||smint] SM.feature: unknown feature ATOMIC_PAUSE
      Jul  2 02:58:17 host xapi: message repeated 4 times: [ [error||0 ||smint] SM.feature: unknown feature ATOMIC_PAUSE]
      Jul  2 02:58:17 host xapi: [error||0 ||smint] SM.feature: unknown feature SR_CACHING
      
      Jul  2 02:58:23 host xcp-networkd: [error||58 |dbsync (update_env) R:981fff2b6906|network_utils] Error in read one line of file: /sys/class/net/eth2/device/sriov_totalvfs, exception Unix.Unix_error(Unix.ENOENT, "open", "/sys/class/net/eth2/device/sriov_totalvfs")\x0ARaised by primitive operation at file "lib/xapi-stdext-unix/unixext.ml", line 68, characters 11-40\x0ACalled from file "lib/xapi-stdext-unix/unixext.ml" (inlined), line 155, characters 31-83\x0ACalled from file "lib/xapi-stdext-unix/unixext.ml", line 157, characters 47-73\x0ACalled from file "lib/network_utils.ml", line 155, characters 6-33\x0A
      Jul  2 02:58:24 host xapi: [error||0 |dbsync (update_env) R:981fff2b6906|xapi] Failed to create NVidia compat config_file: (Sys_error "/usr/share/nvidia/vgx: No such file or directory")\x0ARaised by primitive operation at file "ocaml/xapi/xapi_vgpu_type.ml", line 1023, characters 35-57\x0ACalled from file "ocaml/xapi/xapi_vgpu_type.ml", line 1039, characters 8-32\x0A\x0A
      Jul  2 02:58:38 host xapi: [error||388 |xapi events D:624974770252|xenops] events_from_xapi: missing from the cache: [ 9b7a8f05-aefb-4ec5-9e44-410502d90bc2 ]
      

      These are some of the errors I'm seeing in the logs now, also cleaned up a bit.

      Jul  8 18:39:13 host xapi: [error||2700425 :::80||backtrace] session.login_with_password D:<deleted> failed with exception Server_error(HOST_IS_SLAVE, [ aaa.bbb.ccc.ddd ])
      Jul  8 18:39:13 host xapi: [error||2700425 :::80||backtrace] Raised Server_error(HOST_IS_SLAVE, [ aaa.bbb.ccc.ddd ])
      Jul  8 18:39:13 host xapi: [error||2700425 :::80||backtrace] 1/8 xapi Raised at file ocaml/xapi/xapi_session.ml, line 517
      Jul  8 21:39:14 host xenopsd-xc: [error||8 ||xenops] sending QMP command '{"execute":"query-migratable"}' to domain 8: (Sys_error "Connection reset by peer") (File "xc/device_common.ml", line 557, characters 35-42)
      Jul  8 21:39:16 host xenopsd-xc: [error||22 |Parallel:task=20491.atoms=1.(VBD.epoch_end vm=<guid>)|xenops] Failed to read /vm/<guid>/domains: has this domain already been cleaned up?
      Jul  8 21:39:16 host xenopsd-xc: [error||32 ||xenops] Failed to read /vm/<guid>/domains: has this domain already been cleaned up?
      Jul  8 22:05:17 host xapi: [error||2720118 :::80||backtrace] host.call_plugin R:8bd9f27c97ae failed with exception Server_error(-1, [ The updater plugin is busy (current operation: check_update); ; Traceback (most recent call last):\x0A  File "/etc/xapi.d/plugins/xcpngutils/__init__.py", line 119, in wrapper\x0A    return func(*args, **kwds)\x0A  File "/etc/xapi.d/plugins/updater.py", line 95, in decorator\x0A    with OperationLocker(*pid_args, **pid_kwargs):\x0A  File "/etc/xapi.d/plugins/xcpngutils/filelocker.py", line 55, in __enter__\x0A    self.lock()\x0A  File "/etc/xapi.d/plugins/updater.py", line 55, in lock\x0A    self._raise_busy()\x0A  File "/etc/xapi.d/plugins/updater.py", line 74, in _raise_busy\x0A    raise OperationException('The updater plugin is busy (current operation: {})'.format(op))\x0AOperationException: The updater plugin is busy (current operation: check_update)\x0A ])
      Jul  8 22:05:17 host xapi: [error||2720118 :::80||backtrace] Raised Server_error(-1, [ The updater plugin is busy (current operation: check_update); ; Traceback (most recent call last):\x0A  File "/etc/xapi.d/plugins/xcpngutils/__init__.py", line 119, in wrapper\x0A    return func(*args, **kwds)\x0A  File "/etc/xapi.d/plugins/updater.py", line 95, in decorator\x0A    with OperationLocker(*pid_args, **pid_kwargs):\x0A  File "/etc/xapi.d/plugins/xcpngutils/filelocker.py", line 55, in __enter__\x0A    self.lock()\x0A  File "/etc/xapi.d/plugins/updater.py", line 55, in lock\x0A    self._raise_busy()\x0A  File "/etc/xapi.d/plugins/updater.py", line 74, in _raise_busy\x0A    raise OperationException('The updater plugin is busy (current operation: {})'.format(op))\x0AOperationException: The updater plugin is busy (current operation: check_update)\x0A ])
      Jul  8 22:05:17 host xapi: [error||2720118 :::80||backtrace] 1/6 xapi Raised at file ocaml/xapi/rbac.ml, line 205
      
      1 Reply Last reply Reply Quote 0
      • G Offline
        guiltykeyboard @Danp
        last edited by

        Danp

        Good advice.

        Occasionally our older servers crap themselves with this same issue and an emergency network reset on the pool master and then rebooting the other hosts once the pool master is back up usually resolves the issue.

        C 1 Reply Last reply Reply Quote 0
        • C Offline
          CodeMercenary @guiltykeyboard
          last edited by

          guiltykeyboard

          Thanks for the input. Good to know I might need to do the reset on the pool master, not just the host that's impacted. Fortunately, the impacted host is not too big a deal for me to reboot. The pool master will be more annoying since I don't use shared storage and my VMs are big enough that I'd rather not migrate them. Not a big deal to do it some evening or over a weekend.

          G 1 Reply Last reply Reply Quote 0
          • G Offline
            guiltykeyboard @CodeMercenary
            last edited by

            CodeMercenary Try an emergency network reset on the server that is directly having the issue and if that doesn't work, try reseting the toolstack on the pool master and then try a network reset there as well.

            Recommend that you set up two hosts with shared storage between the two so that you can live-migrate your VM's to the other host and elect it as master temporarily when you do maintenance.

            C 1 Reply Last reply Reply Quote 0
            • C Offline
              CodeMercenary @guiltykeyboard
              last edited by

              guiltykeyboard Thank you for the guidance. I'll try that. I still think it's super weird that the server did get back into the pool and seems to be working fine. It's just that the physical console still says it can't find the pool master and has no NICs installed.

              The main reason I don't have shared storage is due to concerns over cost. Years ago, I was thinking of setting up a SAN for my VMware servers and they were crazy expensive, way over the budget of the small company I work for. I think I stopped researching when I ran across an article that was titled something like "How to build your own SAN for less than $10,000" and I realized I was way out of my league.

              I do have a bigger budget now though I would not be able to spend $10k to build shared storage. Any recommendations of ways to do reliable shared storage without it being crazy expensive? One thing that helps now is that each of my servers has dual 10Gbe ethernet, something I didn't have the last time I looked into this.

              I've been keeping my eye on XOSTOR lately since I have storage in all of my servers. Unfortunately, from what I've seen, it requires three servers in the cluster and two of my servers have SSDs and the other has only HDDs so I suspect that third one would slow down the other two since a write isn't considered complete until all servers are done writing. XOSTORE feels safer than shared storage since shared storage would be a single point of failure. (Admittedly, right now I also have multiple single points of failure since I use local storage).

              G 1 Reply Last reply Reply Quote 0
              • G Offline
                guiltykeyboard @CodeMercenary
                last edited by

                CodeMercenary Have you restarted the toolstack now that it is doing what it needs to do?

                That might make the console show what it needs.

                Also try exiting out of the console and then starting it back up with xsconsole.

                C 1 Reply Last reply Reply Quote 0
                • C Offline
                  CodeMercenary @guiltykeyboard
                  last edited by

                  guiltykeyboard Restarting the toolstack didn't fix it.

                  When I use ctrl-C to exit the console it says it's resetting the console and the new instance still says it can't reach the pool master.

                  Looking like the network reset on the troublesome host is the next step, then maybe a reboot. Then I can try your suggestions on the pool master. Just a little gun-shy with messing with the master because of this happening to a server that should have just worked.

                  G 1 Reply Last reply Reply Quote 0
                  • G Offline
                    guiltykeyboard @CodeMercenary
                    last edited by

                    CodeMercenary Try the toolstack reset on that host.

                    Maybe do a toolstack reset on the pool master as well - which doesn't affect running VM's.

                    If that doesn't work, try an emergency network reset on the host having trouble.

                    C 1 Reply Last reply Reply Quote 0
                    • C Offline
                      CodeMercenary @guiltykeyboard
                      last edited by

                      guiltykeyboard Restarting the toolstack on the pool master did not fix it so I went ahead and did the emergency network reset on the host having the issue. It came up normally after the reboot. The emergency reset was super easy, Thank you for helping me with this. Still learning XCP-ng and trying to understand what's happening before trying things to fix it, so in the future I'll know more and hopefully be able to help other people.

                      G 1 Reply Last reply Reply Quote 0
                      • G Offline
                        guiltykeyboard @CodeMercenary
                        last edited by

                        CodeMercenary Glad it worked out for you.

                        1 Reply Last reply Reply Quote 0
                        • C Offline
                          CodeMercenary
                          last edited by

                          Another bit of strangeness. I just noticed that some older backup disaster recovery VMs were booted up on my pool master host. I looked at the stats and they all booted 4 hours ago, right around when I tried restarting the toolstack on the pool master. All of them were set to auto-start, an odd setting I think for disaster recovery VMs unless there is supposed to be something in place to stop them from auto-starting. Easy enough to shut down but kinda strange that they booted. Surely disaster recovery VMs aren't supposed to power up on restart, right?

                          1 Reply Last reply Reply Quote 0
                          • olivierlambertO Offline
                            olivierlambert Vates 🪐 Co-Founder CEO
                            last edited by

                            They shouldn't yes. Were you using DR or CR?

                            C 1 Reply Last reply Reply Quote 0
                            • olivierlambertO olivierlambert marked this topic as a question on
                            • olivierlambertO olivierlambert has marked this topic as solved on
                            • C Offline
                              CodeMercenary @olivierlambert
                              last edited by

                              olivierlambert It was DR. I was testing DR a while ago and after running it once I disabled the backup job so these backups have just been sitting on the server. I don't think I've rebooted that server since running that backup.

                              1 Reply Last reply Reply Quote 1
                              • nick.lloydN nick.lloyd referenced this topic on
                              • C Offline
                                CodeMercenary
                                last edited by

                                Just want to document that this happened again, on the same host.

                                My XO (from source) that manages my backups, ran the backups last night. I know it was active up until at least around 5:30am but by the time I got into the office it was inaccessible by browser, ssh and ping. Other XO instances showed that it was running but the Console tab didn't give me access to its console.

                                A few hours later I found that other VMs on that same host had become inaccessible in the same fashion and also had no console showing in XO.

                                An hour or two later I found that XO showed the host as being missing from the pool, which it had not been earlier in the day. When I checked the physical console for that host, I found it had red text "<hostname> login: root (automatic login)" and did not respond to the keyboard other than putting more red text on screen of whatever I typed. I hit ctrl-alt-del and it didn't seem to do anything so I typed random things, then XCP-ng started rebooting. I'm guessing it was 60 to 120 seconds after my first ctrl-alt-delete.

                                When it came back up it could not find the pool master and said it had no network interfaces. I was able to solve it by doing another emergency network reset.

                                Would be nice if this wouldn't happen, makes me super nervous about stability. Thankful that the two times this has happened, it was on my least mission critical server. However, it's the server that handles backups so it's still stressful to have it go down. Also makes me wonder if there might be something wrong with that server's hardware.

                                1 Reply Last reply Reply Quote 0
                                • olivierlambertO Offline
                                  olivierlambert Vates 🪐 Co-Founder CEO
                                  last edited by

                                  You need to check the logs because obviously it's not normal. Check if you have a /var/crash folder with some stuff in it. Otherwise, the usual logs to check (on both xen and dom0 side).

                                  C 1 Reply Last reply Reply Quote 0
                                  • C Offline
                                    CodeMercenary @olivierlambert
                                    last edited by

                                    olivierlambert I do have a /var/crash folder but it has nothing it in except a file from a year ago named .sacrificial-space-for-logs.

                                    The Xen logs are verbose. Any suggestions of text to grep to find what I'm looking for? Other than the obvious error I should search for.

                                    Currently looking in xensource.log.* for error lines to see if I can figure anything out.

                                    1 Reply Last reply Reply Quote 0
                                    • olivierlambertO Offline
                                      olivierlambert Vates 🪐 Co-Founder CEO
                                      last edited by

                                      kernel log and Xen log, check for anything around the time of the issue

                                      C 1 Reply Last reply Reply Quote 0
                                      • C Offline
                                        CodeMercenary @olivierlambert
                                        last edited by

                                        olivierlambert Having trouble finding the reboot in the log files because I don't know what to look for. I have nearly 1GB of log files from that day and unfortunately, I don't recall when the reboot happened. Is there something I can grep for in the log that would indicate the reboot and I can backtrack from there?

                                        Tried feeding the GB of log files to Phi4 in my Ollama server but so far it has not been any help finding anything either. Well, it found how to make my office a lot louder by running the server fans at full speed for a few minutes but that wasn't helpful.

                                        C 1 Reply Last reply Reply Quote 0
                                        • C Offline
                                          CodeMercenary @CodeMercenary
                                          last edited by

                                          Looking through the kern.log files I found stuff I thought might be interesting but as I scroll, I see it happening a lot so I then wonder if it's normal. I see sets of three events:
                                          Out of memory: Kill process ###
                                          Killed process ###
                                          oom_reaper: reaped process ###

                                          I wonder if it was starving for memory and having to kill off processes to survive then eventually died. I see these 90+ times in one log file and over 200 times in another. Don't know if this is just normal activity or indication of a problem.

                                          1 Reply Last reply Reply Quote 0
                                          • K Offline
                                            kassebasse @CodeMercenary
                                            last edited by kassebasse

                                            CodeMercenary I had a simillar issue to you, check the time and the date on the host.

                                            If this might be any help to you: https://xcp-ng.org/forum/topic/10687/after-an-update-the-nic-s-has-disappeared-but-still-works-somehow/10?_=1743764792775

                                            1 Reply Last reply Reply Quote 1
                                            • First post
                                              Last post