Help: Clean shutdown of Host, now no network or VMs are detected
-
I cleanly shut down one of my three hosts (not the pool master) to install DIMM blanks I was missing. No hardware changes. When I booted it back up, XO didn't show the host as powered on. When I check the host directly via the physical console, it says it has no running VMs. When I use All VMs to see whether it has any that just aren't running, I get the error
"'NoneType' object has no attribute 'xenapi'"
The physical console reports that it has no network interfaces, yet I can see from the switch-side management console that the network is connected, I can ping all the IPs I expect to be live, and I can SSH to the host. It says the pool master is unreachable even though it's connected to the same network switch as the master, the same one they've always been connected to. It also says host CPU usage and memory usage are "<Unavailable>".
The VMs are installed in local SRs on arrays separate from the mirror where XCP-ng 8.2.1 is installed.
The host is fully patched; I think the most recent patches went on last week or the week before, I can't quite remember.
I can SSH into the host, and I also have access to the physical console.
The pool is still working because the pool master is a different host.
I have backups of the pool configuration but they were made using the XO (from source) installed on the host that is now down. I have one other XO install (from source) that isn't on that host. Those backups of the config are made to multiple remotes so I should be able to get to them if needed.
Fortunately, it's my least vital host; still, I'm rather freaked out that cleanly shutting down and restarting a host can put it into an unusable state.
How do I debug this and get this host back online? Does it make sense to just try rebooting it, or does that error indicate something is very wrong that will still be wrong after the reboot?
-
@CodeMercenary I'd certainly give it a reboot; it can't hurt if there are no VMs on it. There were patches released a few days ago as well!
-
@Nick-085 I updated all three hosts recently, and the other two say they are up to date, so I assume this one would be too.
There are VMs on it; it just can't see them right now for whatever reason. They aren't production critical, but a couple of them would really suck to lose.
-
@CodeMercenary Ahh okay. One of my hosts dropped its network interfaces too, similar to what you stated. Unfortunately I had to shut my hosts down for a power upgrade at my colocation, but I should have them back up tonight and I'll report back what I find. What's the hardware on the affected host (just in case I find something similar between our setups)?
-
@Nick-085 That particular machine is a PowerEdge R740xd with four 1GbE ports and two 10GbE ports.
Well, here's an interesting development. That host is now back in the pool. None of the VMs are running, including the two that are marked to auto power on.
Wondering if there's a way to find out what happened and why it's suddenly working but didn't start up right.
-
@Nick-085 More weirdness. The physical console still says it can't contact the pool master and that it has no network interfaces. XO shows the host in the pool with its current running stats, but the host itself says it's not talking to the pool and has no network connection. From the XO interface you'd think there's nothing wrong, which was not the case last night when XO showed the host as off.
-
@CodeMercenary Have you considered performing an emergency network reset on the problematic host?
https://xcp-ng.org/docs/networking.html#emergency-network-reset
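If it comes to that and you'd rather drive it over SSH than through xsconsole, the same reset is exposed by the xe-reset-networking script on the host. A rough sketch (flags from memory, so verify against the doc above; the master IP and NIC name below are placeholders for your own values):

# run on the affected pool member; the host has to be rebooted afterward
xe-reset-networking --master=<pool-master-IP> --device=eth0 --mode=dhcp

For a static management IP, you'd use --mode=static together with --ip, --netmask, --gateway, and --dns.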
-
@Danp I have considered it, yes. Because this is not a business critical server, I didn't want to mess with anything in case there was a way to figure out why it did this. Otherwise, I will never want to power cycle any of my servers again, especially the production servers. Had this been one of our production servers then I would have had a VERY long night last night.
If that's the solution to the problem then that's what I'll do but I'd really like to have some idea if this is just expected behavior or if something is actually wrong.
-
@CodeMercenary The only way to know for sure is to review the logs. It's possible that there were undetected issues prior to the shutdown, especially if the host hadn't been power cycled in some time.
-
@Danp It was power cycled 17 days prior to this issue, when I added more RAM and upgraded the CPUs.
I'm new to XCP-ng; any advice on which logs to check and what to look for? I'm looking at the troubleshooting section of the docs, I just didn't know if there are particular logs I should check. I'm looking for low-hanging fruit, not to spend the rest of the week chasing this or to absorb a bunch of your time. It would just be nice to find the cause if it's not buried too deeply.
-
I'd probably start with
/var/log/xensource.log
since it appears to be XAPI related. You may need to look at one of the related files (ie: xensource.log.1, xensource.log.2.gz, etc) if the logs have already been rotated.
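If you just want to skim for the obvious failures first, something like this should do it (zgrep also covers the already-rotated .gz files):

# show only the error lines from the current and rotated logs
grep -i error /var/log/xensource.log
zgrep -i error /var/log/xensource.log.*.gz
-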
@Danp I do see quite a few errors in the xensource log file, and some of them were from before the reboot. From what I can tell, the errors prior to the reboot are issues with writing backup files to a remote store, not related to OS functionality.
These are the errors I see in the logs from the day this started. I removed the duplications since a lot of these were repeated, some identically, some with minor variations like IDs or GUIDs.
Jul 1 16:05:35 host xenopsd-xc: [error||8 ||xenops] sending QMP command '{"execute":"query-migratable"}' to domain 19: (Sys_error "Connection reset by peer") (File "xc/device_common.ml", line 557, characters 35-42)
Jul 1 16:05:39 host xenopsd-xc: [error||18 |Parallel:task=42294.atoms=9.(VBD.epoch_end vm=c6536cfd-6808-0644-fd06-c65b532cd8c1)|xenops] Failed to read /vm/c6536cfd-6808-0644-fd06-c65b532cd8c1/domains: has this domain already been cleaned up?
Jul 1 16:05:39 host xenopsd-xc: [error||36 ||xenops] Failed to read /vm/c6536cfd-6808-0644-fd06-c65b532cd8c1/domains: has this domain already been cleaned up?
Jul 1 16:06:51 host xapi: [error||4432483 :::80||backtrace] session.login_with_password D:<deleted> failed with exception Server_error(HOST_IS_SLAVE, [ aaa.bbb.ccc.ddd ])
Jul 1 16:06:51 host xapi: [error||4432483 :::80||backtrace] Raised Server_error(HOST_IS_SLAVE, [ aaa.bbb.ccc.ddd ])
Jul 1 16:06:51 host xapi: [error||4432483 :::80||backtrace] 1/8 xapi Raised at file ocaml/xapi/xapi_session.ml, line 517
Jul 1 22:01:40 host xapi: [error||839 ||backtrace] Raised Server_error(SESSION_INVALID, [ OpaqueRef:35f18bea-f3b4-4536-b80b-71e09d35d713 ])
Jul 1 22:01:40 host xapi: [error||839 ||backtrace] 1/4 xapi Raised at file ocaml/xapi-client/client.ml, line 7
Jul 1 22:01:40 host xapi: [error||839 ||xenops] events_from_xapi: missing from the cache: [ 9b7a8f05-aefb-4ec5-9e44-410502d90bc2 ]
Jul 1 22:09:06 host xapi: [error||4535668 :::80||backtrace] host.prepare_for_poweroff R:a7a27b85c8b5 failed with exception Server_error(VDI_IN_USE, [ OpaqueRef:4ef285f4-11de-4650-915e-b8972e1c2889; OpaqueRef:68153955-508d-4d87-bb22-69993c8e61b5; OpaqueRef:e117a42d-cbd7-436d-b615-64da01fd366f; OpaqueRef:3e0d09b4-3b4b-4be9-96a0-31582323867a; OpaqueRef:968e725d-05cc-4f0f-9646-bc0319d53eba ])
Jul 1 22:09:06 host xapi: [error||4535668 :::80||backtrace] Raised Server_error(VDI_IN_USE, [ OpaqueRef:4ef285f4-11de-4650-915e-b8972e1c2889; OpaqueRef:68153955-508d-4d87-bb22-69993c8e61b5; OpaqueRef:e117a42d-cbd7-436d-b615-64da01fd366f; OpaqueRef:3e0d09b4-3b4b-4be9-96a0-31582323867a; OpaqueRef:968e725d-05cc-4f0f-9646-bc0319d53eba ])
Jul 1 22:09:06 host xapi: [error||4535668 :::80||backtrace] 1/6 xapi Raised at file ocaml/xapi/rbac.ml, line 205
Jul 1 22:18:50 host xenopsd-xc: [error||11 |events|task_server] Task 0 failed; Xenops_interface.Xenopsd_error([S(Does_not_exist);[S(VM);S(9b7a8f05-aefb-4ec5-9e44-410502d90bc2/config)]])
Jul 1 22:18:51 host xapi: [error||8 /var/lib/xcp/xapi|session.login_with_password D:a9c08f7ce764|master_connection] Caught Master_connection.Goto_handler
Jul 1 22:18:51 host xapi: [error||8 /var/lib/xcp/xapi|session.login_with_password D:a9c08f7ce764|master_connection] Connection to master died. I will continue to retry indefinitely (supressing future logging of this message).
Jul 1 22:22:50 host xapi: [error||19 /var/lib/xcp/xapi||cli] Converting exception HOST_HAS_NO_MANAGEMENT_IP: [ ] into a CLI response
Jul 1 22:22:50 host xapi: [error||19 /var/lib/xcp/xapi||backtrace] Raised Server_error(HOST_HAS_NO_MANAGEMENT_IP, [ ])
Jul 1 22:22:50 host xapi: [error||19 /var/lib/xcp/xapi||backtrace] 1/1 xapi Raised at file (Thread 19 has no backtrace table. Was with_backtraces called?, line 0
Jul 2 02:58:16 host xapi: [error||0 ||smint] SM.feature: unknown feature ATOMIC_PAUSE
Jul 2 02:58:17 host xapi: message repeated 4 times: [ [error||0 ||smint] SM.feature: unknown feature ATOMIC_PAUSE]
Jul 2 02:58:17 host xapi: [error||0 ||smint] SM.feature: unknown feature SR_CACHING
Jul 2 02:58:23 host xcp-networkd: [error||58 |dbsync (update_env) R:981fff2b6906|network_utils] Error in read one line of file: /sys/class/net/eth2/device/sriov_totalvfs, exception Unix.Unix_error(Unix.ENOENT, "open", "/sys/class/net/eth2/device/sriov_totalvfs")\x0ARaised by primitive operation at file "lib/xapi-stdext-unix/unixext.ml", line 68, characters 11-40\x0ACalled from file "lib/xapi-stdext-unix/unixext.ml" (inlined), line 155, characters 31-83\x0ACalled from file "lib/xapi-stdext-unix/unixext.ml", line 157, characters 47-73\x0ACalled from file "lib/network_utils.ml", line 155, characters 6-33\x0A
Jul 2 02:58:24 host xapi: [error||0 |dbsync (update_env) R:981fff2b6906|xapi] Failed to create NVidia compat config_file: (Sys_error "/usr/share/nvidia/vgx: No such file or directory")\x0ARaised by primitive operation at file "ocaml/xapi/xapi_vgpu_type.ml", line 1023, characters 35-57\x0ACalled from file "ocaml/xapi/xapi_vgpu_type.ml", line 1039, characters 8-32\x0A\x0A
Jul 2 02:58:38 host xapi: [error||388 |xapi events D:624974770252|xenops] events_from_xapi: missing from the cache: [ 9b7a8f05-aefb-4ec5-9e44-410502d90bc2 ]
These are some of the errors I'm seeing in the logs now, also cleaned up a bit.
Jul 8 18:39:13 host xapi: [error||2700425 :::80||backtrace] session.login_with_password D:<deleted> failed with exception Server_error(HOST_IS_SLAVE, [ aaa.bbb.ccc.ddd ])
Jul 8 18:39:13 host xapi: [error||2700425 :::80||backtrace] Raised Server_error(HOST_IS_SLAVE, [ aaa.bbb.ccc.ddd ])
Jul 8 18:39:13 host xapi: [error||2700425 :::80||backtrace] 1/8 xapi Raised at file ocaml/xapi/xapi_session.ml, line 517
Jul 8 21:39:14 host xenopsd-xc: [error||8 ||xenops] sending QMP command '{"execute":"query-migratable"}' to domain 8: (Sys_error "Connection reset by peer") (File "xc/device_common.ml", line 557, characters 35-42)
Jul 8 21:39:16 host xenopsd-xc: [error||22 |Parallel:task=20491.atoms=1.(VBD.epoch_end vm=<guid>)|xenops] Failed to read /vm/<guid>/domains: has this domain already been cleaned up?
Jul 8 21:39:16 host xenopsd-xc: [error||32 ||xenops] Failed to read /vm/<guid>/domains: has this domain already been cleaned up?
Jul 8 22:05:17 host xapi: [error||2720118 :::80||backtrace] host.call_plugin R:8bd9f27c97ae failed with exception Server_error(-1, [ The updater plugin is busy (current operation: check_update); ; Traceback (most recent call last):\x0A File "/etc/xapi.d/plugins/xcpngutils/__init__.py", line 119, in wrapper\x0A return func(*args, **kwds)\x0A File "/etc/xapi.d/plugins/updater.py", line 95, in decorator\x0A with OperationLocker(*pid_args, **pid_kwargs):\x0A File "/etc/xapi.d/plugins/xcpngutils/filelocker.py", line 55, in __enter__\x0A self.lock()\x0A File "/etc/xapi.d/plugins/updater.py", line 55, in lock\x0A self._raise_busy()\x0A File "/etc/xapi.d/plugins/updater.py", line 74, in _raise_busy\x0A raise OperationException('The updater plugin is busy (current operation: {})'.format(op))\x0AOperationException: The updater plugin is busy (current operation: check_update)\x0A ])
Jul 8 22:05:17 host xapi: [error||2720118 :::80||backtrace] Raised Server_error(-1, [ The updater plugin is busy (current operation: check_update); ; Traceback (most recent call last):\x0A File "/etc/xapi.d/plugins/xcpngutils/__init__.py", line 119, in wrapper\x0A return func(*args, **kwds)\x0A File "/etc/xapi.d/plugins/updater.py", line 95, in decorator\x0A with OperationLocker(*pid_args, **pid_kwargs):\x0A File "/etc/xapi.d/plugins/xcpngutils/filelocker.py", line 55, in __enter__\x0A self.lock()\x0A File "/etc/xapi.d/plugins/updater.py", line 55, in lock\x0A self._raise_busy()\x0A File "/etc/xapi.d/plugins/updater.py", line 74, in _raise_busy\x0A raise OperationException('The updater plugin is busy (current operation: {})'.format(op))\x0AOperationException: The updater plugin is busy (current operation: check_update)\x0A ])
Jul 8 22:05:17 host xapi: [error||2720118 :::80||backtrace] 1/6 xapi Raised at file ocaml/xapi/rbac.ml, line 205
-
Good advice.
Occasionally our older servers crap themselves with this same issue; an emergency network reset on the pool master, followed by rebooting the other hosts once the master is back up, usually resolves it.
-
Thanks for the input. Good to know I might need to do the reset on the pool master, not just the host that's impacted. Fortunately, the impacted host is not too big a deal for me to reboot. The pool master will be more annoying since I don't use shared storage and my VMs are big enough that I'd rather not migrate them. Not a big deal to do it some evening or over a weekend.
-
@CodeMercenary Try an emergency network reset on the server that is directly having the issue; if that doesn't work, try restarting the toolstack on the pool master and then a network reset there as well.
I'd recommend setting up two hosts with shared storage between them so that you can live-migrate your VMs to the other host and elect it as master temporarily when you do maintenance.
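For the temporary master election, xe has a built-in command; roughly something like this (the UUID below is a placeholder for whichever host you're promoting):

xe host-list
# promote the other host while you do maintenance on the current master
xe pool-designate-new-master host-uuid=<uuid-of-host-to-promote>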
-
@guiltykeyboard Thank you for the guidance. I'll try that. I still think it's super weird that the server did get back into the pool and seems to be working fine. It's just that the physical console still says it can't find the pool master and has no NICs installed.
The main reason I don't have shared storage is cost. Years ago, I looked into setting up a SAN for my VMware servers, and they were crazy expensive, way over the budget of the small company I work for. I think I stopped researching when I ran across an article titled something like "How to build your own SAN for less than $10,000" and realized I was way out of my league.
I do have a bigger budget now, though I still wouldn't be able to spend $10k to build shared storage. Any recommendations for reliable shared storage that isn't crazy expensive? One thing that helps now is that each of my servers has dual 10GbE Ethernet, something I didn't have the last time I looked into this.
I've been keeping my eye on XOSTOR lately since I have storage in all of my servers. Unfortunately, from what I've seen, it requires three servers in the cluster, and two of my servers have SSDs while the other has only HDDs, so I suspect that third one would slow down the other two since a write isn't considered complete until all servers are done writing. XOSTOR feels safer than shared storage, since shared storage would be a single point of failure. (Admittedly, right now I also have multiple single points of failure since I use local storage.)
-
@CodeMercenary Have you restarted the toolstack now that it is doing what it needs to do?
That might make the console show what it needs.
Also try exiting out of the console and then starting it back up with xsconsole.
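From an SSH session, that would be roughly the following (a toolstack restart doesn't touch running VMs):

# restart the toolstack on this host, then relaunch the console
xe-toolstack-restart
xsconsole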
-
@guiltykeyboard Restarting the toolstack didn't fix it.
When I use Ctrl-C to exit the console, it says it's resetting the console, and the new instance still says it can't reach the pool master.
It's looking like the network reset on the troublesome host is the next step, then maybe a reboot. After that I can try your suggestions on the pool master. I'm just a little gun-shy about messing with the master after this happened to a server that should have just worked.
-
@CodeMercenary Try the toolstack reset on that host.
Maybe do a toolstack reset on the pool master as well - which doesn't affect running VMs.
If that doesn't work, try an emergency network reset on the host having trouble.
-
@guiltykeyboard Restarting the toolstack on the pool master did not fix it, so I went ahead and did the emergency network reset on the host having the issue. It came up normally after the reboot. The emergency reset was super easy. Thank you for helping me with this. I'm still learning XCP-ng and trying to understand what's happening before trying things to fix it, so in the future I'll know more and hopefully be able to help other people.