Troubleshooting System Interrupts issue
-
I have a Win10Pro guest that I've been running on XenServer 7.2 for years. The pool was two servers with one server mostly remaining powered off due to no need (home lab setup).
The main server started throwing an error with one of the locally-installed SSDs related to a coalescing issue. I opted to bring the second server online, upgrade it in place to XCP 8.2.1, then upgrade the Pool Master to 8.2.1 as well. I moved ALL of the workloads to the second server, shut the main server down and removed it from the pool. I then did a clean install of 8.2.1 and added it back to the pool.
No matter the situation, this one guest has some "choppiness" to it that always traces back to high System Interrupts in Task Manager. Since this is commonly tied to hardware in the Windows world, I'm at a complete loss of where to even begin troubleshooting.
I've run SFC, verified all updates are installed, and stripped out all "extra" software that I simply don't need. It's running on a workstation class machine (HP Z800) with 96GB RAM and dual hexacore Xeon CPUs (3.33GHz). 4 vCPUs are allocated to this guest (two sockets, two cores per socket) along with 16GB RAM (dedicated). I have two other Win10Pro guests that are configured with roughly the same HD setup, the same or fewer CPUs, and the same or less RAM, and I have zero issues with them. The -only- config-related difference is that this guest has a Bitlocker encrypted drive. This was done long before the issues started, however.
Any suggestions on where to look first?
-
@ember1205 Any differences in the firmware version on that new host from the other host? How about the BIOS settings?
-
@tjkreidl said in Troubleshooting System Interrupts issue:
@ember1205 Any differences in the firmware version on that new host from the other host? How about the BIOS settings?
Thanks for the thoughts.
All of the HP Z800 machines that I have are the same architecture, same BIOS version, and (generally) same settings. By the time the host server loads, everything exposed to the guests appears the same.
The guest was running perfectly fine on this host for a long time. The problem seemed to first occur when I tried moving the virtual disk off of the local SSD drive and onto an NFS-based repository. The SSD would alert periodically about an issue with a coalesce operation and the disk would not copy off of the drive. I had to use Clonezilla to get "as much as I could", skipping a handful of bad sectors. I honestly don't remember how closely all of that aligned with this issue, so I don't know that it's absolutely the cause.
With previous versions of Windows, I could reinstall the OS right over the top of the one that was there, no loss of anything, and it would update little oddities like a bad device or driver along the way. I haven't really found a way to do that with Win10, so I'm trying to troubleshoot the issue to directly repair it.
-
@ember1205 If you want to risk it, you could try to migrate another VM over to that host and see if the performance also deteriorates. That would nail down the issue to the host vs. the storage (assuming you have the shared storage SRs between the hosts). With a new install on that host, did you adjust the dom0 memory allocation to be the same as the original host?
-
@tjkreidl said in Troubleshooting System Interrupts issue:
@ember1205 If you want to risk it, you could try to migrate another VM over to that host and see if the performance also deteriorates. That would nail down the issue to the host vs. the storage (assuming you have the shared storage SRs between the hosts). With a new install on that host, did you adjust the dom0 memory allocation to be the same as the original host?
I have plenty of other guests running on the same host, no issue. I've migrated all guests off to another host, no change.
I have other guests that are materially similar to this one that have no issue. It's very odd...
-
@ember1205 Maybe try to uninstall and re-install XenTools?
-
@tjkreidl said in Troubleshooting System Interrupts issue:
@ember1205 Maybe try to uninstall and re-install XenTools?
I have done that as well. I removed the tools that came with the 7.2 XenServer software and installed the latest Citrix XenTools. No difference.
Same software across all of the Win10 guests...
Launching Task Manager will immediately show System Interrupts pegged at 100%, which then settles down. The performance tab of the guest (viewed from another machine) never shows the 4 vCPUs go above about 40% (even though Task Manager claims the CPU is at 100%), the memory (16GB min/max - 'reserved') utilization is at about 4-5GB, and there's little to no network or disk traffic.
-
@ember1205 Do you have dom0 configured with the same memory allocation? If you run "iostat -x" and xentop, do you see anything saturating? The other thought is that something is screwed up with that specific VM, as you apparently do not see the symptoms on any other VM.
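As a rough idea of what "saturating" would look like in the iostat output, here is a minimal Python sketch that scans captured "iostat -x" text and reports any device whose %util is at or above a cutoff. The function name saturated_devices, the 90% cutoff, and the sample numbers are all made up for illustration, and it assumes the usual sysstat layout where the device table has a header row containing a %util column.

```python
def saturated_devices(iostat_text, threshold=90.0):
    """Return {device: highest %util seen} for devices at or above threshold.

    Assumes sysstat-style `iostat -x` output: a header row that contains a
    `%util` column, followed by one row per device, ended by a blank line.
    """
    flagged = {}
    util_col = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            util_col = None  # a blank line ends the current device table
            continue
        if "%util" in fields:
            util_col = fields.index("%util")  # remember where %util sits
            continue
        if util_col is not None and len(fields) > util_col:
            try:
                util = float(fields[util_col])
            except ValueError:
                continue  # not a numeric device row, skip it
            if util >= threshold:
                flagged[fields[0]] = max(util, flagged.get(fields[0], 0.0))
    return flagged


# Hypothetical sample, not taken from any real host.
SAMPLE = """\
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.20    0.00    0.80   35.10    0.00   62.90

Device:  rrqm/s wrqm/s   r/s    w/s  rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
sda        0.00   1.50  2.00  10.00  64.00  512.00    96.00     0.05   4.10    3.00    4.30  1.20   1.44
sdb        0.00  40.00  5.00 300.00 160.00 9000.00    60.07    28.40  93.00   20.00   94.20  3.20  97.60
"""

print(saturated_devices(SAMPLE))
```

A device that sits near 100% in %util with a high await while the guest feels choppy would point at storage; if nothing is flagged, the host's disks are probably not the bottleneck.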
-
@tjkreidl said in Troubleshooting System Interrupts issue:
@ember1205 Do you have dom0 configured with the same memory allocation? If you run "iostat -x" and xentop, do you see anything saturating? The other thought is that something is screwed up with that specific VM, as you apparently do not see the symptoms on any other VM.
I am absolutely of the thought that it's the VM, but my expectation is that it's "inside" of the VM as opposed to an external piece.
I don't know what I would want to be looking for in terms of the saturation items. Also, the xstat output shows 16 vCPU, but these machines are 24 vCPU (dual hexacore with hyperthreading). The XCP-NG Client program shows 24 vCPU in the graphs. This is how it has always been, even with the prior versions of XenServer.
-
@ember1205 Weird. You don't have an older backup of the VM, per chance? Sounds almost like it may have been somehow corrupted.
-
@tjkreidl said in Troubleshooting System Interrupts issue:
@ember1205 Weird. You don't have an older backup of the VM, per chance? Sounds almost like it may have been somehow corrupted.
I don't. And I didn't realize the SSD (which is where the VHD was stored for a very long time) was giving coalesce warnings until I tried to move it off to an NFS repository. I ended up cloning the drive to get "most of it" so that it was functional, but something is corrupted in there somewhere. It has to be.
-
@ember1205 I agree, there is no other reasonable explanation. Alas, backups are so critical. I hope, if need be, you can re-create the VM somehow! That, or you'll have to have it limp along as is I'm afraid.
-
@tjkreidl said in Troubleshooting System Interrupts issue:
@ember1205 I agree, there is no other reasonable explanation. Alas, backups are so critical. I hope, if need be, you can re-create the VM somehow! That, or you'll have to have it limp along as is I'm afraid.
Yeah, I'll just leave it as is for now. Rebuilding is an option, but not a great one as I'm looking to make the move to Win11 "soon."
-
@Marcsteven said in Troubleshooting System Interrupts issue:
There are several things you can try to resolve System Interrupts errors. First, you can disable sound effects: right-click the speaker icon in the taskbar, double-click your default device (speaker), and open Properties. On the Enhancements tab, check the box labeled "Disable all sound effects," then click OK to save the setting. Hopefully that resolves the issue. If it doesn't, you could update your PC's BIOS. To check the current BIOS version, open Windows search, type CMD, run Command Prompt, and enter these commands one at a time, pressing Enter after each: systeminfo | findstr /I /c:bios and wmic bios get manufacturer, smbiosbiosversion
Appreciate the ideas here, but most of it pertains to desktop machines not virtual guest instances (upgrading the PC BIOS is not really applicable for this unless it were occurring on ALL guests, for example).