I have raised this topic before without any real resolution, so I am hoping to revisit it.
- We're on XCP-ng 8.2.1 with the latest XOA, etc.
- In this particular environment each server can hold a maximum of 9 PCI cards at x16, plus 1 card at x8.
- The physical servers are fully patched, firmware included; everything that can be updated has been updated.
What we're seeing is that each host server can lose 1-2 of its GPUs. We're using NVIDIA Quadro T1000 (8GB) cards, with each card assigned to a single VM using passthrough.
What happens is that a user is working and then, poof, their GPU disappears from Windows and they get an alert. That GPU stays "gone" until I reboot the physical host server. After the reboot it comes back and is usable, but within 24 hours of use it disappears again.
This issue doesn't happen on ALL cards, just a few. I have done some digging to see whether this could be a physical card problem, but the cards all show up in the OS and in lspci. I can see the cards are there, but they essentially get locked and are no longer assignable, even if I restart the toolstack.
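For anyone who wants to compare notes, here is a rough sketch of the kind of lspci check I mean. It is not the exact commands I ran. It assumes `lspci -nn` is available on the host and filters its output for NVIDIA devices (PCI vendor ID 10de). The function names `nvidia_devices` and `read_lspci` are just made up for illustration. It also flags any device reporting `rev ff`, which as far as I understand usually means the device has stopped answering config-space reads, so treat that flag as a hint rather than a diagnosis.

```python
import re
import subprocess

NVIDIA_VENDOR_ID = "10de"  # NVIDIA's PCI vendor ID

# Matches one line of `lspci -nn` output: the bus address, then the first
# [vendor:device] ID pair, then an optional "(rev xx)" suffix.
LINE_RE = re.compile(
    r"^(?P<addr>\S+)\s+.*?\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
    r"(?:\s+\(rev (?P<rev>[0-9a-f]{2})\))?"
)


def nvidia_devices(lspci_text):
    """Return one dict per NVIDIA device found in `lspci -nn` output."""
    found = []
    for line in lspci_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None or m.group("vendor") != NVIDIA_VENDOR_ID:
            continue
        found.append({
            "addr": m.group("addr"),
            "device": m.group("device"),
            "rev": m.group("rev"),
            # Assumption: "rev ff" suggests the card no longer responds.
            "unresponsive": m.group("rev") == "ff",
        })
    return found


def read_lspci():
    """Run `lspci -nn` on the host and return its text output."""
    result = subprocess.run(
        ["lspci", "-nn"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

Calling `nvidia_devices(read_lspci())` on the host gives a list I can compare before and after a card drops out, to see whether the affected card still shows up and whether its revision has changed.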
I am at a loss. It's puzzling and causing a lot of issues lol