Hiding hypervisor from guest to prevent Nvidia Code 43
I think another reason xcp-ng doesn't want to dedicate resources to implementing this is that there isn't a ton of demand for it from an Enterprise/Production standpoint. Most Production environments are going to be using a Quadro if going the Nvidia route.
I'm curious if the Titans are also limited in the way the other consumer Nvidia cards are.
Sorry, I did not intend to be rude. My point was that it is a known problem with a known solution (albeit for a different code base).
I would love to have a go at implementing a patch. Unfortunately, at this time I possess neither the hardware nor the knowledge or time to do it (day job and family).
Otherwise I agree with @Biggen. This is a really niche use case and not many people are interested in it, therefore it is not considered important. Thanks for putting it in a more understandable way mate.
It just sucks to be on the minority side I guess.
P.S. @olivierlambert Are you saying that there are no plans for implementing a patch for this in the future?
@Biggen as far as I know only vGPU-enabled cards (Grid/Tesla) do not have this problem, and even so those cards require crazy subscription-based licenses to work. We are talking about sums that only enterprise data centers can afford. Furthermore, even if you have a Tesla card + license you still cannot do a PCI passthrough: you are limited to assigning "slices" (vGPUs) to the VMs, and you need a proprietary driver on the host for that to work, and no such driver exists for XCP-ng.
Consumer, Titan... PCI passthrough = code 43 (unless you find a way to trick the drivers into thinking they are not running in a VM). I read mixed reports about Quadro, so I would say that those are hit or miss, depending on the card and driver used. About the drivers themselves: basically, older drivers are more likely to work, and by older I mean copies of older drivers that were downloaded in the past. I read complaints in other forums saying "driver XXX worked before but when I downloaded it now it does not work anymore", so my guess is that they have recompiled the older drivers to include the code 43 error.
I'm not saying there's no plan. What I did:
- I immediately reacted to those requests in the past by asking the Xen community.
- The answer was (in short): "in its current shape, Xen isn't able to be easily modified to support this." However, I was told that it might be easier in the future (Xen 4.14 could come with a code base that would be easier to modify for what we need).
- So now I'm keeping an eye on this, to catch the opportunity once it no longer requires 5 developers working full time on this for 6 months.
It's always a risk/reward ratio. The huge amount of work (and resources) needed to achieve that NOW is clearly not realistic. But this will change with a future Xen release.
I see, 4.14 is supposed to be released sometime during the summer if I am not mistaken?
Can you elaborate a bit more on "could come with a code base that would be easier to modify for what we need"? What makes you say that?
Those weren't my words but those of a Xen core dev. Right now, the part that you want modified is not easily editable: it's not meant to be "exposed" or changed whatsoever. Not configurable, if you prefer.
So there's some heavy lifting to allow deeper Xen "static values" to be edited. Without this, you are doomed.
Okay, I haven't checked the Xen sources extensively, but I presume that those 'static values' you are talking about are hard-coded somewhere in there. From what I have read, Nvidia's detection method is to look for specific strings in specific places. I believe that (at least part of) the KVM patch is to randomize those values.
So, at least theoretically, wouldn't it be possible to hard-code different values and recompile the whole source? That should provide at least a temporary fix. I understand that everyone would have to do it for themselves, but perhaps someone with more detailed knowledge of the project could create a guide, perhaps on the wiki, which people interested in this and with enough technical expertise can follow.
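For readers following along, here is a minimal sketch of what such a "specific string in a specific place" can look like. On x86, hypervisors conventionally report a 12-byte vendor signature in CPUID leaf 0x40000000, spread across the EBX, ECX and EDX registers (Xen uses "XenVMMXenVMM"). The sketch only packs and unpacks such a signature; it does not execute CPUID, and it is an assumption (not confirmed in this thread) that this leaf is the value the Nvidia driver inspects. The function names are hypothetical.

```python
import struct

# Well-known signatures that hypervisors report in CPUID leaf 0x40000000.
# The mapping is illustrative and not exhaustive.
KNOWN_SIGNATURES = {
    "XenVMMXenVMM": "Xen",
    "KVMKVMKVM": "KVM",  # padded with NUL bytes to 12 bytes in the registers
    "Microsoft Hv": "Hyper-V",
    "VMwareVMware": "VMware",
}


def encode_signature(text):
    """Pack a signature string into (EBX, ECX, EDX) little-endian register values."""
    raw = text.encode("ascii").ljust(12, b"\x00")[:12]
    return struct.unpack("<III", raw)


def decode_signature(ebx, ecx, edx):
    """Rebuild the signature string from the three register values."""
    raw = struct.pack("<III", ebx, ecx, edx)
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")


def identify(ebx, ecx, edx):
    """Return the hypervisor name for a known signature, else 'unknown'."""
    return KNOWN_SIGNATURES.get(decode_signature(ebx, ecx, edx), "unknown")
```

A guest that reads these registers and gets back a string other than one it recognizes would classify the platform as "unknown", which is the effect that changing or randomizing the value aims for. Whether that alone is enough to avoid code 43 is not something this sketch can show.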
I have exactly 0 resources available right now to work on this problem. As I said, this will require a lot of time to reach a result that:
- won't be upstreamed if it's "hacky" (the current Xen code base won't allow doing that properly)
- will require an entire Xen rebuild and package creation
As a small team today, should we waste time on something that won't last long and won't be upstreamed?
Really, we gladly accept contributions, but I won't make that our priority 1 ahead of UEFI and secure boot for VMs (current Xen work on our side) and other major features that can be done with far less effort.
Please put your request in perspective: you aren't alone in the world.
Of course I am not alone, and of course I am not suggesting that you drop all other work to deal with this. I was just asking whether a strictly DIY solution (albeit a "hacky" one) is possible, with no upstreaming or anything like that.
I will gladly look into this some more over the coming months when I have the time and preferably appropriate hardware.
Actually, there are patches for Xen, and it works on Xen together with a driver patcher, but sadly it doesn't work on XenServer/XCP.
I'm not familiar enough with the Xen code base (I took a look, and ugh) to know where to apply the hiding, but I don't think it should take months for someone familiar with the code base.
This is exactly the patch I asked the Xen team about, and the answer was: "this is an ugly hack that will never be upstream" (until Xen exposes an interface to make those changes).
edit: I'll ask again when the Xen code base is more ready for this.
From the GitHub post, it seems that blacklisting the GPU works, which is similar to how KVM does it, without any modification to Xen?
So I asked some people in the Xen team: the CPUID/MSR changes needed for this use case aren't ready yet.
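As a rough illustration of the CPUID side of this: on x86, a hypervisor normally sets the "hypervisor present" bit (CPUID leaf 1, ECX bit 31), and a Linux guest shows it as the `hypervisor` flag in /proc/cpuinfo. The sketch below checks for that flag in cpuinfo-style text. It is a quick way to see whether a Linux guest can still tell it is virtualized, not a statement about what the Nvidia Windows driver checks; the function name is hypothetical.

```python
def has_hypervisor_flag(cpuinfo_text):
    """Return True if the first 'flags' line lists the 'hypervisor' flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Flags are whitespace-separated tokens; match the whole token.
            return "hypervisor" in value.split()
    return False

# Typical use inside a Linux guest:
#     with open("/proc/cpuinfo") as f:
#         print(has_hypervisor_flag(f.read()))
```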
Isn't it better to vote with your wallet and choose something other than Nvidia?
Sadly, ATI has been going downhill since AMD bought them.