Hiding hypervisor from guest to prevent Nvidia Code 43

jtbw911

@slavD I'm in agreement with you. I wanted to replace my multiple gaming rigs with multiple VMs with GPUs (since I already run the server); but, I spent at least a month messing around with various options and never came up with a stable, reliable, and "not overly complicated" way to do so without drastically changing my approach to hardware - the easiest being going with AMD GPUs and increasing my power consumption over 9000%. It's why I've ultimately paused that project for now until some better solution comes along, or until I can "empty" my existing host and repurpose it with a solution that will work (probably unRAID despite my concerns with other parts of it).

olivierlambert

@slavD said in Hiding hypervisor from guest to prevent Nvidia Code 43:

@olivierlambert
I hope you are being sarcastic. Of course it is on purpose. Nvidia are known to f around with people like that. Why are you refusing to implement a simple option to hide the hypervisor like KVM does?

I hope you are being sarcastic. How on earth do you think it's easy to do in Xen?

Reality is not your gut feeling. It's FAR from being a simple/trivial change in Xen. It's really complicated. I'm not refusing to do it, it's just something that would require months.

slavD

@olivierlambert
I am not pretending to be a specialist in this area. However, KVM did it years ago AND it is open source. Furthermore, there are countless DIY guides from back when it wasn't implemented in KVM yet. The way I understand it, the main things that need changing are the hypervisor CPU feature flag and the CPUID leaves, which need to be hidden from the guest, and the hardware vendor id, which needs to be set to something random.
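To make the two items above concrete, here is a minimal sketch of how a guest could interpret them. It is an illustration only, not taken from KVM, Xen, or any Nvidia driver: the function names are made up for this example, and the register values are assumed to have been read elsewhere with the CPUID instruction. The only facts relied on are that CPUID leaf 1 reports "hypervisor present" in ECX bit 31 and that leaf 0x40000000 conventionally returns a 12-byte vendor signature in EBX, ECX, EDX.

```python
import struct

# CPUID leaf 1, ECX bit 31 is the "hypervisor present" feature flag.
HYPERVISOR_BIT = 1 << 31


def hypervisor_flag_set(leaf1_ecx: int) -> bool:
    """Return True if the hypervisor-present bit is set in leaf 1 ECX."""
    return bool(leaf1_ecx & HYPERVISOR_BIT)


def vendor_signature(ebx: int, ecx: int, edx: int) -> str:
    """Decode the 12-byte signature returned by CPUID leaf 0x40000000.

    The bytes are stored little-endian in EBX, ECX, EDX, in that order.
    Trailing NUL padding (used by shorter signatures) is stripped.
    """
    raw = struct.pack("<III", ebx, ecx, edx)
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")


if __name__ == "__main__":
    # Example register values built from a known signature string,
    # so nothing here depends on actually running inside a VM.
    ebx, ecx, edx = struct.unpack("<III", b"XenVMMXenVMM")
    print(hypervisor_flag_set(0x80000000))  # flag visible to the guest
    print(vendor_signature(ebx, ecx, edx))  # signature visible to the guest
```

Hiding the hypervisor, in this picture, means the guest sees the bit cleared and either no signature or one it does not recognize.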

I've checked the other Nvidia threads on this forum. People have been complaining about this problem for more than a year. Furthermore, they are giving references to guides and sources on how it's done with KVM.

What does "it is really complicated" mean? If the XCP-NG team does not know how to do it, even when it is known what causes the problem and there is an open source solution to serve as an example, then how did you even get where you are now? Don't pretend. If you really wanted to, you could have implemented it by now.

At one point I was even considering paying for XO, but this really is a deal breaker for me.

olivierlambert

Holy cow

Your attitude will be really helpful, for sure.

Why on earth would there be any connection between KVM and Xen? They are truly different projects, with code bases that have literally nothing in common. Something that applies to KVM doesn't apply to Xen.
Really complicated means this would require a kind of overhaul of the Xen code base. In short, there's no trivial way to do it right now. We aren't talking about changing one line. So NO: it's not a matter of whether I want to do it. IT IS complicated.
Why, but WHY would I pretend that I don't want to do it? I'd love to have a quick patch to solve this for people who want it. But clearly, this will require A LOT of work. How do I know? Because I already asked top Xen devs about this 6 months ago.

Prove me wrong with a patch and I'll be VERY happy to integrate it into XCP-ng.

Seriously mate, your tone is really unpleasant

Biggen

I think another reason xcp-ng doesn't want to dedicate resources to implementing this is that there isn't a ton of demand for it from an Enterprise/Production standpoint. Most Production environments are going to be using a Quadro if going the Nvidia route.

I'm curious if the Titans are also limited in the way the other consumer Nvidia cards are.

slavD

@olivierlambert
Sorry, I did not intend to be rude. My point was that it is a known problem with a known solution (albeit for a different code base).
I would love to have a go at implementing a patch. Unfortunately, at this time I have neither the hardware nor the knowledge or time to do it (day job and family).

Otherwise I agree with @Biggen. This is a really niche use case and not many people are interested in it, therefore it is not considered important. Thanks for putting it in a more understandable way mate.

It just sucks to be on the minority side I guess.

P.S. @olivierlambert Are you saying that there are no plans for implementing a patch for this in the future?

EDIT:
@Biggen as far as I know only vGPU enabled cards (Grid/Tesla) do not have this problem, and even so those cards require crazy subscription-based licenses to work. We are talking about sums that only enterprise data centers can afford. Furthermore, even if you have a Tesla card + license you still cannot do a PCI passthrough - you are limited to assigning "slices" (vGPUs) to the VMs + you need a proprietary driver on the host for that to work, and no such driver exists for XCP-NG.
Consumer, Titan... PCI passthrough = code 43 (unless you find a way to trick the drivers into thinking they are not running in a VM). I read mixed reports about Quadro, so I would say that those are hit or miss, depending on the card and driver used. About the drivers themselves - basically older drivers are more likely to work, and by older I mean copies that were actually downloaded back then. I read complaints in other forums saying "driver XXX worked before but when I downloaded it now it does not work anymore", so my guess is that they have recompiled the older drivers to include the code 43 error.

olivierlambert

I'm not saying there's no plan. What I did:

I immediately reacted to those requests in the past by asking the Xen community.
The answer was (in short): "in its current shape, Xen isn't able to be easily modified to support this." However, I was told that it might be easier in the future (Xen 4.14 could come with a code base that would be easier to modify for what we need).
So now, I'm keeping an eye on this, to catch the opportunity once it no longer requires 5 developers working full time on this for 6 months.

It's always a risk/reward ratio. The huge amount of work (and resources) needed to achieve that NOW is clearly not realistic. But this will change with a future Xen release.

slavD

@olivierlambert
I see, 4.14 is supposed to be released sometime during the summer if I am not mistaken?
Can you elaborate a bit more on "could come with a code base that would be easier to modify for what we need". What makes you say that?

olivierlambert

Those weren't my words but those of a Xen core dev. Right now, the part that you want modified is not easily editable: it's not meant to be "exposed" or changed whatsoever. Not configurable, if you prefer.

So there's some heavy lifting to allow deeper Xen "static values" to be edited. Without this, you are doomed.

slavD

@olivierlambert
Okay I haven't checked the XEN sources extensively but I presume that those 'static values' you are talking about are hard-coded somewhere in there. From what I have read, Nvidia's detection method is to look for specific strings in specific places. I believe that (at least part of) the KVM patch is to randomize those values.

So, at least theoretically, wouldn't it be possible to hard-code different values and recompile the whole source? That should provide at least a temporary fix, and I understand that everyone would have to do it for themselves, but perhaps it would be possible for someone with a more detailed knowledge on the project to create a guide, perhaps on the wiki, which people interested in this and with enough technical expertise can follow.
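As a rough sketch of what "different values" could look like: the snippet below generates a random 12-character signature and packs it into the three 32-bit register values that CPUID leaf 0x40000000 would return (EBX, ECX, EDX, little-endian). The helper names are hypothetical, and this is not code from the KVM patch or from Xen. It only shows the data format, not where in a hypervisor such values would have to be changed.

```python
import secrets
import string
import struct

SIGNATURE_LENGTH = 12  # three 32-bit registers: EBX, ECX, EDX


def random_signature(length: int = SIGNATURE_LENGTH) -> str:
    """Return a random ASCII signature of letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def signature_to_registers(signature: str) -> tuple:
    """Pack a signature of up to 12 ASCII characters into (EBX, ECX, EDX).

    Shorter signatures are padded with NUL bytes on the right.
    """
    raw = signature.encode("ascii")
    if len(raw) > SIGNATURE_LENGTH:
        raise ValueError("signature must be at most 12 bytes")
    return struct.unpack("<III", raw.ljust(SIGNATURE_LENGTH, b"\x00"))


if __name__ == "__main__":
    sig = random_signature()
    ebx, ecx, edx = signature_to_registers(sig)
    print(sig, hex(ebx), hex(ecx), hex(edx))
```

Whether a guest driver accepts such a value is a separate question; this only covers the encoding.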

olivierlambert

I have exactly 0 resources available right now to work on this problem. As I said, this will require a lot of time to reach a result that:

won't be upstreamed if it's "hacky" (the current Xen code base won't allow doing that properly)
will require an entire Xen rebuild and package creation

As a small team today, should we waste time on something that won't last long and won't be upstreamed?

Really, we gladly accept contributions, but I won't put that as our priority 1 before UEFI and secure boot for VMs (current Xen work on our side) and other capital features that can be done with far less effort.

Please put your request in perspective: you aren't alone in the world.

slavD

@olivierlambert
Of course I am not alone, and of course I am not suggesting that you drop all other work to deal with this. I was just asking whether a strictly DIY (albeit "hacky") solution is possible (no upstreaming or anything like that).

I will gladly look into this some more over the coming months when I have the time and, preferably, appropriate hardware.

imtrobin

Actually, there are patches for Xen, and it works on Xen with the driver patcher, but sadly it doesn't work on XenServer/XCP.

https://github.com/sk1080/nvidia-kvm-patcher/issues/45#issuecomment-574680727
https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01713.html

I'm not familiar enough with the Xen code base (I took a look, and ugh) to know where to apply the hiding, but I don't think it should take months for someone familiar with the code base.

olivierlambert

This is exactly the patch I asked the Xen team about, and the answer was: "this is an ugly hack that will never be upstream" (until Xen exposes an interface to make those changes).

edit: I'll ask again when the Xen code base is more ready for this

imtrobin

From the GitHub post, it seems that blacklisting the GPU works, which is similar to how KVM does it, without any modification to Xen?

olivierlambert

So I asked some people in the Xen team: the CPUID/MSR changes needed for this use case aren't ready yet.

Forza

Is it not better to vote with your wallet and choose something other than Nvidia?

imtrobin

Thanks Olivier,

Sadly, ATI has been going downhill since AMD bought them.

smithereens

@olivierlambert There is a huge and growing market for virtualised GPU: gaming, AI, 3D. Personally I'd pay $$ for this feature alone - having spent days trying to get this working. I do understand that Nvidia will use their might to squash this functionality - as it is not in their commercial interests - they'll claim it circumvents their EULA.

smithereens

I'll look for updates on this topic... but for now I'm going to drop XCP-ng and go (back) to KVM.