XCP-ng

    Nvidia Quadro P400 not working on Ubuntu server via GPU/PCIe passthrough

      warriorcookie @olivierlambert

      @olivierlambert said in Nvidia Quadro P400 not working on Ubuntu server via GPU/PCIe passthrough:

      1. True Type-1 hypervisor (like ESXi, unlike KVM) makes it more isolated but harder to do things in general
      2. It's as hard in ESXi, but resources on the hypervisor are 2 or 3 orders of magnitude higher than for the Xen project.

      Obviously, we are working hard here at Vates to get more people directly involved in the Xen project. But it takes time and a vast amount of money to reach our target 🙂 Anyway, I'll try to see what I can do with our resources. The main issue for me now on this feature: it's mainly for non-pro usage, so no company will finance that.

      I certainly appreciate the challenge, and I wish I had something to offer to help development-wise.
      Perhaps a more "pro" use case could be from the standpoint of nested VM with the likes of HyperV?

        olivierlambert Vates 🪐 Co-Founder CEO

        I don't think I have ever heard of any company/industry requiring the possibility to hide the hypervisor so far (which doesn't mean it doesn't exist, though).

        It's all a matter of priorities and limited resources, sadly… That's why taking the problem from the other angle (i.e. drivers that don't check for it) might be a correct solution.

          TheFrisianClause @olivierlambert

          @olivierlambert
          Well, I am also looking into a Quadro K2000/M2000 or something similar. I believe those would pass through without any issues?

            olivierlambert Vates 🪐 Co-Founder CEO

            I think we should go back to the main accountable company about all of this: it's Nvidia 😄 They probably have the answer on how they decided to artificially segment their product line via their drivers 😛

              warriorcookie @olivierlambert

              @olivierlambert that is a certainty. But no amount of bribery or blackmail seems to make them want to let us in on the secret...

                TheFrisianClause @warriorcookie

                @warriorcookie Well, I am at the point of almost giving up on the P400 and just going with an M2000. I currently have an AMD Radeon Pro WX2100 lying around, but no luck with this one in Plex...

                  TheFrisianClause

                  Currently I can get an M4000 for a decent price, and I think this one should be able to pass through in XCP-NG, but I have no idea if it actually will. Does anyone reading this have experience with passing through a Quadro M2xxx or M4xxx card on XCP-NG?

                    olivierlambert Vates 🪐 Co-Founder CEO

                    You should ask on the Nvidia forum, I assume it will be a safe bet 🙂

                      TheFrisianClause

                      I will take the bet. An M4000 for 200 euros is pretty cheap, as I have seen the M2000 at higher prices.

                        TheFrisianClause

                        Well, I just tested the Quadro M4000, and no luck with this one either... This GPU is a lot more powerful than the P400, though, so I will keep it anyway....

                        EDIT: Well, I complained too quickly. Apparently a host reboot resolved it, and I now see the GPU in the Plex VM 😄

                          XCP-ng-JustGreat @TheFrisianClause

                          @thefrisianclause Hello, not sure if I missed something from the above thread, but did any of you try to turn off the CPUID "hypervisor present" bit on an Intel-based XCP-ng host VM using this technique from the thread referenced by @warriorcookie above? https://xcp-ng.org/forum/topic/4643/nested-virtualization-of-windows-hyper-v-on-xcp-ng/26

                          It is the equivalent of the ESXi Hypervisor.CPUID.v0="FALSE" vmx file configuration tweak. It configures the XCP-ng VM to, in effect, lie to the guest OS by saying, "you are not running on a hypervisor."
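                          Whatever method is used on the host side, the result can be checked from inside a Linux guest. On x86, the kernel lists a `hypervisor` entry on the `flags` line of `/proc/cpuinfo` when the CPUID "hypervisor present" bit (leaf 1, ECX bit 31) is set. Below is a minimal sketch of such a check; the function name `guest_sees_hypervisor` is made up for illustration, and it only reports what the guest sees, not whether the Nvidia driver uses some other detection method.

```python
def guest_sees_hypervisor(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line of /proc/cpuinfo lists 'hypervisor'.

    Hypothetical helper for illustration; expects x86 /proc/cpuinfo content.
    """
    for line in cpuinfo_text.splitlines():
        # Lines look like "flags\t\t: fpu vme de ... hypervisor ..."
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "hypervisor" in value.split():
            return True
    return False
```

                          Inside the guest, pass it the contents of `/proc/cpuinfo` (for example `guest_sees_hypervisor(open("/proc/cpuinfo").read())`). If it still returns True after the tweak, the bit was not actually hidden from the guest.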

                            TheFrisianClause @XCP-ng-JustGreat

                            @xcp-ng-justgreat Does this also work for Linux? As I currently run Linux with this and not Windows. But I will have a look into it 🙂

                            Also I am running this on an AMD Ryzen 3800XT.

                            EDIT: I also had the Quadro M4000 running this afternoon, but had to redo the VM. Now I can't get it to work anymore, even though I followed the same steps. Sometimes Xen is like a 'lady': sometimes it works and sometimes it doesn't 🙂

                              XCP-ng-JustGreat @TheFrisianClause

                              @thefrisianclause Yes, it should work for a Linux guest too. It alters the VM's apparent CPUID as presented to the guest OS--whatever that happens to be. I'm not one-hundred percent sure about AMD, but try the same technique. The same bit probably has the same purpose on an AMD CPU.

                                olivierlambert Vates 🪐 Co-Founder CEO

                                How would you hide this exactly? For Linux, the "best solution" for now is to get a modified Nvidia driver without the virt check. On Windows, it already works with recent drivers (they removed the check).

                                  TheFrisianClause @XCP-ng-JustGreat

                                  @xcp-ng-justgreat Unfortunately this did not work on Ubuntu 20.04.3 LTS ..... Even my recently bought Quadro M4000 won't work...

                                  I am now getting this error:

                                  [   50.123425] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x25:0x40:1250)
                                  [   50.123475] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
                                  [   51.102646] NVRM: GPU 0000:00:06.0: GPU has fallen off the bus.
                                  [   51.304888] NVRM: A GPU crash dump has been created. If possible, please run
                                                 NVRM: nvidia-bug-report.sh as root to collect this data before
                                                 NVRM: the NVIDIA kernel module is unloaded.
                                  [   51.507717] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x24:0xffff:1220)
                                  [   51.507755] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
                                  
                                    XCP-ng-JustGreat @TheFrisianClause

                                    @thefrisianclause Sorry to hear that. NVIDIA does jealously guard its secret sauce from the world. Altering the CPUID hypervisor bit falls within the realm of unnatural acts. While the technique has proven useful in other use cases and is a good thing to know about, it may be that countermeasures have been added to the GPU drivers to expose the lie. Hard to know . . .

                                      TheFrisianClause

                                      It is strange, as I had it working this afternoon on a different VM.... But shouldn't Quadro cards be free of these particular issues (code 43), since they are 'Quadro'?

                                        XCP-ng-JustGreat @TheFrisianClause

                                        @thefrisianclause That's a good point. However, it's not here http://hcl.xenserver.org/gpus/?gpusupport__version=20&vendor=50 so NVIDIA and Citrix have no obligation to support it for their commercial customers. If you do get it to work, it's by the grace of Vates and/or other XCP-ng users here. Best of luck!

                                          TheFrisianClause

                                          @olivierlambert @XCP-ng-JustGreat For some reason I got the M4000 working again in XCP-NG

                                          137933c1-672c-4217-8f3a-a37b7052969d-image.png

                                          Let's hope it does not leave me again 😞

                                          EDIT:
                                          Well, I simulated an emergency reboot by rebooting the whole host. The card has fallen off the bus and nvidia-smi does not work anymore.

                                          [   19.111279] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x25:0x40:1250)
                                          [   19.111325] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
                                          [   20.089528] NVRM: GPU 0000:00:05.0: GPU has fallen off the bus.
                                          [   20.292170] NVRM: A GPU crash dump has been created. If possible, please run
                                                         NVRM: nvidia-bug-report.sh as root to collect this data before
                                                         NVRM: the NVIDIA kernel module is unloaded.
                                          [   20.495046] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x24:0xffff:1220)
                                          [   20.495093] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
                                          [   21.468090] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x22:0x56:667)
                                          [   21.468109] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
                                          [   22.077205] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x22:0x56:667)
                                          [   22.077257] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
                                          [   22.686023] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x22:0x56:667)
                                          [   22.686061] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
                                          [   23.294321] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x22:0x56:667)
                                          [   23.294381] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
                                          [   24.479743] rfkill: input handler disabled
                                          [   30.633389] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x22:0x56:667)
                                          [   30.633444] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
                                          [   31.242314] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x22:0x56:667)
                                          [   31.242353] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
                                          

                                          I believe this has nothing to do with code 43?

                                          It also crashes my whole XCP-NG host when I try to reboot the Plex VM again...

                                          So I assume the card works, but falls off the bus for some reason and I have no idea why?

                                          EDIT 2:
                                          Just reinstalled a new Ubuntu 21.10, and there it works as well.
                                          1460b46f-a64b-4489-8eac-07257e961053-image.png

                                          Also after reboot of the Ubuntu 21.10, this keeps working:
                                          ab21cf99-e61c-40f0-afee-e55123b0bca1-image.png

                                          but when I reboot the host itself, I get this:
                                          93c74d63-45ca-451a-90ff-f4c282dfdbc3-image.png
                                          And the GPU falls off the bus, resulting in this error:

                                          [   24.428255] loop3: detected capacity change from 0 to 8
                                          [   42.582836] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x25:0x40:1250)
                                          [   42.582888] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
                                          [   43.562201] NVRM: GPU 0000:00:06.0: GPU has fallen off the bus.
                                          [   43.764639] NVRM: A GPU crash dump has been created. If possible, please run
                                                         NVRM: nvidia-bug-report.sh as root to collect this data before
                                                         NVRM: the NVIDIA kernel module is unloaded.
                                          [   43.968873] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x24:0xffff:1220)
                                          [   43.968930] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
                                          

                                          So I believe something happens on the XCP-NG side in this matter?
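                                          Since the failure comes and goes between boots, it may help to tally the NVRM lines from each boot's `dmesg` output and compare them. The snippet below is only a sketch: the function name `summarize_nvrm` and the two category labels are invented for illustration, and the regular expression assumes the line format shown in the logs above.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: GPU 0000:00:06.0: GPU has fallen off the bus."
NVRM_LINE = re.compile(
    r"NVRM: GPU (?P<addr>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): (?P<msg>.+)"
)


def summarize_nvrm(dmesg_text: str) -> dict:
    """Count init failures and bus drop-offs per PCI address (hypothetical helper)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = NVRM_LINE.search(line)
        if not match:
            continue  # not a per-GPU NVRM line (e.g. rfkill, loop devices)
        addr, msg = match.group("addr"), match.group("msg")
        if "fallen off the bus" in msg:
            counts[(addr, "fell_off_bus")] += 1
        elif msg.startswith("RmInitAdapter failed"):
            counts[(addr, "init_failed")] += 1
    return dict(counts)
```

                                          Feeding it the saved `dmesg` text of a working boot and of a failing boot shows whether the drop-off always follows the first `RmInitAdapter failed` line, which is the pattern in the logs posted here.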

                                          EDIT 3:
                                          Looking further in the logs I find this:

                                          [   10.315135] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000006] Failed to allocate NvKmsKapiDevice
                                          [   10.315943] [drm:nv_drm_probe_devices [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000006] Failed to register device

                                          EDIT 4: So sometimes it works and sometimes it doesn't; the failures are very sporadic. I now have the feeling that it could be an error on XCP-NG's side, but I am not 100% sure, as the card itself apparently works. I also have the feeling this could be a power issue, as I use a 2x Molex to 6-pin PCI-E converter for this graphics card.

                                          (Sorry for the long message)...

                                            warriorcookie @TheFrisianClause

                                            @thefrisianclause Not all Quadro cards. Datacenter-class cards will pass through fine. Workstation cards do not on Linux guests. They will pass through on Windows guests, as Nvidia removed the check in the driver.
