    More than 64 vCPU on Debian11 VM and AMD EPYC

      olivierlambert Vates 🪐 Co-Founder CEO

      DPU is also an option (it's exactly what we do with Kalray's DPUs)

        TodorPetkov @alexredston

        @alexredston What kernel do you use? Can you show the kernel boot parameters (/proc/cmdline)? In our case we used the Debian 11 image from their website, which had the cloud kernel and acpi=on by default. Once we switched to the regular kernel and turned off ACPI, we saw all the vCPUs in the VM.
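
A quick way to answer the /proc/cmdline question from inside the VM is sketched below. It is a minimal illustration, not a tool from this thread: the helper names `parse_cmdline` and `acpi_setting` are made up for the example, and it only reports an explicit `acpi=` parameter, not the kernel's built-in default.

```python
from __future__ import annotations

from pathlib import Path


def parse_cmdline(cmdline: str) -> dict[str, str | None]:
    """Split a kernel command line into {parameter: value}.

    Bare flags such as "ro" or "quiet" map to None.
    """
    params: dict[str, str | None] = {}
    for token in cmdline.split():
        # Split on the first "=" only, so root=UUID=abc keeps "UUID=abc" as value.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def acpi_setting(cmdline: str) -> str:
    """Return the explicit acpi= value, or "not set" when it is absent."""
    value = parse_cmdline(cmdline).get("acpi")
    return value if value is not None else "not set"


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    if proc.exists():  # only present on Linux
        print("acpi:", acpi_setting(proc.read_text()))
```

Comparing this output with `nproc` or `lscpu` before and after the change shows whether the parameter is what limits the visible vCPUs.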

          POleszkiewicz @olivierlambert

          @olivierlambert What exactly do you support from Kalray? Could you tell us more?

            olivierlambert Vates 🪐 Co-Founder CEO

            https://xcp-ng.org/blog/2021/07/12/dpus-and-the-future-of-virtualization/

            https://xcp-ng.org/blog/2021/12/20/dpu-for-storage-a-first-look/

              POleszkiewicz @olivierlambert

              @olivierlambert Interesting, but where is the benefit over NVMe-oF + SR-IOV, which is doable on a Mellanox CX3 or, better, CX5 and up? Offloading dom0 to specialized hardware is interesting, but what I see in these articles is basically equivalent to connecting to an NVMe-oF target via an SR-IOV NIC, which has been doable for quite a while without any changes in XCP-ng.

                olivierlambert Vates 🪐 Co-Founder CEO

                It uses local NVMe drives and splits them, so there is no need for external storage (but you can also use remote NVMe, as with NVMe-oF, and potentially multiple hosts in HCI mode).

                  POleszkiewicz @olivierlambert

                  @olivierlambert With NVMe-oF I can split them easily too (target per namespace), and I actually get redundancy compared to a local device (connect to two targets on different hosts and RAID1 them in the VM). Some newer NVMe drives support SR-IOV natively too, so no additional hardware would be needed to split one and pass it through to VMs (I did not test this though). I'm not sure of the price of these cards, but CX3 cards are really cheap, while CX5/6 are getting more affordable too.

                    olivierlambert Vates 🪐 Co-Founder CEO

                    If you can afford dedicated storage, sure 🙂 For local storage, a DPU is a good option (it should cost under 1.5k€ per card, probably less).

                      alexredston @olivierlambert

                      @olivierlambert @POleszkiewicz Thanks to you both for all of these ideas - I will have a go at changing the kernel and moving the NVMe to pass through in the first instance. Will report back on results.

                        alexredston @TodorPetkov

                        @TodorPetkov Top tip! Thank you - going to try this out

                          alexredston @TodorPetkov

                          @TodorPetkov That was very helpful. I've added acpi=off to grub and I am now able to get 128 "CPUs" running, which is double what I had before.

                          When I go beyond this, I get the following error when attempting to start the VM:

                          INTERNAL_ERROR(xenopsd internal error: Xenctrl.Error("22: Invalid argument"))

                          Architecture: x86_64
                          CPU op-mode(s): 32-bit, 64-bit
                          Byte Order: Little Endian
                          CPU(s): 128
                          On-line CPU(s) list: 0-127
                          Thread(s) per core: 1
                          Core(s) per socket: 32
                          Socket(s): 4
                          NUMA node(s): 1
                          Vendor ID: GenuineIntel
                          CPU family: 6
                          Model: 79
                          Model name: Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
                          Stepping: 1
                          CPU MHz: 2194.589
                          BogoMIPS: 4389.42
                          Hypervisor vendor: Xen
                          Virtualization type: full
                          L1d cache: 32K
                          L1i cache: 32K
                          L2 cache: 256K
                          L3 cache: 56320K
                          NUMA node0 CPU(s): 0-127

                          Going to move some stuff around and try passthrough for the M.2 drives next, as IOPS is now the biggest performance barrier for this particular workload.

                            alexredston @olivierlambert

                            @olivierlambert If I follow a similar approach with multiple VDIs and go RAID 1 with a 3-way mirror (integrity is critical), will I still see a similar read performance increase? I'm not so worried about the write penalty.

                              olivierlambert Vates 🪐 Co-Founder CEO

                              Yes, since you'll read from multiple disks. You shouldn't see any difference in write performance though.

                                alexredston @olivierlambert

                                @olivierlambert Interestingly, so far I've seen about a 40% increase in write performance and IOPS from adjusting the scheduler in dom0 by adding elevator=noop as a kernel parameter and a further 10% from repeating the same on the VM.

                                I'm going to experiment next with migrating the disks so that the mirror is achieved in the VM with three separate pifs instead of in dom0. Then may try other more radical approaches like passthrough.
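
To confirm that a scheduler change such as elevator=noop actually took effect, the active scheduler can be read back from sysfs, where the kernel marks it with square brackets. The sketch below is an illustration only: `active_scheduler` and `report` are names invented for this example. On newer multi-queue kernels the equivalent of noop is listed as `none`.

```python
from __future__ import annotations

import re
from pathlib import Path


def active_scheduler(scheduler_line: str) -> str | None:
    """Return the bracketed (active) scheduler, e.g. "noop" from "[noop] deadline cfq".

    Returns None when the line has no bracketed entry.
    """
    match = re.search(r"\[([^\]]+)\]", scheduler_line)
    return match.group(1) if match else None


def report(sys_block: Path = Path("/sys/block")) -> dict[str, str | None]:
    """Map each block device name to its active I/O scheduler."""
    result: dict[str, str | None] = {}
    for sched_file in sorted(sys_block.glob("*/queue/scheduler")):
        device = sched_file.parent.parent.name  # /sys/block/<device>/queue/scheduler
        result[device] = active_scheduler(sched_file.read_text())
    return result


if __name__ == "__main__":
    for device, scheduler in report().items():
        print(f"{device}: {scheduler}")
```

Running it in both dom0 and the VM shows whether the parameter was picked up in each place.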

                                  olivierlambert Vates 🪐 Co-Founder CEO

                                  That's a very nice increase. Indeed, noop is the best option for NVMe devices.

                                    alexredston @olivierlambert

                                    @olivierlambert will repeat on everything!

                                      alexredston @olivierlambert

                                      @olivierlambert Thanks to everyone's great advice. I've now managed a further increase of more than 20-fold by using PCI passthrough on the 3 x NVMe drives. The machine is only PCIe 3.x, but I'm still getting 10.5 GB/s read in the fio test and just over 1 GB/s write.

                                      My bottleneck for compiling is now once again the CPUs.

                                      I seem to be unable to exceed 128 CPUs. I was hoping to assign more, as the host has 176, but it is struggling. At the moment my build is pinning those 128 at 100% CPU for 30 minutes, so this could potentially offer a fairly significant improvement.

                                      Overall quite pleased to be squeezing this much performance out of some old HPE Gen 9 hardware. May look at adding another disk to the mirror, but at some point the write penalty may outweigh the excellent read performance. I've chosen slots so that each NVMe's PCIe lanes are connected to a different host CPU.

                                      May try another experiment with smaller PCIe devices and bifurcation and see if I can test the upper limits of the throughput. 9 slots to play with!
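
For a rough sense of how close 10.5 GB/s is to the upper limit, here is a back-of-the-envelope calculation. It assumes each of the three NVMe drives sits on a PCIe 3.0 x4 link (the thread does not state the link width) and ignores packet and protocol overhead, so the real ceiling is somewhat lower.

```python
# PCIe 3.0 signalling: 8 GT/s per lane with 128b/130b line encoding.
GT_PER_S = 8.0
ENCODING_EFFICIENCY = 128 / 130

LANES_PER_DRIVE = 4        # assumption: x4 link per drive, not stated in the thread
DRIVES = 3                 # from the thread: 3 x NVMe in the mirror
MEASURED_READ_GB_S = 10.5  # from the thread: fio read result


def lane_bandwidth_gb_s() -> float:
    """Raw per-lane bandwidth in GB/s (one direction), before protocol overhead."""
    return GT_PER_S * ENCODING_EFFICIENCY / 8  # 8 bits per byte


def aggregate_ceiling_gb_s(drives: int = DRIVES, lanes: int = LANES_PER_DRIVE) -> float:
    """Combined link bandwidth when reads are spread across all mirror members."""
    return lane_bandwidth_gb_s() * lanes * drives


if __name__ == "__main__":
    ceiling = aggregate_ceiling_gb_s()
    print(f"link ceiling: {ceiling:.2f} GB/s")
    print(f"measured/ceiling: {MEASURED_READ_GB_S / ceiling:.0%}")
```

Under these assumptions the three links total about 11.8 GB/s, so the measured read rate is already close to the link limit. More throughput would need more devices or lanes rather than further tuning.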

                                        olivierlambert Vates 🪐 Co-Founder CEO

                                        Indeed, PCI passthrough helps tremendously to reach near bare-metal performance (on the storage side). Now the CPU will struggle to keep up. You can try statically partitioning your hardware at the CPU level, i.e. pinning vCPUs to real CPUs, ideally for all your VMs, so you can be 100% sure the Xen scheduler will never affect your VM performance.
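
A static partition like the one described above can be planned before touching the host. The sketch below only computes disjoint CPU ranges and formats them as comma-separated lists. The VM names, the number of CPUs set aside for dom0, and the helper names are hypothetical. The `VCPUs-params:mask` parameter is, as far as I know, how XCP-ng exposes pinning through `xe`, so check it against the current documentation before relying on it.

```python
from __future__ import annotations


def partition_cpus(
    host_cpus: int, demands: dict[str, int], reserved: int = 0
) -> dict[str, list[int]]:
    """Assign each VM a disjoint, contiguous block of physical CPU IDs.

    The first `reserved` CPUs are left unassigned (for example for dom0).
    Contiguous blocks are a simplification: NUMA/socket layout is ignored.
    """
    plan: dict[str, list[int]] = {}
    next_cpu = reserved
    for vm, count in demands.items():
        if next_cpu + count > host_cpus:
            raise ValueError(f"not enough physical CPUs left for {vm}")
        plan[vm] = list(range(next_cpu, next_cpu + count))
        next_cpu += count
    return plan


def mask_string(cpus: list[int]) -> str:
    """Format CPU IDs as a comma-separated list, e.g. "0,1,2"."""
    return ",".join(str(cpu) for cpu in cpus)


if __name__ == "__main__":
    # Hypothetical layout for the 176-CPU host mentioned in the thread.
    plan = partition_cpus(176, {"build-vm": 128, "other-vm": 32}, reserved=16)
    for vm, cpus in plan.items():
        # <uuid> is a placeholder; the command is printed, not executed.
        print(f"# {vm}: xe vm-param-set uuid=<uuid> VCPUs-params:mask={mask_string(cpus)}")
```

Because the blocks never overlap, no two VMs compete for the same physical CPU, which is the point of partitioning statically.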

                                          alexredston @olivierlambert

                                          @olivierlambert

                                          Now attempting to push this further - when I go beyond 128 CPUs on the VM configuration I am getting the following:

                                          vm.start
                                          {
                                            "id": "d9b39e2d-a95b-b8bf-dc5f-01d176c49c70",
                                            "bypassMacAddressesCheck": false,
                                            "force": false
                                          }
                                          {
                                            "code": "INTERNAL_ERROR",
                                            "params": [
                                              "xenopsd internal error: Xenctrl.Error(\"22: Invalid argument\")"
                                            ],
                                            "call": {
                                              "method": "VM.start",
                                              "params": [
                                                "OpaqueRef:82abc808-84b8-4bc5-9db9-2e6ef20a5e4a",
                                                false,
                                                false
                                              ]
                                            },
                                            "message": "INTERNAL_ERROR(xenopsd internal error: Xenctrl.Error(\"22: Invalid argument\"))",
                                            "name": "XapiError",
                                            "stack": "XapiError: INTERNAL_ERROR(xenopsd internal error: Xenctrl.Error(\"22: Invalid argument\"))
                                              at Function.wrap (file:///opt/xo/xo-builds/xen-orchestra-202401131411/packages/xen-api/_XapiError.mjs:16:12)
                                              at file:///opt/xo/xo-builds/xen-orchestra-202401131411/packages/xen-api/transports/json-rpc.mjs:35:21"
                                          }
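
The number in `Xenctrl.Error("22: Invalid argument")` is a standard errno value. The sketch below pulls it out of a trimmed copy of the response and maps it to its symbolic name. The helper name `xenctrl_errno` is invented for the example, and the regular expression only assumes the message format shown in this post.

```python
from __future__ import annotations

import errno
import json
import re

# Trimmed copy of the response above (only the fields needed here).
RESPONSE = r'''
{
  "code": "INTERNAL_ERROR",
  "message": "INTERNAL_ERROR(xenopsd internal error: Xenctrl.Error(\"22: Invalid argument\"))"
}
'''


def xenctrl_errno(message: str) -> tuple[int, str] | None:
    """Extract (errno, text) from a 'Xenctrl.Error("<n>: <text>")' message, if present."""
    match = re.search(r'Xenctrl\.Error\("(\d+): ([^"]+)"\)', message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    parsed = xenctrl_errno(json.loads(RESPONSE)["message"])
    if parsed is not None:
        number, text = parsed
        # errno.errorcode maps numbers to names such as "EINVAL".
        print(number, errno.errorcode.get(number, "unknown"), text)
```

EINVAL only says that the hypervisor rejected one of the arguments it was given. That is consistent with a vCPU-count limit, but the error alone does not identify which argument was refused.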
                                          
                                            olivierlambert Vates 🪐 Co-Founder CEO

                                            You can't go beyond 128 vCPUs at the moment. I think it's a QEMU limitation in XCP-ng (or something like that, pinging @andSmv).
