XCP-ng

    More than 64 vCPU on Debian11 VM and AMD EPYC

      olivierlambert Vates 🪐 Co-Founder CEO

      https://github.com/xenserver/xen.pg/blob/XS-8.3.x/patches/0001-Partially-revert-08754333892-hvmloader-limit-CPUs-ex.patch

      My source is one of the main Xen devs 🙂 If you have working setups with more than 64 vCPUs, I'm curious!

        TodorPetkov @olivierlambert

        @olivierlambert I have a working VM (booted) that shows 2 sockets with 64 CPUs each. I am running sysbench with --max-threads=128 and it shows a load of 128.
        I played a bit: with normal ACPI in the VM OS and with ACPI disabled via xe vm-param-set platform:acpi=0, the results are the same. I am attaching lscpu and dmesg from the second case (ACPI is untouched in the VM, but disabled with the xe command).
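
        Roughly, the dom0 side of that test looks like this (the VM name and UUID below are placeholders):

        # look up the VM UUID, then disable ACPI for the guest and restart it
        xe vm-list name-label=<vm-name> params=uuid
        xe vm-param-set uuid=<vm-uuid> platform:acpi=0
        xe vm-shutdown uuid=<vm-uuid>
        xe vm-start uuid=<vm-uuid>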

        Let me know if anything else is needed.

        dmesg1.txt lscpu1.txt

          TodorPetkov

          On second thought, I was not clear in the beginning. I don't expect to see 1 socket with 128 vCPUs in the VM, but maybe 2 sockets with the vCPUs split between them if I assign more than 64 to the VM. Initially I had 1 socket with 64 CPUs, and after turning ACPI off (either in grub or in the VM config itself), a second socket appeared in the VM with the rest of the CPUs. Even funnier, turning ACPI off while running the cloud kernel of Debian makes the VM see only one CPU.

            Hans

            I am replying since Olivier wanted to hear from others with a lot of cores.

            We are running XCP-ng 8.1 on a host with dual AMD EPYC 7713 64-core processors. With hyperthreading that is a total of 256 logical cores. Since we are only able to assign up to 128 cores to a VM, we have turned hyperthreading off. The VM is running Ubuntu 18. We should probably lower the number of vCPUs to 120 or so for best performance, but at the moment it is 128.
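
            For anyone reproducing this, a minimal sketch of how the vCPU count and topology can be set from dom0 with the xe CLI (the UUID is a placeholder, and the VM must be halted first):

            # give the VM 128 vCPUs, presented as 2 sockets x 64 cores
            xe vm-param-set uuid=<vm-uuid> VCPUs-max=128
            xe vm-param-set uuid=<vm-uuid> VCPUs-at-startup=128
            xe vm-param-set uuid=<vm-uuid> platform:cores-per-socket=64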

            We can see in the Performance Graphs that all the cores are active:

            [screenshot: XCP-ng performance graphs showing all vCPUs active]

            The output of lspci and lscpu is:

            lspci
            00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
            00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
            00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
            00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
            00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
            00:02.0 VGA compatible controller: Device 1234:1111
            00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 02)
            hansb@FVCOM-U18:~$ lscpu
            Architecture:        x86_64
            CPU op-mode(s):      32-bit, 64-bit
            Byte Order:          Little Endian
            CPU(s):              128
            On-line CPU(s) list: 0-127
            Thread(s) per core:  1
            Core(s) per socket:  64
            Socket(s):           2
            NUMA node(s):        1
            Vendor ID:           AuthenticAMD
            CPU family:          25
            Model:               1
            Model name:          AMD EPYC 7713 64-Core Processor
            Stepping:            1
            CPU MHz:             1996.267
            BogoMIPS:            3992.57
            Hypervisor vendor:   Xen
            Virtualization type: full
            L1d cache:           32K
            L1i cache:           32K
            L2 cache:            512K
            L3 cache:            262144K
            NUMA node0 CPU(s):   0-127
            Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch bpext ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat umip rdpid
            
              alexredston

              I'm getting stuck with this too, on a Debian 11 VM: a DL580 with 4 x Xeon E7-8880 v4 plus 3 Samsung 990 Pro 4TB in RAID 1.

              Effectively the XCP-ng host has 176 "cores", i.e. with the hyperthreading, but I'm only able to use 64 of them. I was also only able to configure the VM with 120 cores, as 30 per socket across 4 sockets (the physical architecture has 4 sockets), but I think only 64 actually work.

              I'm compiling AOSP; for a clean build the VM sits at max CPU for 30 minutes, and I would dearly like to reduce that time, as it could be a compile after a tiny change, so progress is painfully slow. The other thing is the linking phase of this build: I'm only seeing 7000 IOPS on the last 10-minute display. I realize this may under-read since the traffic could be quite "bursty", but with 3 mirrored Samsung 990 Pro drives I would expect more. This makes this part heavily disk-bound; the overall process takes 70 minutes.

                olivierlambert Vates 🪐 Co-Founder CEO

                If you are heavily relying on disk perf, either:

                1. use multiple VDIs and RAID0 them inside the guest (you'll more than double the perf, because tapdisk is single-threaded) - see the sketch below
                2. PCI passthrough a drive to the VM
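
                A minimal sketch of option 1 inside the guest, assuming the extra VDIs show up as /dev/xvdb and /dev/xvdc (placeholder device names; any data on them is destroyed):

                # stripe the two VDIs together and put a filesystem on top
                mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
                mkfs.ext4 /dev/md0
                mount /dev/md0 /mnt/fast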
                  POleszkiewicz @olivierlambert

                  @olivierlambert said in More than 64 vCPU on Debian11 VM and AMD EPYC:

                   If you are heavily relying on disk perf, either:

                   1. use multiple VDIs and RAID0 them inside the guest (you'll more than double the perf, because tapdisk is single-threaded)
                   2. PCI passthrough a drive to the VM

                   Another option is to do NVMe-oF plus SR-IOV on the NIC: the performance is pretty similar to bare metal with PCI passthrough, yet one NVMe drive can be divided between VMs (if it supports namespaces), and you can attach NVMe from more than one source to the VM (for redundancy).
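
                   A minimal sketch of the guest side with nvme-cli, assuming an RDMA-capable VF and an already exported target (the address, port and NQN are placeholders):

                   # discover the subsystems offered by the target, then connect to one
                   nvme discover -t rdma -a 192.0.2.10 -s 4420
                   nvme connect -t rdma -a 192.0.2.10 -s 4420 -n nqn.2024-01.example:nvme1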

                     olivierlambert Vates 🪐 Co-Founder CEO

                    DPU is also an option (it's exactly what we do with Kalray's DPUs)

                       TodorPetkov @alexredston

                       @alexredston What kernel do you use? Can you show the kernel boot parameters (/proc/cmdline)? In our case we used the Debian 11 image from their website, which had the cloud kernel and ACPI on by default. Once we switched to the regular kernel and turned ACPI off, we saw all the vCPUs in the VM.
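
                       To check what the VM is actually running (the package name assumes a stock Debian 11 amd64 guest):

                       cat /proc/cmdline               # current kernel boot parameters
                       uname -r                        # a "-cloud-amd64" suffix means the cloud kernel
                       apt install linux-image-amd64   # pulls in the regular kernel instead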

                         POleszkiewicz @olivierlambert

                         @olivierlambert What exactly do you support from Kalray? Could you tell us more?

                           olivierlambert Vates 🪐 Co-Founder CEO

                          https://xcp-ng.org/blog/2021/07/12/dpus-and-the-future-of-virtualization/

                          https://xcp-ng.org/blog/2021/12/20/dpu-for-storage-a-first-look/

                             POleszkiewicz @olivierlambert

                             @olivierlambert Interesting, but where is the benefit over NVMe-oF + SR-IOV, which is doable on a Mellanox CX3, or better a CX5 and up? Offloading dom0 work to specialized hardware is interesting, but what I see in these articles is basically equal to connecting to an NVMe-oF target via an SR-IOV NIC, which has been doable for quite a while without any changes in XCP-ng?

                               olivierlambert Vates 🪐 Co-Founder CEO

                               It's using local NVMe drives and splitting them, so there is no need for external storage (but you can also use remote NVMe as with -oF, and potentially multiple hosts in HCI mode).

                                 POleszkiewicz @olivierlambert

                                 @olivierlambert With NVMe-oF I can split them easily too (one target per namespace), and I actually get redundancy compared to a local device (connect to two targets on different hosts and RAID1 them in the VM). Some newer NVMe drives support SR-IOV natively too, so no additional hardware would be needed to split one and pass it through to VMs (I did not test this, though). I'm not sure of the price of these cards, but CX3s are really cheap, while CX5/6 are getting more affordable too.
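
                                 As an illustration of the namespace split, a rough nvme-cli sketch (the sizes and IDs are placeholders, and the controller must actually support multiple namespaces):

                                 # carve a ~50 GiB namespace out of the drive and attach it to controller 0
                                 nvme create-ns /dev/nvme0 --nsze=104857600 --ncap=104857600 --flbas=0
                                 nvme attach-ns /dev/nvme0 --namespace-id=1 --controllers=0
                                 nvme ns-rescan /dev/nvme0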

                                   olivierlambert Vates 🪐 Co-Founder CEO

                                   If you can afford dedicated storage, sure 🙂 For local storage, a DPU is a good option (it should be around 1.5k€ per card, probably less).

                                     alexredston @olivierlambert

                                     @olivierlambert @POleszkiewicz Thanks to you both for all of these ideas. I will have a go at changing the kernel and moving the NVMe to passthrough in the first instance. Will report back on results.

                                       alexredston @TodorPetkov

                                      @TodorPetkov Top tip! Thank you - going to try this out

                                         alexredston @TodorPetkov

                                         @TodorPetkov that was very helpful. I've added acpi=off to grub and I am now able to get 128 "CPUs" running, which is double what I had before.
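
                                         For anyone following along, the change amounts to something like this inside the guest (assuming a standard Debian grub setup):

                                         # in /etc/default/grub, append acpi=off to the default kernel command line
                                         GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi=off"
                                         # then regenerate the grub config and reboot
                                         update-grub
                                         reboot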

                                         When I go beyond this, I get the following error when attempting to start the VM:

                                        INTERNAL_ERROR(xenopsd internal error: Xenctrl.Error("22: Invalid argument"))

                                        Architecture: x86_64
                                        CPU op-mode(s): 32-bit, 64-bit
                                        Byte Order: Little Endian
                                        CPU(s): 128
                                        On-line CPU(s) list: 0-127
                                        Thread(s) per core: 1
                                        Core(s) per socket: 32
                                        Socket(s): 4
                                        NUMA node(s): 1
                                        Vendor ID: GenuineIntel
                                        CPU family: 6
                                        Model: 79
                                        Model name: Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
                                        Stepping: 1
                                        CPU MHz: 2194.589
                                        BogoMIPS: 4389.42
                                        Hypervisor vendor: Xen
                                        Virtualization type: full
                                        L1d cache: 32K
                                        L1i cache: 32K
                                        L2 cache: 256K
                                        L3 cache: 56320K
                                        NUMA node0 CPU(s): 0-127

                                         Going to move some stuff around and try passthrough for the M.2 drives next, as IOPS is now the biggest performance barrier for this particular workload.
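
                                         The plan for the passthrough, roughly (the PCI address and UUID below are placeholders):

                                         # on the host: find the NVMe controller's PCI address
                                         lspci | grep -i nvme
                                         # hide it from dom0 (takes effect after a host reboot)
                                         /opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:03:00.0)"
                                         # after the reboot, attach the device to the VM
                                         xe vm-param-set uuid=<vm-uuid> other-config:pci=0/0000:03:00.0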

                                           alexredston @olivierlambert

                                           @olivierlambert Following a similar approach with multiple VDIs, but going RAID 1 with a 3-way mirror (integrity is critical), will I still see a similar read performance increase? I'm not so worried about the write penalty.

                                             olivierlambert Vates 🪐 Co-Founder CEO

                                             Yes, since you'll read from multiple disks. You shouldn't see any difference in writes, though.
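
                                             In the guest that would look something like this, assuming the three VDIs appear as /dev/xvdb, /dev/xvdc and /dev/xvdd (placeholder device names):

                                             # 3-way mirror across the VDIs; reads can be served by any member
                                             mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/xvdb /dev/xvdc /dev/xvdd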
