XCP-ng

    Best CPU performance settings for HP DL325/AMD EPYC servers?

      dthenot Vates 🪐 XCP-ng Team @olivierlambert

      @olivierlambert @S-Pam Indeed, that's normal: Dom0 doesn't see the NUMA information, and the hypervisor handles the compute and memory allocation. You can read the wiki page about manipulating VM allocation with the NUMA architecture if you want, but in normal use cases it's not worth the effort.

        Forza @dthenot

        Thanks!

        The page at https://wiki.xenproject.org/wiki/Xen_on_NUMA_Machines explains it pretty well.

        xl info -n shows the NUMA configuration:

        # xl info -n
        host                   : srv01
        release                : 4.19.0+1
        version                : #1 SMP Tue Mar 30 22:34:15 CEST 2021
        machine                : x86_64
        nr_cpus                : 48
        max_cpu_id             : 47
        nr_nodes               : 8
        cores_per_socket       : 24
        threads_per_core       : 2
        cpu_mhz                : 2794.799
        hw_caps                : 178bf3ff:76f8320b:2e500800:244037ff:0000000f:219c91a9:00400004:00000500
        virt_caps              : pv hvm hvm_directio pv_directio hap shadow
        total_memory           : 65367
        free_memory            : 10394
        sharing_freed_memory   : 0
        sharing_used_memory    : 0
        outstanding_claims     : 0
        free_cpus              : 0
        cpu_topology           :
        cpu:    core    socket     node
          0:       0        0        0
          1:       0        0        0
          2:       1        0        0
          3:       1        0        0
          4:       2        0        0
          5:       2        0        0
          6:       4        0        1
          7:       4        0        1
          8:       5        0        1
          9:       5        0        1
         10:       6        0        1
         11:       6        0        1
         12:       8        0        2
         13:       8        0        2
         14:       9        0        2
         15:       9        0        2
         16:      10        0        2
         17:      10        0        2
         18:      12        0        3
         19:      12        0        3
         20:      13        0        3
         21:      13        0        3
         22:      14        0        3
         23:      14        0        3
         24:      16        0        4
         25:      16        0        4
         26:      17        0        4
         27:      17        0        4
         28:      18        0        4
         29:      18        0        4
         30:      20        0        5
         31:      20        0        5
         32:      21        0        5
         33:      21        0        5
         34:      22        0        5
         35:      22        0        5
         36:      24        0        6
         37:      24        0        6
         38:      25        0        6
         39:      25        0        6
         40:      26        0        6
         41:      26        0        6
         42:      28        0        7
         43:      28        0        7
         44:      29        0        7
         45:      29        0        7
         46:      30        0        7
         47:      30        0        7
        device topology        :
        device           node
        0000:00:03.1      6
        0000:c3:00.0      0
        0000:80:07.1      2
        0000:c0:03.1      0
        0000:00:08.0      6
        0000:c0:08.0      0
        0000:c5:00.3      0
        0000:00:18.3      6
        0000:02:00.2      6
        0000:40:05.0      4
        0000:c2:00.2      0
        0000:80:02.0      2
        0000:43:00.0      4
        0000:40:03.1      4
        0000:c1:00.4      0
        0000:40:08.0      4
        0000:c5:00.1      0
        0000:00:18.1      6
        0000:02:00.0      6
        0000:00:01.0      6
        0000:c2:00.0      0
        0000:42:00.2      4
        0000:80:05.0      2
        0000:c0:01.0      0
        0000:40:01.2      4
        0000:01:00.2      6
        0000:c1:00.2      0
        0000:80:03.1      2
        0000:00:04.0      6
        0000:80:08.0      2
        0000:42:00.0      4
        0000:c0:04.0      0
        0000:00:14.3      6
        0000:40:01.0      4
        0000:82:00.2      2
        0000:01:00.0      6
        0000:c1:00.0      0
        0000:41:00.2      4
        0000:00:07.0      6
        0000:c0:07.0      0
        0000:40:04.0      4
        0000:00:00.2      6
        0000:82:00.0      2
        0000:80:01.0      2
        0000:c0:00.2      0
        0000:00:18.6      6
        0000:41:00.0      4
        0000:81:00.2      2
        0000:c0:01.5      0
        0000:40:07.0      4
        0000:00:00.0      6
        0000:80:04.0      2
        0000:c0:00.0      0
        0000:40:00.2      4
        0000:00:08.1      6
        0000:c0:08.1      0
        0000:00:18.4      6
        0000:02:00.3      6
        0000:81:00.0      2
        0000:00:03.0      6
        0000:80:07.0      2
        0000:c0:03.0      0
        0000:40:00.0      4
        0000:80:00.2      2
        0000:40:08.1      4
        0000:c5:00.2      0
        0000:00:18.2      6
        0000:00:01.1      6
        0000:42:00.3      4
        0000:c0:01.1      0
        0000:40:03.0      4
        0000:80:00.0      2
        0000:c5:00.0      0
        0000:00:18.0      6
        0000:80:08.1      2
        0000:42:00.1      4
        0000:40:01.1      4
        0000:c1:00.1      0
        0000:80:03.0      2
        0000:00:07.1      6
        0000:c0:07.1      0
        0000:80:01.1      2
        0000:00:02.0      6
        0000:00:18.7      6
        0000:c0:02.0      0
        0000:40:07.1      4
        0000:00:14.0      6
        0000:c3:00.2      0
        0000:00:05.0      6
        0000:00:18.5      6
        0000:c0:05.0      0
        0000:40:02.0      4
        numa_info              :
        node:    memsize    memfree    distances
           0:     10240       1505      10,11,11,11,11,11,11,11
           1:      8192       1918      11,10,11,11,11,11,11,11
           2:      8192       1932      11,11,10,11,11,11,11,11
           3:      8192        847      11,11,11,10,11,11,11,11
           4:      8192        912      11,11,11,11,10,11,11,11
           5:      8192        912      11,11,11,11,11,10,11,11
           6:      8192       1038      11,11,11,11,11,11,10,11
           7:      8179       1326      11,11,11,11,11,11,11,10
        xen_major              : 4
        xen_minor              : 13
        xen_extra              : .1-9.9.1
        xen_version            : 4.13.1-9.9.1
        xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
        xen_scheduler          : credit
        xen_pagesize           : 4096
        platform_params        : virt_start=0xffff800000000000
        xen_changeset          : 6278553325a9, pq 70d4b5941e4f
        xen_commandline        : dom0_mem=4304M,max:4304M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G console=vga vga=mode-0x0311 sched-gran=core
        cc_compiler            : gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
        cc_compile_by          : mockbuild
        cc_compile_domain      : [unknown]
        cc_compile_date        : Thu Feb  4 18:23:36 CET 2021
        build_id               : a76c6ee84d87600fa0d520cd8ecb8113b1105af4
        xend_config_format     : 4
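To read the cpu_topology table above at a glance, here is a minimal Python sketch that groups pCPU ids by NUMA node. The function name `cpus_per_node` and the shortened `SAMPLE` text are illustrative assumptions, not part of any Xen tooling; the parser only relies on the column layout visible in the output above (cpu, core, socket, node).

```python
import re
from collections import defaultdict

# Matches table rows such as "  6:       4        0        1"
ROW_RE = re.compile(r"^(\d+):\s+(\d+)\s+(\d+)\s+(\d+)$")


def cpus_per_node(xl_info_text):
    """Group pCPU ids by NUMA node from the cpu_topology table of `xl info -n`."""
    nodes = defaultdict(list)
    in_table = False
    for line in xl_info_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("cpu_topology"):
            in_table = True
            continue
        if not in_table:
            continue
        match = ROW_RE.match(stripped)
        if match:
            cpu, _core, _socket, node = map(int, match.groups())
            nodes[node].append(cpu)
        elif stripped.startswith("cpu:"):
            continue  # column header line
        else:
            break  # first non-row line ends the table
    return dict(nodes)


# Shortened excerpt of the output above, used only for illustration.
SAMPLE = """\
cpu_topology           :
cpu:    core    socket     node
  0:       0        0        0
  1:       0        0        0
  6:       4        0        1
  7:       4        0        1
device topology        :
device           node
0000:00:03.1      6
"""

if __name__ == "__main__":
    for node, cpus in sorted(cpus_per_node(SAMPLE).items()):
        print(f"node {node}: cpus {cpus}")
```

Fed the full output, this should list six pCPUs (three cores with two threads each) for each of the eight nodes.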
        

        I wonder if the CPU scheduler can do NUMA node in addition to core, CPU and socket?

          Forza @Forza

          @s-pam said in Best CPU performance settings for HP DL325/AMD EPYC servers?:

          I wonder if the CPU scheduler can do NUMA node in addition to core, CPU and socket?

          I'll answer myself here. It seems that Xen already does this by default:

          NUMA aware scheduling, as it has been included in Xen 4.3, means that it is possible for vCPUs of a domain to just prefer to run on the pCPUs of some NUMA node. The vCPUs will still be allowed, though, to run on every pCPU, guaranteeing much more flexibility than having to use pinning.

            Forza @Forza

            Sorry for spamming the thread. 🙂

            I have two identical servers (srv01 and srv02) with AMD EPYC 7402P 24-core CPUs. On srv02 I enabled the "LLC as NUMA Node" option in the BIOS.

            I've done some quick benchmarks with Sysbench on Ubuntu 20.10 with 12 assigned cores. Command line: sysbench cpu run --threads=12

            It would seem that in this test the NUMA option is much faster, 194187 events vs 103769 events. Perhaps I am misunderstanding how sysbench works?
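To put the two sysbench numbers in proportion, here is a small Python sketch of the arithmetic. It assumes both runs used the same fixed duration, so that the event count is proportional to throughput; the variable names are illustrative.

```python
def speedup(baseline_events, tuned_events):
    """Throughput ratio between two runs of equal duration."""
    return tuned_events / baseline_events


events_default = 103769   # server without the LLC-as-NUMA option
events_llc_numa = 194187  # server with the LLC-as-NUMA option

ratio = speedup(events_default, events_llc_numa)
print(f"{ratio:.2f}x the events, i.e. {100 * (ratio - 1):.0f}% more")
```

That is roughly 1.87 times as many events in this particular test.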

            b65ec3da-4b1d-430e-b90d-02542fe59552-image.png

            With 7-zip the gain is much less, but still meaningful. A little slower in single-threaded performance but quite a bit faster in multi-threaded mode.
            f9592ee9-d327-4ce1-9e34-0ee86280d9e9-image.png

              Forza

              I ran a simulation with Dassault's SIMULIA Abaqus FEA. The run time went down from 75 to 60 minutes, so a big win there too 😃

                olivierlambert Vates 🪐 Co-Founder CEO

                It's not spam, it's interesting feedback 🙂 Never hesitate to share it!

                  Forza @olivierlambert

                  @olivierlambert said in Best CPU performance settings for HP DL325/AMD EPYC servers?:

                  It's not spam, it's interesting feedback 🙂 Never hesitate to share it!

                  Thanks!

                  The last benchmark is a real-world example. We have a master's thesis student who needs to run approximately 150 simulations as part of the program, and she is pretty thrilled to be saving several days of run time 🙂
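As a rough back-of-the-envelope sketch in Python, the Abaqus result (75 to 60 minutes) can be scaled up to a batch of runs. The big assumption, not stated in the thread, is that every run behaves like the one that was benchmarked; real simulations may well be longer, in which case the saving grows accordingly. The helper name `minutes_saved` is illustrative.

```python
def minutes_saved(before_min, after_min, runs=1):
    """Total minutes saved over a number of identical runs."""
    return (before_min - after_min) * runs


before, after = 75, 60                # minutes per run, from the benchmark
reduction = (before - after) / before  # fraction of run time saved
speedup = before / after               # throughput ratio

# Assumption: 150 runs, each identical to the benchmarked one.
total_saved_hours = minutes_saved(before, after, runs=150) / 60

print(f"{reduction:.0%} less run time, {speedup:.2f}x faster")
print(f"about {total_saved_hours:.1f} hours saved over 150 such runs")
```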

                    olivierlambert Vates 🪐 Co-Founder CEO

                    That's great! I think @dthenot would be interested in reading this 😛

                      dthenot Vates 🪐 XCP-ng Team @Forza

                      @s-pam Damn, computers really are magic. I'm very surprised by these results.
                      Does NONUMA really mean that no NUMA info is given by the firmware?
                      I have no idea how the Xen scheduler uses this information. I know that the memory allocator stripes the VM's memory across all the nodes the VM is configured to be allocated on. As such, it would mean the scheduler is doing good work scheduling the vCPUs on nodes, without even knowing about the memory placement of the process currently running inside the guest.
                      Did you touch anything in the guest's config? It's an interesting result nonetheless. Can you share the memory allocation of the VM? You can obtain it with xl debug-keys u; xl dmesg from Dom0.
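For anyone who wants to summarise that dump rather than read it by eye, here is a hedged Python sketch. The two `xl` commands are the ones named above; the line format the parser expects ("Domain N (total: X):" followed by "Node N: X" lines, counted in pages) is an assumption about what the 'u' debug key prints and may differ between Xen versions. The numbers in `SAMPLE` are made up for illustration.

```python
import re
import subprocess

DOMAIN_RE = re.compile(r"Domain (\d+) \(total: (\d+)\)")
NODE_RE = re.compile(r"^\s*(?:\(XEN\))?\s*Node (\d+): (\d+)")


def fetch_numa_dump():
    """Trigger the 'u' debug key, then return the hypervisor log (run in Dom0 as root)."""
    subprocess.run(["xl", "debug-keys", "u"], check=True)
    result = subprocess.run(["xl", "dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def parse_domain_numa_memory(dmesg_text):
    """Return {domain_id: {node: pages}} for the assumed dump format."""
    result = {}
    current = None
    for line in dmesg_text.splitlines():
        domain = DOMAIN_RE.search(line)
        if domain:
            current = int(domain.group(1))
            result[current] = {}
            continue
        node = NODE_RE.match(line)
        if node and current is not None:
            result[current][int(node.group(1))] = int(node.group(2))
    return result


# Made-up sample in the assumed format, not real output from these servers.
SAMPLE = """\
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 1101824):
(XEN)     Node 0: 550912
(XEN)     Node 1: 550912
(XEN) Domain 3 (total: 2097152):
(XEN)     Node 0: 2097152
"""

if __name__ == "__main__":
    page_size = 4096  # xen_pagesize reported by xl info
    for dom, nodes in parse_domain_numa_memory(SAMPLE).items():
        for node, pages in nodes.items():
            print(f"domain {dom} node {node}: {pages * page_size // 2**20} MiB")
```

On a real host, replace `SAMPLE` with the return value of `fetch_numa_dump()`.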

                        Forza @dthenot

                        @dthenot said in Best CPU performance settings for HP DL325/AMD EPYC servers?:

                        @s-pam Damn, computers really are magic. I'm very surprised by these results.
                        Does NONUMA really mean that no NUMA info is given by the firmware?
                        I have no idea how the Xen scheduler uses this information. I know that the memory allocator stripes the VM's memory across all the nodes the VM is configured to be allocated on. As such, it would mean the scheduler is doing good work scheduling the vCPUs on nodes, without even knowing about the memory placement of the process currently running inside the guest.
                        Did you touch anything in the guest's config? It's an interesting result nonetheless. Can you share the memory allocation of the VM? You can obtain it with xl debug-keys u; xl dmesg from Dom0.

                        I can't look at the dmesg today as I'm home with a cold...🤧

                        The configuration of the two servers is identical, except that on the "NUMA" one I enabled Last-Level Cache as NUMA node in the BIOS. With this enabled, xl info -n shows 8 NUMA nodes.

                        The VM was identical too. I just made a fast clone and ran it in parallel on each server. I also migrated the VMs back and forth between the two servers to verify that the results were correct. 🙂

                        I did experiment with xl cpupool-numa-split but this did not generate good results for multithreaded workloads. I believe this is because VMs get locked to use only as many cores as there are in each NUMA domain and this 7402P CPU has 24 cores with only 3 in each CCX.
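The sizing problem can be sketched with a few lines of Python, using the figures from the xl info -n output earlier in the thread (48 pCPUs, 8 nodes, 2 threads per core) and the 12-vCPU test VM. It assumes the split creates one equally sized pool per NUMA node; the function name `numa_pool_size` is illustrative.

```python
def numa_pool_size(nr_cpus, nr_nodes, threads_per_core):
    """Return (pCPUs, physical cores) per pool, assuming one equal pool per node."""
    pcpus = nr_cpus // nr_nodes
    return pcpus, pcpus // threads_per_core


pool_pcpus, pool_cores = numa_pool_size(nr_cpus=48, nr_nodes=8, threads_per_core=2)

vm_vcpus = 12  # vCPUs assigned to the benchmark VM
fits_in_one_pool = vm_vcpus <= pool_pcpus

print(f"each pool: {pool_pcpus} pCPUs ({pool_cores} cores)")
print(f"12-vCPU VM fits in one pool: {fits_in_one_pool}")
```

Whether the limit is counted in cores or in threads, a 12-vCPU VM is larger than a single per-node pool on this CPU, which is consistent with the poor multithreaded results.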

                          dthenot Vates 🪐 XCP-ng Team @Forza

                          @s-pam

                          I can't look at the dmesg today as I'm home with a cold...

                          I hope you get well soon 🙂

                          I did experiment with xl cpupool-numa-split but this did not generate good results for multithreaded workloads. I believe this is because VMs get locked to use only as many cores as there are in each NUMA domain.

                          Indeed, a VM in a pool gets locked to the cores of that pool, and its maximum number of vCPUs is the number of cores in the pool. This is useful if you need to isolate the VM completely.
                          You need to be careful when benchmarking these things, because the memory allocation of a running VM is not moved, while its vCPUs will still run on the pinned node. I don't remember exactly whether a cpupool behaved differently from simple pinning in that case, though. I do remember that hard pinning a guest vCPU definitely did not move its memory. You could only change this before booting.
