XCP-ng

    Best CPU performance settings for HP DL325/AMD EPYC servers?

    • ForzaF Offline
      Forza @dthenot
      last edited by Forza

      Thanks!

      The link to https://wiki.xenproject.org/wiki/Xen_on_NUMA_Machines explains it pretty well.

      xl info -n shows the NUMA configuration:

      # xl info -n
      host                   : srv01
      release                : 4.19.0+1
      version                : #1 SMP Tue Mar 30 22:34:15 CEST 2021
      machine                : x86_64
      nr_cpus                : 48
      max_cpu_id             : 47
      nr_nodes               : 8
      cores_per_socket       : 24
      threads_per_core       : 2
      cpu_mhz                : 2794.799
      hw_caps                : 178bf3ff:76f8320b:2e500800:244037ff:0000000f:219c91a9:00400004:00000500
      virt_caps              : pv hvm hvm_directio pv_directio hap shadow
      total_memory           : 65367
      free_memory            : 10394
      sharing_freed_memory   : 0
      sharing_used_memory    : 0
      outstanding_claims     : 0
      free_cpus              : 0
      cpu_topology           :
      cpu:    core    socket     node
        0:       0        0        0
        1:       0        0        0
        2:       1        0        0
        3:       1        0        0
        4:       2        0        0
        5:       2        0        0
        6:       4        0        1
        7:       4        0        1
        8:       5        0        1
        9:       5        0        1
       10:       6        0        1
       11:       6        0        1
       12:       8        0        2
       13:       8        0        2
       14:       9        0        2
       15:       9        0        2
       16:      10        0        2
       17:      10        0        2
       18:      12        0        3
       19:      12        0        3
       20:      13        0        3
       21:      13        0        3
       22:      14        0        3
       23:      14        0        3
       24:      16        0        4
       25:      16        0        4
       26:      17        0        4
       27:      17        0        4
       28:      18        0        4
       29:      18        0        4
       30:      20        0        5
       31:      20        0        5
       32:      21        0        5
       33:      21        0        5
       34:      22        0        5
       35:      22        0        5
       36:      24        0        6
       37:      24        0        6
       38:      25        0        6
       39:      25        0        6
       40:      26        0        6
       41:      26        0        6
       42:      28        0        7
       43:      28        0        7
       44:      29        0        7
       45:      29        0        7
       46:      30        0        7
       47:      30        0        7
      device topology        :
      device           node
      0000:00:03.1      6
      0000:c3:00.0      0
      0000:80:07.1      2
      0000:c0:03.1      0
      0000:00:08.0      6
      0000:c0:08.0      0
      0000:c5:00.3      0
      0000:00:18.3      6
      0000:02:00.2      6
      0000:40:05.0      4
      0000:c2:00.2      0
      0000:80:02.0      2
      0000:43:00.0      4
      0000:40:03.1      4
      0000:c1:00.4      0
      0000:40:08.0      4
      0000:c5:00.1      0
      0000:00:18.1      6
      0000:02:00.0      6
      0000:00:01.0      6
      0000:c2:00.0      0
      0000:42:00.2      4
      0000:80:05.0      2
      0000:c0:01.0      0
      0000:40:01.2      4
      0000:01:00.2      6
      0000:c1:00.2      0
      0000:80:03.1      2
      0000:00:04.0      6
      0000:80:08.0      2
      0000:42:00.0      4
      0000:c0:04.0      0
      0000:00:14.3      6
      0000:40:01.0      4
      0000:82:00.2      2
      0000:01:00.0      6
      0000:c1:00.0      0
      0000:41:00.2      4
      0000:00:07.0      6
      0000:c0:07.0      0
      0000:40:04.0      4
      0000:00:00.2      6
      0000:82:00.0      2
      0000:80:01.0      2
      0000:c0:00.2      0
      0000:00:18.6      6
      0000:41:00.0      4
      0000:81:00.2      2
      0000:c0:01.5      0
      0000:40:07.0      4
      0000:00:00.0      6
      0000:80:04.0      2
      0000:c0:00.0      0
      0000:40:00.2      4
      0000:00:08.1      6
      0000:c0:08.1      0
      0000:00:18.4      6
      0000:02:00.3      6
      0000:81:00.0      2
      0000:00:03.0      6
      0000:80:07.0      2
      0000:c0:03.0      0
      0000:40:00.0      4
      0000:80:00.2      2
      0000:40:08.1      4
      0000:c5:00.2      0
      0000:00:18.2      6
      0000:00:01.1      6
      0000:42:00.3      4
      0000:c0:01.1      0
      0000:40:03.0      4
      0000:80:00.0      2
      0000:c5:00.0      0
      0000:00:18.0      6
      0000:80:08.1      2
      0000:42:00.1      4
      0000:40:01.1      4
      0000:c1:00.1      0
      0000:80:03.0      2
      0000:00:07.1      6
      0000:c0:07.1      0
      0000:80:01.1      2
      0000:00:02.0      6
      0000:00:18.7      6
      0000:c0:02.0      0
      0000:40:07.1      4
      0000:00:14.0      6
      0000:c3:00.2      0
      0000:00:05.0      6
      0000:00:18.5      6
      0000:c0:05.0      0
      0000:40:02.0      4
      numa_info              :
      node:    memsize    memfree    distances
         0:     10240       1505      10,11,11,11,11,11,11,11
         1:      8192       1918      11,10,11,11,11,11,11,11
         2:      8192       1932      11,11,10,11,11,11,11,11
         3:      8192        847      11,11,11,10,11,11,11,11
         4:      8192        912      11,11,11,11,10,11,11,11
         5:      8192        912      11,11,11,11,11,10,11,11
         6:      8192       1038      11,11,11,11,11,11,10,11
         7:      8179       1326      11,11,11,11,11,11,11,10
      xen_major              : 4
      xen_minor              : 13
      xen_extra              : .1-9.9.1
      xen_version            : 4.13.1-9.9.1
      xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
      xen_scheduler          : credit
      xen_pagesize           : 4096
      platform_params        : virt_start=0xffff800000000000
      xen_changeset          : 6278553325a9, pq 70d4b5941e4f
      xen_commandline        : dom0_mem=4304M,max:4304M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G console=vga vga=mode-0x0311 sched-gran=core
      cc_compiler            : gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
      cc_compile_by          : mockbuild
      cc_compile_domain      : [unknown]
      cc_compile_date        : Thu Feb  4 18:23:36 CET 2021
      build_id               : a76c6ee84d87600fa0d520cd8ecb8113b1105af4
      xend_config_format     : 4
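To see at a glance how the pCPUs are spread over the NUMA nodes, the cpu_topology table can be grouped by its node column. The following is a minimal sketch, not an official tool: the function name cpus_per_node is made up for illustration, and the sample text is just the first twelve rows of the table above.

```python
from collections import defaultdict

# Excerpt copied from the cpu_topology table above (nodes 0 and 1 only).
SAMPLE = """\
cpu:    core    socket     node
  0:       0        0        0
  1:       0        0        0
  2:       1        0        0
  3:       1        0        0
  4:       2        0        0
  5:       2        0        0
  6:       4        0        1
  7:       4        0        1
  8:       5        0        1
  9:       5        0        1
 10:       6        0        1
 11:       6        0        1
"""


def cpus_per_node(topology_text):
    """Map each NUMA node id to the list of pCPU ids that belong to it."""
    nodes = defaultdict(list)
    for line in topology_text.splitlines():
        fields = line.replace(":", " ").split()
        # Topology rows are exactly four integers: cpu, core, socket, node.
        # Headers, PCI device rows and numa_info rows fail this check.
        if len(fields) == 4 and all(f.isdigit() for f in fields):
            cpu, _core, _socket, node = map(int, fields)
            nodes[node].append(cpu)
    return dict(nodes)


for node, cpus in sorted(cpus_per_node(SAMPLE).items()):
    print(f"node {node}: {len(cpus)} pCPUs -> {cpus}")
```

On this host each node ends up with 6 pCPUs, i.e. 3 cores with 2 threads each.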
      

      I wonder if the CPU scheduler can do NUMA node in addition to core, CPU and socket?

      • ForzaF Offline
        Forza @Forza
        last edited by Forza

        @s-pam said in Best CPU performance settings for HP DL325/AMD EPYC servers?:

        I wonder if the CPU scheduler can do NUMA node in addition to core, CPU and socket?

        I'll answer myself here. It seems that Xen already does this by default:

        NUMA aware scheduling, as it has been included in Xen 4.3, means that it is possible for the vCPUs of a domain to just prefer to run on the pCPUs of some NUMA node. The vCPUs will still be allowed, though, to run on every pCPU, guaranteeing much more flexibility than having to use pinning.
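The difference between "prefers a node" and "is pinned to a node" can be illustrated with a toy model. This is only a sketch under simplifying assumptions: pick_pcpu and the sets below are hypothetical names, and the real Xen scheduler takes load, credits and much more into account.

```python
def pick_pcpu(idle, hard_affinity, soft_affinity):
    """Pick an idle pCPU for a vCPU.

    hard_affinity: pCPUs the vCPU is allowed to run on (pinning).
    soft_affinity: pCPUs the vCPU prefers (NUMA-aware placement).
    Returns None when no allowed pCPU is idle.
    """
    allowed = idle & hard_affinity
    preferred = allowed & soft_affinity
    if preferred:
        return min(preferred)   # stay on the preferred node when possible
    if allowed:
        return min(allowed)     # otherwise fall back to any allowed pCPU
    return None                 # the vCPU has to wait


ALL_PCPUS = set(range(48))
NODE0 = set(range(0, 6))        # pCPUs 0-5 belong to node 0 on this host

# Soft affinity only: node 0 is preferred, but other nodes remain usable.
print(pick_pcpu({3, 20}, ALL_PCPUS, NODE0))    # picks 3
print(pick_pcpu({20, 30}, ALL_PCPUS, NODE0))   # picks 20
# Hard pinning to node 0: the vCPU waits although pCPUs 20 and 30 are idle.
print(pick_pcpu({20, 30}, NODE0, NODE0))       # picks nothing (None)
```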

        • ForzaF Offline
          Forza @Forza
          last edited by Forza

          Sorry for spamming the thread. 🙂

          I have two identical servers (srv01 and srv02), each with an AMD EPYC 7402P 24-core CPU. On srv02 I enabled the "LLC as NUMA Node" BIOS option.

          I've done some quick benchmarks with Sysbench on Ubuntu 20.10 with 12 assigned cores. Command line: sysbench cpu run --threads=12

          It would seem that in this test the NUMA option is much faster, 194187 events vs 103769 events. Perhaps I am misunderstanding how sysbench works?

          b65ec3da-4b1d-430e-b90d-02542fe59552-image.png
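For reference, the two sysbench event counts quoted above can be turned into a speedup factor. This is a small sketch that assumes both runs used the same duration, so that more events means proportionally more work done; the function name speedup is made up for illustration.

```python
def speedup(baseline_events, tuned_events):
    """Ratio of events completed, tuned run over baseline run."""
    return tuned_events / baseline_events


NO_NUMA_EVENTS = 103769   # server without LLC as NUMA Node
NUMA_EVENTS = 194187      # server with LLC as NUMA Node

ratio = speedup(NO_NUMA_EVENTS, NUMA_EVENTS)
print(f"speedup: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% more events)")
```

That works out to about 1.87x, or roughly 87% more events in the same time.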

          With 7-zip the gain is much less, but still meaningful. A little slower in single-threaded performance but quite a bit faster in multi-threaded mode.
          f9592ee9-d327-4ce1-9e34-0ee86280d9e9-image.png

          • ForzaF Offline
            Forza
            last edited by

            I ran a simulation with Dassault's SIMULIA Abaqus FEA. The run time went down from 75 to 60 minutes, so a big win there too 😃

            • olivierlambertO Offline
              olivierlambert Vates 🪐 Co-Founder CEO
              last edited by

              It's not spam, it's interesting feedback 🙂 Never hesitate to share it!

              • ForzaF Offline
                Forza @olivierlambert
                last edited by

                @olivierlambert said in Best CPU performance settings for HP DL325/AMD EPYC servers?:

                It's not spam, it's interesting feedback 🙂 Never hesitate to share it!

                Thanks!

                The last benchmark is a real-world example. We have a master's thesis student who needs to run approximately 150 simulations as part of the program, and she is pretty thrilled to be saving several days on the run time 🙂
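A rough estimate of the pure compute time saved, as a sketch only: it assumes that every one of the roughly 150 simulations improves by the same 75 to 60 minutes as the single run measured earlier in the thread, which has not been verified. The function name is hypothetical.

```python
def total_saving_hours(before_min, after_min, runs):
    """Compute time saved over all runs, in hours."""
    return (before_min - after_min) * runs / 60


BEFORE_MIN = 75   # run time without LLC as NUMA Node
AFTER_MIN = 60    # run time with LLC as NUMA Node
RUNS = 150        # approximate number of simulations

reduction = (BEFORE_MIN - AFTER_MIN) / BEFORE_MIN
saved = total_saving_hours(BEFORE_MIN, AFTER_MIN, RUNS)
print(f"{reduction:.0%} shorter per run, {saved} hours of compute saved in total")
```

That is a 20% reduction per run and 37.5 hours of back-to-back compute; how that maps onto calendar days depends on how the runs are scheduled.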

                • olivierlambertO Offline
                  olivierlambert Vates 🪐 Co-Founder CEO
                  last edited by

                  That's great! I think @dthenot would be interested in reading this 😛

                  • dthenotD Offline
                    dthenot Vates 🪐 XCP-ng Team @Forza
                    last edited by

                    @s-pam Damn, computers are really magic. I'm very surprised by these results.
                    Does NONUMA really mean that the firmware provides no NUMA information?
                    I have no idea how the Xen scheduler uses this information. I do know that the memory allocator stripes the VM's memory across all the nodes the VM is configured to be allocated on. That would mean the scheduler is doing a good job of placing the vCPUs on nodes, without even knowing where the memory of the process currently running inside the guest is located.
                    Did you touch anything in the config of the guest? It's an interesting result nonetheless. Can you share the memory allocation of the VM? You can obtain it with xl debug-keys u; xl dmesg from the Dom0.
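If the two commands need to be run repeatedly, for example before and after a benchmark, they can be wrapped in a small helper. This is a sketch only: dump_numa_placement is a hypothetical name, it has to be run as root in Dom0, and it returns the whole hypervisor log unparsed because the exact layout of the NUMA dump is not shown in this thread.

```python
import subprocess


def dump_numa_placement(run=subprocess.run):
    """Send the 'u' debug key to Xen, then return the hypervisor log as text.

    The run parameter exists only so the helper can be tried out
    without a Xen host; by default it is subprocess.run.
    """
    # Ask Xen to dump its NUMA information into the hypervisor log.
    run(["xl", "debug-keys", "u"], check=True)
    # Read the hypervisor log back; the dump is appended at the end.
    result = run(["xl", "dmesg"], check=True, capture_output=True, text=True)
    return result.stdout
```

Calling print(dump_numa_placement()) in Dom0 would print the log, with the per-domain memory placement near the end.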

                    • ForzaF Offline
                      Forza @dthenot
                      last edited by Forza

                      @dthenot said in Best CPU performance settings for HP DL325/AMD EPYC servers?:

                      @s-pam Damn, computers are really magic. I'm very surprised by these results.
                      Does NONUMA really mean that the firmware provides no NUMA information?
                      I have no idea how the Xen scheduler uses this information. I do know that the memory allocator stripes the VM's memory across all the nodes the VM is configured to be allocated on. That would mean the scheduler is doing a good job of placing the vCPUs on nodes, without even knowing where the memory of the process currently running inside the guest is located.
                      Did you touch anything in the config of the guest? It's an interesting result nonetheless. Can you share the memory allocation of the VM? You can obtain it with xl debug-keys u; xl dmesg from the Dom0.

                      I can't look at the dmesg today as I'm home with a cold...🤧

                      The configuration of the two servers is identical, except that on the "NUMA" one I enabled Last-Level Cache as NUMA node in the BIOS. With this enabled, xl info -n now shows 8 NUMA nodes.

                      The VM was identical too. I just made a fast-clone and ran it in parallel on each server. I also migrated the VMs back and forth between the two servers to verify that the results were correct. 🙂

                      I did experiment with xl cpupool-numa-split, but this did not give good results for multithreaded workloads. I believe this is because VMs get locked to using only as many cores as there are in each NUMA domain, and this 7402P CPU has 24 cores with only 3 in each CCX.
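The arithmetic behind this can be sketched quickly. The figures come from the xl info -n output earlier in the thread (24 cores, 2 threads per core, 8 nodes) and from the 12-thread sysbench run; the function names are made up for illustration.

```python
import math

TOTAL_CORES = 24
THREADS_PER_CORE = 2
NUMA_NODES = 8
VM_VCPUS = 12             # the benchmark VM ran sysbench with 12 threads


def pcpus_per_node(total_cores, threads_per_core, nodes):
    """Number of pCPUs (hardware threads) that end up in each NUMA node."""
    return total_cores * threads_per_core // nodes


def nodes_needed(vcpus, pcpus_in_node):
    """Smallest number of nodes whose pCPUs cover the given vCPU count."""
    return math.ceil(vcpus / pcpus_in_node)


per_node = pcpus_per_node(TOTAL_CORES, THREADS_PER_CORE, NUMA_NODES)
print(f"pCPUs per node: {per_node}")
print(f"nodes needed for {VM_VCPUS} vCPUs: {nodes_needed(VM_VCPUS, per_node)}")
```

A pool built from a single node has only 6 pCPUs (3 cores), so a 12-thread workload confined to it is oversubscribed, whereas without the split it can spread over at least two nodes.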

                      • dthenotD Offline
                        dthenot Vates 🪐 XCP-ng Team @Forza
                        last edited by

                        @s-pam

                        I can't look at the dmesg today as I'm home with a cold...

                        I hope you get well soon 🙂

                        I did experiment with xl cpupool-numa-split but this did not generate good results for multithreaded workloads. I believe this is because VMs get locked to use only as many cores as there are in each NUMA domain.

                        Indeed, a VM in a pool gets locked to using only the cores of the pool, and its maximum number of vCPUs is the number of cores in the pool. It is useful if you need to isolate the VM completely.
                        You need to be careful when benchmarking these things, because the memory already allocated to a running VM is not moved, while the vCPUs will run on the pinned node. I don't remember exactly whether a cpupool behaves differently from simple pinning in that case, though. I do remember that hard-pinning a guest vCPU definitely did not move its memory; you could only change the memory placement before booting.
