XCP-ng

    Memory Consumption goes higher day by day

    Compute
    32 Posts 7 Posters 3.1k Views 7 Watching
    • dhiraj26683

      Re: Alert: Control Domain Memory Usage

      Yes, it's not solving the issue. We have updated to the latest ice drivers as well. We need to restart the node to release the memory, but after a couple of days it reaches the same state.

      We are on XCP-ng release 8.2.1

      Network cards:
      We have 1G Broadcom and Intel 10G/25G Ethernet/fiber cards. For now we are only using
      both 1G cards in a bond and a single fiber card with a 10G module.

      [19:03 xen-srv2 ~]$ lspci | grep -i eth
      04:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      04:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      32:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for SFP (rev 02)
      32:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for SFP (rev 02)
      32:00.2 Ethernet controller: Intel Corporation Ethernet Controller E810-C for SFP (rev 02)
      32:00.3 Ethernet controller: Intel Corporation Ethernet Controller E810-C for SFP (rev 02)
      b1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
      b1:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
      b2:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
      b2:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
      
      [19:05 xen-srv2 ~]$ ls -l /sys/class/net/eth6/device/driver
      lrwxrwxrwx 1 root root 0 Nov 22 18:05 /sys/class/net/eth6/device/driver -> ../../../../bus/pci/drivers/ice
      
      [19:06 xen-srv2 ~]$ modinfo ice| less
      filename:       /lib/modules/4.19.0+1/updates/drivers/net/ethernet/intel/ice/ice.ko
      firmware:       intel/ice/ddp/ice.pkg
      version:        1.11.14
      license:        GPL v2
      description:    Intel(R) Ethernet Connection E800 Series Linux Driver
      author:         Intel Corporation, <linux.nics@intel.com>
      srcversion:     ABA97333D32A1C8C8127E80
      alias:          pci:v00008086d00001888sv*sd*bc*sc*i*
      alias:          pci:v00008086d0000579Fsv*sd*bc*sc*i*
      alias:          pci:v00008086d0000579Esv*sd*bc*sc*i*
      alias:          pci:v00008086d0000579Dsv*sd*bc*sc*i*
      alias:          pci:v00008086d0000579Csv*sd*bc*sc*i*
      alias:          pci:v00008086d0000151Dsv*sd*bc*sc*i*
      alias:          pci:v00008086d0000124Fsv*sd*bc*sc*i*
      alias:          pci:v00008086d0000124Esv*sd*bc*sc*i*
      alias:          pci:v00008086d0000124Dsv*sd*bc*sc*i*
      alias:          pci:v00008086d0000124Csv*sd*bc*sc*i*
      alias:          pci:v00008086d0000189Asv*sd*bc*sc*i*
      alias:          pci:v00008086d00001899sv*sd*bc*sc*i*
      alias:          pci:v00008086d00001898sv*sd*bc*sc*i*
      alias:          pci:v00008086d00001897sv*sd*bc*sc*i*
      alias:          pci:v00008086d00001894sv*sd*bc*sc*i*
      alias:          pci:v00008086d00001893sv*sd*bc*sc*i*
      alias:          pci:v00008086d00001892sv*sd*bc*sc*i*
      alias:          pci:v00008086d00001891sv*sd*bc*sc*i*
      alias:          pci:v00008086d00001890sv*sd*bc*sc*i*
      alias:          pci:v00008086d0000188Esv*sd*bc*sc*i*
      alias:          pci:v00008086d0000188Dsv*sd*bc*sc*i*
      alias:          pci:v00008086d0000188Csv*sd*bc*sc*i*
      alias:          pci:v00008086d0000188Bsv*sd*bc*sc*i*
      alias:          pci:v00008086d0000188Asv*sd*bc*sc*i*
      alias:          pci:v00008086d0000159Bsv*sd*bc*sc*i*
      alias:          pci:v00008086d0000159Asv*sd*bc*sc*i*
      alias:          pci:v00008086d00001599sv*sd*bc*sc*i*
      alias:          pci:v00008086d00001593sv*sd*bc*sc*i*
      alias:          pci:v00008086d00001592sv*sd*bc*sc*i*
      alias:          pci:v00008086d00001591sv*sd*bc*sc*i*
      depends:        devlink,intel_auxiliary
      retpoline:      Y
      name:           ice
      vermagic:       4.19.0+1 SMP mod_unload modversions
      parm:           debug:netif level (0=none,...,16=all) (int)
      parm:           fwlog_level:FW event level to log. All levels <= to the specified value are enabled. Values: 0=none, 1=error, 2=warning, 3=normal, 4=verbose. Invalid values: >=5
       (ushort)
      parm:           fwlog_events:FW events to log (32-bit mask)
       (ulong)
      
      [19:07 xen-srv2 ~]$ ethtool eth6
      Settings for eth6:
              Supported ports: [ FIBRE ]
              Supported link modes:   1000baseT/Full
                                      25000baseCR/Full
                                      25000baseSR/Full
                                      1000baseX/Full
                                      10000baseCR/Full
                                      10000baseSR/Full
                                      10000baseLR/Full
              Supported pause frame use: Symmetric
              Supports auto-negotiation: No
              Supported FEC modes: None
              Advertised link modes:  25000baseSR/Full
                                      10000baseSR/Full
              Advertised pause frame use: No
              Advertised auto-negotiation: No
              Advertised FEC modes: None RS
              Speed: 10000Mb/s
              Duplex: Full
              Port: FIBRE
              PHYAD: 0
              Transceiver: internal
              Auto-negotiation: off
      Cannot get wake-on-lan settings: Operation not permitted
              Current message level: 0x00000007 (7)
                                     drv probe link
              Link detected: yes
      

      We have two servers with similar configurations (XEN-SRV1 and XEN-SRV2). The control domain memory fills up each time, so we increased the control domain memory to 128G on both hosts. Strangely, that too is getting used up. The VMs are not using that much memory, though.

      Not sure why it grabs the memory and keeps it in cache, whereas actual usage is not that high.

      XEN-SRV2

      [19:07 xen-srv2 ~]$ free -m
                    total        used        free      shared  buff/cache   available
      Mem:         128301        3012       78844          17       46444      123907
      Swap:          1023           0        1023
      

      XEN-SRV1

      [14:48 xen-srv1 ~]$ free -m
                    total        used        free      shared  buff/cache   available
      Mem:         128301        2865       34909          18       90525      124068
      Swap:          1023           0        1023
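
To make the two `free -m` outputs above easier to compare, here is a minimal Python sketch that parses the `Mem:` row and expresses `used`, `buff/cache` and `available` as a share of total memory. The function names `parse_free` and `summarize` are made up for this illustration; only the numbers come from the outputs posted above.

```python
def parse_free(output: str) -> dict:
    """Parse the 'Mem:' row of `free -m` output into {column: MiB}."""
    lines = [ln for ln in output.strip().splitlines() if ln.strip()]
    header = lines[0].split()  # total, used, free, shared, buff/cache, available
    for ln in lines[1:]:
        if ln.strip().startswith("Mem:"):
            values = [int(v) for v in ln.split()[1:]]
            return dict(zip(header, values))
    raise ValueError("no 'Mem:' row found")


def summarize(mem: dict) -> dict:
    """Return selected columns as a percentage of total memory."""
    total = mem["total"]
    return {k: round(100.0 * mem[k] / total, 1)
            for k in ("used", "buff/cache", "available")}


# Outputs copied from the post above.
XEN_SRV2 = """\
              total        used        free      shared  buff/cache   available
Mem:         128301        3012       78844          17       46444      123907
Swap:          1023           0        1023
"""

XEN_SRV1 = """\
              total        used        free      shared  buff/cache   available
Mem:         128301        2865       34909          18       90525      124068
Swap:          1023           0        1023
"""

if __name__ == "__main__":
    for name, text in (("xen-srv2", XEN_SRV2), ("xen-srv1", XEN_SRV1)):
        print(name, summarize(parse_free(text)))
```

With the figures posted above, this reports roughly 2% of dom0 memory as `used` on both hosts, `buff/cache` at about 36% on XEN-SRV2 and about 71% on XEN-SRV1, and about 97% still counted as `available` on each.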
      

      Attaching memory usage graphs as well: xen-srv2.png, xen-srv1.png

      • dhiraj26683 @dhiraj26683

        @dhiraj26683 [two images attached]

        • olivierlambert Vates 🪐 Co-Founder CEO

          What could be interesting is to know which process is leaking in the dom0 🙂

          • john.c

            In which case it's time to run the "top" command on both hypervisor servers.

            • dhiraj26683 @john.c

              @john-c @olivierlambert Attaching the htop and top output from both servers: xen-srv1-top.png, xen-srv1-htop.png, xen-srv2-top.png, xen-srv2-htop.png

              • dhiraj26683 @dhiraj26683

                @dhiraj26683 Providing the output of the commands below:
                slabtop -o -s c
                cat /proc/meminfo

                [four images attached]
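
For readers who want the slab figures as text rather than screenshots, here is a rough Python sketch that ranks slab caches by size from `/proc/slabinfo`, similar in spirit to `slabtop -o -s c`. It approximates each cache's size as `num_objs × objsize`, which may differ slightly from what `slabtop` reports. The function name `top_slab_caches` and the `SAMPLE` data are invented for illustration and are not taken from these hosts.

```python
def top_slab_caches(slabinfo_text: str, limit: int = 5):
    """Return [(cache_name, approx_bytes)] sorted by size, largest first."""
    caches = []
    for line in slabinfo_text.splitlines():
        # Skip the version line, the column header and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        # Columns: name, active_objs, num_objs, objsize, ...
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        caches.append((name, num_objs * objsize))
    caches.sort(key=lambda c: c[1], reverse=True)
    return caches[:limit]


# Made-up sample in /proc/slabinfo layout (NOT data from the hosts above).
SAMPLE = """\
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dentry              900   1000    192   21    1 : tunables    0    0    0 : slabdata     48     48      0
inode_cache         400    500    600   13    2 : tunables    0    0    0 : slabdata     39     39      0
kmalloc-64          100    200     64   64    1 : tunables    0    0    0 : slabdata      4      4      0
"""

if __name__ == "__main__":
    try:
        # Usually readable by root only.
        with open("/proc/slabinfo") as fh:
            text = fh.read()
    except OSError:
        text = SAMPLE
    for name, size in top_slab_caches(text):
        print(f"{name:<24} {size / 1024:10.1f} KiB")
```

If one cache keeps growing between runs, its name is usually a better lead than the overall `buff/cache` figure.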

                • dhiraj26683 @dhiraj26683

                  @dhiraj26683 Providing the ixgbe module info and RPM info for both servers. It's the stock driver that came with the install.

                  [13:59 xen-srv2 Dell-Drivers]$ modinfo ixgbe
                  filename: /lib/modules/4.19.0+1/updates/ixgbe.ko
                  version: 5.9.4
                  license: GPL
                  description: Intel(R) 10GbE PCI Express Linux Network Driver
                  author: Intel Corporation, linux.nics@intel.com
                  srcversion: AA8061C6A752528BD6CFE19

                  [13:45 xen-srv1 ~]$ modinfo ixgbe
                  filename: /lib/modules/4.19.0+1/updates/ixgbe.ko
                  version: 5.9.4
                  license: GPL
                  description: Intel(R) 10GbE PCI Express Linux Network Driver
                  author: Intel Corporation, linux.nics@intel.com
                  srcversion: AA8061C6A752528BD6CFE19

                  We also tried updating the ice module to the versions below:
                  ice-1.10.1.2.2
                  ice-1.12.7

                  The behaviour was the same, so we downloaded the ice drivers from Dell and installed the available version, given below. It's still the same.
                  ice-1.11.14

                  • dhiraj26683 @dhiraj26683

                    @dhiraj26683
                    [14:26 xen-srv2 Dell-Drivers]$ rpm -qf /lib/modules/4.19.0+1/updates/ixgbe.ko
                    intel-ixgbe-5.9.4-1.xcpng8.2.x86_64

                    • dhiraj26683 @dhiraj26683

                      @dhiraj26683 We tried to find the process, but nothing could be identified as such. Only three guests are running on this server, and it has almost reached the limit. Once all the memory goes into cache, we start getting notifications/alerts that Control Domain Load reached 100% and that there may be service degradation.
                      [image attached]

                      • olivierlambert Vates 🪐 Co-Founder CEO

                        Do you have any extra stuff installed in your Dom0? It's very important to know it.

                        • dhiraj26683 @dhiraj26683

                          @dhiraj26683 Thanks for replying, @olivierlambert.
                          Nothing as such, other than the ice drivers.

                          We have not been running any virtual GPU workstations for the last 3-4 months, so that kind of load is not present on any of our XCP hosts.

                          As far as I can tell, this memory issue started recently, and the only changes we make are pushing patches via XOA.

                          Given this issue, where memory gets fully utilized (goes into cache) and notifications start about Control Domain Load reaching 100%, we have not pushed any patches for now.

                          [image attached]

                          • olivierlambert Vates 🪐 Co-Founder CEO

                            Let me ping @psafont in case he has an idea of what could cause this.

                            edit: also @gduperrey, in case he has an idea how to see what's eating all the memory

                            • yann Vates 🪐 XCP-ng Team @dhiraj26683

                              @dhiraj26683 the cached memory is not used by any particular process; it is used to keep e.g. recently accessed files in memory, to avoid reading them again from disk if the need arises. The OS is trying to make good use of otherwise-unused memory in the hope of better performance, instead of letting it sit idle.

                              If you launch a new process that would require more memory than what's currently free, the OS should happily free old cached pages for immediate reuse.
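
The point about reclaimable cache can be checked from `/proc/meminfo`. The Python sketch below only approximates how `free` derives `buff/cache` (here `Buffers + Cached + SReclaimable`; the exact slab term depends on the procps version) and compares it with the kernel's own `MemAvailable` estimate. The function names and the `SAMPLE` values are hypothetical, chosen for illustration only.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value}, values in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[key.strip()] = int(parts[0])
    return info


def cache_breakdown(info: dict) -> dict:
    """Approximate buff/cache and report how much memory is still obtainable."""
    buff_cache = info["Buffers"] + info["Cached"] + info.get("SReclaimable", 0)
    return {
        "buff_cache_kb": buff_cache,
        "free_kb": info["MemFree"],
        # Kernel estimate of memory available to new workloads without swapping.
        "available_kb": info["MemAvailable"],
    }


# Hypothetical values for illustration (NOT taken from the hosts in this thread).
SAMPLE = """\
MemTotal:        1000000 kB
MemFree:          200000 kB
MemAvailable:     900000 kB
Buffers:           50000 kB
Cached:           600000 kB
SReclaimable:     100000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            text = fh.read()
    except OSError:
        text = SAMPLE
    for key, value in cache_breakdown(parse_meminfo(text)).items():
        print(f"{key:<14} {value:>12} kB")
```

When `available_kb` stays close to the total while `free_kb` shrinks, the memory has moved into cache that the kernel expects to be able to give back, which is the behaviour described above.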

                              Did you observe anything specifically wrong, that turned you to observing memory consumption?

                              • dhiraj26683 @olivierlambert

                                @olivierlambert I believe it's something related to the NIC drivers, as we are running network-intensive guests on both servers.

                                We have a third server, which is running standalone. Below is its config; only one guest runs on this host, which is XOA.

                                CPU - AMD Ryzen Threadripper PRO 3975WX 32-Cores 3500 MHz
                                Memory - 320G
                                Ethernet - 1G Ethernet
                                10G Fiber
                                [image attached]
                                intel-ixgbe-5.9.4-1.xcpng8.2.x86_64

                                XOA uses the 10G Ethernet for backup/migration operations. This host does not seem to be caching as much memory, but it is caching. It does not end up putting all memory into cache, because fewer operations happen here.

                                [image attached]

                                • dhiraj26683 @yann

                                  @yann Hello, I am well aware of the caching. But the question is: which process is using the memory that accumulates in cache and ends up reaching 120G?

                                  • stormi Vates 🪐 XCP-ng Team @dhiraj26683

                                    @dhiraj26683 Would you like to try a newer ixgbe? We've got 5.18.6 available in our repositories.

                                    • yann Vates 🪐 XCP-ng Team @dhiraj26683

                                      @dhiraj26683 if it were used by a process, it would be counted in used, not in buff/cache. Those pages are used by the kernel's Virtual Filesystem subsystem.

                                      Now if your problem is that a given process fails to allocate memory while there is so much of the memory in buff/cache, then there may be something to dig in that direction, but we'll need specific symptoms to be able to help.
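
One way to cross-check the `used` figure against actual processes is to list per-process resident memory from `/proc/<pid>/status`. The following is a minimal Python sketch under that assumption; the helper names `parse_vmrss_kb` and `rss_by_process` are made up here. Note that summing `VmRSS` over processes double-counts shared pages, so the total is only an upper-bound estimate.

```python
import os


def parse_vmrss_kb(status_text: str) -> int:
    """Return VmRSS in kB from /proc/<pid>/status text (0 for kernel threads)."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0


def rss_by_process(proc_root: str = "/proc"):
    """Return [(rss_kb, pid, name)] for all processes, largest RSS first."""
    usage = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited or is not readable
        name = "?"
        for line in text.splitlines():
            if line.startswith("Name:"):
                name = line.partition(":")[2].strip()
                break
        usage.append((parse_vmrss_kb(text), int(entry), name))
    usage.sort(reverse=True)
    return usage


if __name__ == "__main__":
    try:
        procs = rss_by_process()
    except OSError:
        procs = []  # no /proc on this system
    for rss_kb, pid, name in procs[:10]:
        print(f"{rss_kb / 1024:10.1f} MiB  {pid:>7}  {name}")
    print(f"sum of RSS: {sum(p[0] for p in procs) / 1024:.1f} MiB")
```

If the sum stays in the low gigabytes while `buff/cache` grows, that is consistent with the memory being held by the kernel's caches rather than by any process.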

                                      • dhiraj26683 @stormi

                                        @stormi Sure, we can try that. Thank you

                                        • stormi Vates 🪐 XCP-ng Team @dhiraj26683

                                          @dhiraj26683 It's available as the intel-ixgbe-alt RPM, that you can install with yum install.

                                          However, I second Yann's comment: growing cache usage is not an issue, as long as it's reclaimed when another process needs more than what's available, and this is what should happen whenever such a need arises. Unless you have evidence of actual issues caused by this cache usage.

                                          • dhiraj26683 @yann

                                            @yann I can understand the buff/cache part, but on this server, which has 1TB of physical memory and only three VMs running with 8G, 32G and 64G allotted to them, eating up all the memory and putting it into cache is not understandable. If it's getting cached, something must be using it. Not sure if that makes sense, though.

                                            Initially both our XCP hosts had 16G of control domain memory. When we started to face issues and alerts, we increased it to 32G, then 64G, and then 128G, and it has been like that for a while now.

                                            Now that we are not using vGPU, it no longer fills up within 2 days to the point where alerts start saying the control domain memory has reached its limit.
