XCP-ng

    Alert: Control Domain Memory Usage

      stormi Vates 🪐 XCP-ng Team @dave

      @dave So, at this point our theories are:

      • dom0 memory ballooning
      • a kernel memory leak
      • each of us being really bad at understanding RAM usage in dom0 🤔

    Can you share the contents of your grub.cfg, the line starting with "Domain-0" in the output of xl top, and the output of xe vm-param-list uuid={YOUR_DOM0_VM_UUID} | grep memory?
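    Roughly something like this should gather all three (a rough sketch; the grub.cfg path may differ on UEFI hosts, and xentop in batch mode stands in for the interactive xl top):

      # grub configuration used at boot (UEFI hosts may keep it under /boot/efi/EFI/xenserver/ instead)
      cat /boot/grub/grub.cfg
      # one batch sample of xentop, keeping only the dom0 line
      xentop -b -i 1 | grep Domain-0
      # look up the dom0 VM UUID, then dump its memory parameters
      DOM0_UUID=$(xe vm-list is-control-domain=true --minimal)
      xe vm-param-list uuid=$DOM0_UUID | grep memory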

        inaki.martinez

        @stormi this is the current ps aux: ps-aux.txt
        @r1 The sar file is too big to attach here, but here is a link: sar.txt (valid for a day), and the kernel OOM message too: messages.txt. From what I can see, only around 3GB were accounted for when the OOM killer was triggered (Dom0 has 8GB of memory available).
        In this case rsyslog was killed but I have seen xapi killed on other occasions. I can dig up the logs if they can help.

          inaki.martinez

          @stormi

          • grub.cfg grub.txt
          • xl top for Dom0
            Domain-0 -----r 5461432 0.0 8388608 1.6 8388608 1.6 16 0 0 0 0 0 0 0 0 0 0
          • xe param list for Dom0 (memory)
                                   memory-target ( RO): <unknown>
                                 memory-overhead ( RO): 118489088
                               memory-static-max ( RW): 8589934592
                              memory-dynamic-max ( RW): 8589934592
                              memory-dynamic-min ( RW): 8589934592
                               memory-static-min ( RW): 4294967296
                                last-boot-record ( RO): '('struct' ('uuid' '5e1386d5-e2c9-47eb-8445-77674d76c803') ('allowed_operations' ('array')) ('current_operations' ('struct')) ('power_state' 'Running') ('name_label' 'Control domain on host: bc2-vi-srv03') ('name_description' 'The domain which manages physical devices and manages other domains') ('user_version' '1') ('is_a_template' ('boolean' '0')) ('is_default_template' ('boolean' '0')) ('suspend_VDI' 'OpaqueRef:NULL') ('resident_on' 'OpaqueRef:946c6678-044a-62ab-2a98-f8c93e34ade9') ('affinity' 'OpaqueRef:946c6678-044a-62ab-2a98-f8c93e34ade9') ('memory_overhead' '84934656') ('memory_target' '4294967296') ('memory_static_max' '4294967296') ('memory_dynamic_max' '4294967296') ('memory_dynamic_min' '4294967296') ('memory_static_min' '4294967296') ('VCPUs_params' ('struct')) ('VCPUs_max' '48') ('VCPUs_at_startup' '48') ('actions_after_shutdown' 'destroy') ('actions_after_reboot' 'destroy') ('actions_after_crash' 'destroy') ('consoles' ('array' 'OpaqueRef:aa16584e-48c6-70a3-98c0-a2ee63b3cfa4' 'OpaqueRef:01efe105-d6fe-de5e-e214-9c6e2b5be498')) ('VIFs' ('array')) ('VBDs' ('array')) ('crash_dumps' ('array')) ('VTPMs' ('array')) ('PV_bootloader' '') ('PV_kernel' '') ('PV_ramdisk' '') ('PV_args' '') ('PV_bootloader_args' '') ('PV_legacy_args' '') ('HVM_boot_policy' '') ('HVM_boot_params' ('struct')) ('HVM_shadow_multiplier' ('double' '1')) ('platform' ('struct')) ('PCI_bus' '') ('other_config' ('struct' ('storage_driver_domain' 'OpaqueRef:166e5128-4906-05cc-bb8d-ec99a3c13dc0') ('is_system_domain' 'true'))) ('domid' '0') ('domarch' 'x64') ('last_boot_CPU_flags' ('struct')) ('is_control_domain' ('boolean' '1')) ('metrics' 'OpaqueRef:2207dad4-d07f-d7f9-9ebb-796072aa37e1') ('guest_metrics' 'OpaqueRef:NULL') ('last_booted_record' '') ('recommendations' '') ('xenstore_data' ('struct')) ('ha_always_run' ('boolean' '0')) ('ha_restart_priority' '') ('is_a_snapshot' ('boolean' '0')) ('snapshot_of' 'OpaqueRef:NULL') ('snapshots' ('array')) ('snapshot_time' ('dateTime.iso8601' '19700101T00:00:00Z')) ('transportable_snapshot_id' '') ('blobs' ('struct')) ('tags' ('array')) ('blocked_operations' ('struct')) ('snapshot_info' ('struct')) ('snapshot_metadata' '') ('parent' 'OpaqueRef:NULL') ('children' ('array')) ('bios_strings' ('struct')) ('protection_policy' 'OpaqueRef:NULL') ('is_snapshot_from_vmpp' ('boolean' '0')) ('snapshot_schedule' 'OpaqueRef:NULL') ('is_vmss_snapshot' ('boolean' '0')) ('appliance' 'OpaqueRef:NULL') ('start_delay' '0') ('shutdown_delay' '0') ('order' '0') ('VGPUs' ('array')) ('attached_PCIs' ('array')) ('suspend_SR' 'OpaqueRef:NULL') ('version' '0') ('generation_id' '') ('hardware_platform_version' '0') ('has_vendor_device' ('boolean' '0')) ('requires_reboot' ('boolean' '0')) ('reference_label' ''))'
                                          memory (MRO): <not in database>
          
          
            stormi Vates 🪐 XCP-ng Team @inaki.martinez

            @inaki-martinez According to this log, 2GB of Resident Set Size was freed by killing rsyslog. This is a lot for such a system service.
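            For anyone else following along, the full OOM report should be in the kernel log; roughly (log path assumed for a CentOS-based dom0):

              # locate the OOM killer report and the per-process memory table it dumps
              grep -i -B5 -A40 'out of memory' /var/log/messages
              # the "Killed process" line reports the victim's anon-rss/file-rss in kB
              dmesg | grep -i 'killed process'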

              JeffBerntsen Top contributor @stormi

              @stormi I seem to remember running across a similar problem on a RHEL system. Since XCP-ng is based on CentOS, which is pretty much the same thing, could it be related to this: https://bugzilla.redhat.com/show_bug.cgi?id=1663267

                stormi Vates 🪐 XCP-ng Team @inaki.martinez

                @JeffBerntsen This could indeed be it. The advisory for the fix is https://access.redhat.com/errata/RHSA-2020:1000. I'll consider a backport.

                @inaki-martinez I think dom0 memory ballooning (if that is even a thing... I need to confirm) is ruled out in your case. The sum of the RSS values across all processes (a simplistic estimate that overcounts because of shared memory) is around 1.5GB, which leaves more than 4.5GB unexplained.
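                If anyone wants to reproduce that estimate, a rough way to sum RSS across all processes (it overcounts shared pages):

                  # total RSS reported by ps (KiB column), converted to GiB
                  ps -eo rss= | awk '{sum += $1} END {printf "%.2f GiB\n", sum/1024/1024}'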

                  olivierlambert Vates 🪐 Co-Founder CEO

                  RHSA-2020:1000 is an interesting lead, indeed 🙂

                    daKju

                    @stormi
                    I have the problem on a pool master with 2 running VMs that is raising memory alerts.
                    Here is some info; maybe you will find something.

                    slabtop.txt
                    xehostparamlist.txt
                    xltop.txt
                    meminfo.txt
                    top.txt
                    grub.cfg.txt

                    Sorry, I can't add images; it seems something is broken with some node modules.

                      stormi Vates 🪐 XCP-ng Team @daKju

                      @daKju Thanks. What version of XCP-ng? Does restarting the rsyslog or openvswitch service release RAM?
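                      Roughly something like this would show it (a sketch; service names assumed for an XCP-ng 8.x dom0):

                        # note free memory, restart the suspect service, compare
                        free -m
                        systemctl restart rsyslog
                        free -m
                        # for openvswitch: systemctl restart openvswitch (expect a brief network interruption)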

                        daKju @stormi

                        @stormi
                        We have 8.1.
                        I haven't restarted the services yet. Can the openvswitch service be restarted safely, without any impact?
                        Nothing changed after restarting rsyslog.

                          stormi Vates 🪐 XCP-ng Team @daKju

                          @daKju I must admit I can't guarantee that it is perfectly safe. It will at least cause a brief network interruption.

                            dave

                            Don't restart openvswitch if you have active iSCSI storage attached.

                              stormi Vates 🪐 XCP-ng Team @dave

                              @dave since you're here, can you share the contents of your grub.cfg, the line starting with "Domain-0" in the output of xl top, and the output of xe vm-param-list uuid={YOUR_DOM0_VM_UUID} | grep memory?

                              And if your offer for remote access to a server to try and find where the missing memory is being used still stands, I'm interested.

                                stormi Vates 🪐 XCP-ng Team

                                Another lead, although quite old: https://serverfault.com/questions/520490/very-high-memory-usage-but-not-claimed-by-any-process

                                In that situation the memory was seemingly taken by operations related to LVM, and stopping all LVM operations released the memory. Not easy to test in production though.
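                                In the meantime, the kernel's own accounting can show memory that no process claims; a quick sketch:

                                  # kernel-side allocations that ps/top will not attribute to any process
                                  grep -E 'Slab|SReclaimable|SUnreclaim|VmallocUsed|PageTables' /proc/meminfo
                                  # largest slab caches, single snapshot sorted by cache size
                                  slabtop -o -s c | head -n 20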

                                  dave

                                  Current Top:

                                  top - 15:38:00 up 62 days,  4:22,  2 users,  load average: 0.06, 0.08, 0.08
                                  Tasks: 295 total,   1 running, 188 sleeping,   0 stopped,   0 zombie
                                  %Cpu(s):  0.6 us,  0.0 sy,  0.0 ni, 99.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                  KiB Mem : 12210160 total,  3596020 free,  7564312 used,  1049828 buff/cache
                                  KiB Swap:  1048572 total,  1048572 free,        0 used.  4420052 avail Mem
                                  
                                    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
                                   2516 root      20   0  888308 123224  25172 S   0.0  1.0 230:49.43 xapi
                                   1947 root      10 -10  712372  89348   9756 S   0.0  0.7 616:24.82 ovs-vswitc+
                                   1054 root      20   0  102204  30600  15516 S   0.0  0.3  23:00.23 message-sw+
                                   2515 root      20   0  493252  25388  12884 S   0.0  0.2 124:03.44 xenopsd-xc
                                   2527 root      20   0  244124  25128   8952 S   0.0  0.2   0:24.59 python
                                   1533 root      20   0  277472  23956   7928 S   0.0  0.2 161:16.62 xcp-rrdd
                                   2514 root      20   0   95448  19204  11588 S   0.0  0.2 104:18.98 xapi-stora+
                                   1069 root      20   0   69952  17980   9676 S   0.0  0.1   0:23.74 varstored-+
                                   2042 root      20   0  138300  17524   9116 S   0.0  0.1  71:06.89 xcp-networ+
                                   2524 root      20   0  211832  17248   7728 S   0.0  0.1   8:15.16 python
                                   2041 root      20   0  223856  16836   7840 S   0.0  0.1   0:00.28 python
                                  26502 65539     20   0  334356  16236   9340 S   0.0  0.1 603:42.74 qemu-syste+
                                   5724 65540     20   0  208404  15400   9240 S   0.0  0.1 469:19.79 qemu-syste+
                                   2528 root      20   0  108192  14760  10284 S   0.0  0.1   0:00.01 xapi-nbd
                                   9482 65537     20   0  316948  14204   9316 S   0.0  0.1 560:47.71 qemu-syste+
                                  24445 65541     20   0  248332  13704   9124 S   0.0  0.1  90:45.58 qemu-syste+
                                   1649 root      20   0   62552  13340   6172 S   0.0  0.1  60:28.97 xcp-rrdd-x+
                                  

                                  Requested Files:

                                  xl top.txt
                                  dom0 param list.txt
                                  grub.cfg.txt

                                    dave @stormi

                                    @stormi Usually I migrate all VMs from affected hosts to others when memory is nearly full, but that does not free any memory. Could LVM operations still be happening with no VMs running?

                                      stormi Vates 🪐 XCP-ng Team @dave

                                      @dave I'm not able to tell.

                                      However, this all looks like a memory leak in a kernel driver or module. Maybe we should try to find a common pattern between the affected hosts, by looking at the output of lsmod to know which modules are loaded.

                                        beshleman

                                        Recompiling the kernel with kmemleak might be the fastest route to finding a solution here. Unfortunately, this obviously requires a reboot and will incur some performance hit while kmemleak is enabled (likely not an option for some systems).

                                        https://www.kernel.org/doc/html/latest/dev-tools/kmemleak.html
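                                        For reference, once the kernel is rebuilt with CONFIG_DEBUG_KMEMLEAK, usage looks roughly like this (per the documentation above):

                                          # kmemleak is exposed through debugfs
                                          mount -t debugfs nodev /sys/kernel/debug/   # if not already mounted
                                          echo scan > /sys/kernel/debug/kmemleak      # trigger an immediate scan
                                          cat /sys/kernel/debug/kmemleak              # list suspected leaks with backtraces
                                          echo clear > /sys/kernel/debug/kmemleak     # discard the current results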

                                          daKju @dave

                                          @dave Thanks for the hint. Yes, we have iSCSI SRs attached.

                                            stormi Vates 🪐 XCP-ng Team

                                            So, the most probable cause of the growing memory usage people in this thread are seeing is a memory leak in the Linux kernel, most likely in a driver module.

                                            Could you all share the output of lsmod so that we can try to identify a common factor between all affected hosts?
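                                            For example, captured to a file so it can be attached here (the hostname in the filename is just a suggestion):

                                              lsmod > lsmod-$(hostname -s).txt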
