Alert: Control Domain Memory Usage

olivierlambert

Nice to see Citrix are also getting to the same conclusions

edit: thanks @fasterfourier for your feedback!

fasterfourier

Official Citrix update has been posted: https://support.citrix.com/article/CTX306529

olivierlambert

\o/

What I still find really weird is that we had reports of the issue long before Citrix did. And we had roughly 10 people affected, while Citrix got only 1 report.

fasterfourier

@olivierlambert

Probably plenty of Citrix customers were affected, but they would rather reboot on schedule than spend months working through the support process

olivierlambert

haha that might be the answer indeed…

JCastang

Hello,

Has this fix been released, or is it still to be released?

stormi

@jcastang It is being tested and you can join the effort: yum update intel-ixgbe --enablerepo=xcp-ng-testing. The results are very good, I just want a bit more feedback.

JCastang

@stormi Ok, I will update one of our pools and get some results.

JCastang

@delaf Can you point me to the tool you are using to get memory graphs? (I want to check my upgraded pool.)
I was searching in Advanced live telemetry with no luck.

olivierlambert

Netdata will only give you the last hour.

If you want longer metrics, you need to send the data in Prometheus/Grafana.

delaf

@jcastang we are using a netdata/prometheus/grafana stack.

@olivierlambert you can change the retention method and keep much more data in netdata. There is also (since netdata 1.18, I think) a dbengine that allows you to store data on disk.

delaf

PS: we are not using the netdata config from "Advanced telemetry": we are installing our own netdata config.

stormi

dbengine is a bit dangerous on dom0. There used to be a bug where it would keep growing forever, so I don't trust it anymore.

delaf

@stormi oh, I did not know that, as I have never used it: I only know that it exists.

delaf

@stormi Hello, a few weeks later, I can confirm that the problem is solved here by using intel-ixgbe.x86_64@5.5.2-2.1.xcpng8.1 or intel-ixgbe.x86_64@5.5.2-2.1.xcpng8.2
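
For anyone who wants to check a host against the versions above, here is a minimal sketch. The function name version_ge is made up for illustration, it relies on GNU sort -V rather than real RPM version comparison, and it assumes that builds newer than 5.5.2-2.1 keep the fix, which this thread does not confirm.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Uses GNU "sort -V" (version sort), which only approximates RPM ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Example use on a host (not run here): compare the installed package version.
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' intel-ixgbe)
#   version_ge "${installed}" "5.5.2-2.1" && echo "fixed driver installed"
```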

delaf

PS: I'm using these 2 scripts to list the driver versions of all interfaces across our servers:

$ cat get_network_drivers_info.sh
#!/bin/bash

format="| %-13.13s | %-20.20s | %-20.20s | %-10.10s | %-7.7s | %-10.10s | %-30.30s | %-s \n"
printf "${format}" "date" "hostname" "OS" "interface" "driver" "version" "firmware" "yum"
printf "${format}" "----------------------------" "----------------------------" "----------------------------" "----------------------------" "----------------------------" "----------------------------" "----------------------------" "----------------------------"

if [ $# -gt 0 ]; then
    # Use the hosts given on the command line, in the order given.
    servers=("$@")
else
    # Otherwise take every host address from host.json, minus one excluded host.
    servers=($(jq -r '.[] | .address' host.json | grep -Ev '^192\.168\.124\.9$'))
fi

for line in "${servers[@]}"; do
    scp get_network_drivers_info.sh.tpl "${line}:/tmp/get_network_drivers_info.sh" > /dev/null 2>&1
    # -n keeps ssh from reading stdin; report hosts where the remote script failed.
    if ! ssh -n "${line}" bash /tmp/get_network_drivers_info.sh 2> /dev/null; then
        echo "${line} fail" >&2
    fi
done

$ cat get_network_drivers_info.sh.tpl
#!/bin/bash

format="| %-13.13s | %-20.20s | %-20.20s | %-10.10s | %-7.7s | %-10.10s | %-30.30s | %-s \n"
d=$(date '+%Y%m%d-%H%M')
name=$(hostname)
cd /sys/class/net/ || exit 1
# Physical NICs are the entries whose symlink target goes through a PCI device.
for interface in $(ls -l /sys/class/net/ | awk '/\/pci/ {print $9}'); do
    version=$(ethtool -i "${interface}" | awk '/^version:/ {$1=""; print}')
    firmware=$(ethtool -i "${interface}" | awk '/^firmware-version:/ {$1=""; print}')
    driver=$(ethtool -i "${interface}" | awk '/^driver:/ {$1=""; print}')
    # Only query the package list on hosts that actually have yum.
    if command -v yum > /dev/null 2>&1; then
        packages=$(yum list installed | awk '/ixgbe/ {print $1"@"$2}' | tr '\n' ',')
    else
        packages="NA"
    fi
    os_version=$(lsb_release -d | awk '{$1=""} 1' | sed 's/XenServer/XS/; s/ (xenenterprise)//; s/release //')
    printf "${format}" "${d}" "${name}" "${os_version}" "${interface}" "${driver}" "${version}" "${firmware}" "${packages}"
done

PS: host.json file is generated via : xo-cli --list-objects type=host
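
With many hosts, the table printed by get_network_drivers_info.sh gets long. As a rough sketch, the output can be piped through a small awk filter that counts interfaces per driver and version. The function name summarize_versions is hypothetical, and the filter assumes the column order of the format string above (driver in the fifth column, version in the sixth) with the header and separator on the first two lines.

```shell
#!/bin/bash
# Hypothetical helper: read the "|"-separated table on stdin and print
# "<count> <driver> <version>" for each distinct driver/version pair.
summarize_versions() {
    awk -F'|' '
        NR > 2 {                          # skip the header and separator rows
            gsub(/^ +| +$/, "", $6)       # trim padding around the driver name
            gsub(/^ +| +$/, "", $7)       # trim padding around the version
            count[$6 " " $7]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -k2
}

# Example use (not run here):
#   ./get_network_drivers_info.sh | summarize_versions
```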

stormi

FYI, I have just published security updates today PLUS the fixed ixgbe driver as an official update to XCP-ng 8.1 and 8.2.

We made it. This is the end of this huge thread.

A big thank you to everyone involved in debugging the issue.

And this is not a :D.

frankz

This doesn't fix the underlying issue, but you can run

echo 3 > /proc/sys/vm/drop_caches

to release some of the cache again, without interfering with running processes.

[root@host2 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          15958        3308         158           8       12491        2355
Swap:          1023         177         846
[root@host2 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@host2 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          15958        3308        2598          10       10051        2751
Swap:          1023         177         846
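
To compare the before and after numbers without reading the whole table, a small filter over the free -m output can help. This is only a sketch: the name mem_summary is invented here, and it assumes the column layout shown above (free in the fourth field of the Mem: row, buff/cache in the sixth).

```shell
#!/bin/bash
# Hypothetical helper: read `free -m` output on stdin and print the free and
# buff/cache values (in MiB) from the "Mem:" row.
mem_summary() {
    awk '/^Mem:/ { print "free=" $4 " buff_cache=" $6 }'
}

# Example use as root (not run here). Running sync first writes dirty pages
# to disk, since drop_caches only releases clean cache entries.
#   free -m | mem_summary
#   sync
#   echo 3 > /proc/sys/vm/drop_caches
#   free -m | mem_summary
```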