Posts made by MajorP93 | XCP-ng and XO forum

MajorP93

@abudef said:

@MajorP93 Sure I did

Hmm, you previously mentioned a different command, so I was not sure whether you had really followed the documentation correctly.

MajorP93

@abudef said:

Nested virtualization doesn’t work very well in Xen. For example, when I set up a small test playground, I had XCP-ng 8.3 as the primary host and another XCP-ng running on it as a nested host. When I then booted Debian 12 on the nested host, it caused the entire nested host to crash and reboot. On the other hand, Windows VMs on the nested host run quite well.

Later, when I was preparing a test lab, I installed ESXi 8.0U3e on a Dell server and then deployed four virtualized XCP-ng 8.3 hosts on top of it. A number of Linux and Windows VMs run on them without any issues.

Did you follow the official documentation for nested virtualization?

https://docs.xcp-ng.org/guides/xcpng-in-a-vm/#nested-xcp-ng-using-xcp-ng

Most importantly, set the following via the command line on the pool master:

xe vm-param-set uuid=<UUID> platform:exp-nested-hvm=true

and

xe vm-param-set uuid=<UUID> platform:nic_type="e1000"
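
To avoid typos when filling in the UUID, the two commands above can be generated by a small dry-run helper. This is only a sketch: the function name nested_hvm_commands is made up for illustration, and it prints the commands rather than running them, so nothing changes until the output is pasted on the pool master.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints the two xe commands quoted
# in the post for a given VM UUID. It does not call xe itself.
nested_hvm_commands() {
  vm_uuid="$1"
  if [ -z "$vm_uuid" ]; then
    echo "usage: nested_hvm_commands <VM-UUID>" >&2
    return 1
  fi
  # Enable the experimental nested HVM flag for the VM.
  printf 'xe vm-param-set uuid=%s platform:exp-nested-hvm=true\n' "$vm_uuid"
  # Switch the emulated NIC type to e1000.
  printf 'xe vm-param-set uuid=%s platform:nic_type="e1000"\n' "$vm_uuid"
}
```

Calling nested_hvm_commands with the UUID of the nested host VM prints both lines for review before they are run.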

MajorP93

@dom0 As previously mentioned, XCP-ng Center / XenCenter are third-party products and not officially supported.
It is generally advised to use Xen Orchestra for all administration / management tasks.

If it is a requirement for you to use a thick client (such as XCP-ng Center) you might want to try XenAdminQt: https://github.com/benapetr/XenAdminQt

It is also not officially supported, but it is a very new project that gets updated frequently. Maybe that one works better for you.

MajorP93

@dinhngtu Looking forward to that! I'll stick with the Rust guest utilities for the time being. Hope to see that new release soon!

MajorP93

I can also confirm that I was able to apply this round of patches using the rolling update method without any issues or slowdowns on a pool of 5 hosts.

MajorP93

@dinhngtu Also, if it is related to the Rust-based guest agent: do you plan to fix the issue and release a new version soon, or is it advised to revert to the XenServer guest agent for now?

MajorP93

@dinhngtu Yes, I am using the Rust-based guest agent (https://gitlab.com/xen-project/xen-guest-agent) on all of my Linux guests.

Good job tracking the issue down to that!

Is a suspend/resume performed automatically during live migration?

//EDIT: I switched from the Citrix/XenServer guest agent to the Vates Rust one sometime in January, so maybe my assumption that this issue is related to the January round of patches was wrong.
Or maybe I did not see this issue earlier because, according to the changelog, ballooning down to dynamic min was broken before said round of patches.

MajorP93

@Pilow Yeah, true.
Also, during my tests of CBT-enabled backups, live migrations did trigger the fallback to a full backup from time to time as well (the VM residing on a different host during the backup run than during the previous backup run).
But I did those tests ~6 months ago, so maybe some fixes have been applied in that regard.

You are right: in a thick-provisioned SR / block-based scenario, taking a snapshot would result in the same size as the base VHD being allocated again for the snapshot... not really practical.

I really hope that CBT receives some more love, as we plan to move our storage to a vSAN cluster and intend to use iSCSI instead of NFS by that time, so using CBT would also be our best bet then...
CBT also reduces the I/O load on the SR, as it removes the need to constantly coalesce disks when backup jobs re-create / delete snapshots.

@acebmxer Interesting. During my tests CBT was not really stable, with backups falling back to full quite often. I recall reading somewhere in the documentation that CBT has some quirks, so it seems to be a known issue.

MajorP93

@acebmxer According to your screenshots it looks like you are using CBT for your backups.
I had the same issue (backup fell back to a full) back when I was using CBT.
After disabling CBT for all backup jobs and virtual disks, and therefore using the classic snapshot approach, all is working fine. No more fallbacks to full backups.

MajorP93

@andriy.sultanov Hello! I had to use another VM, as the one used previously is a production VM and had to be rebooted in order to get back to dynamic max.

This time I used a temporary test VM that I can use for this kind of test.

After live migrating this test VM, the issue is there again.
It is really easy to reproduce.

root@tmptest02:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            3794         344        3511           3         109        3449
Swap:

journalctl -k | cat

Mär 11 14:26:19 tmptest02 kernel: Freezing user space processes
Mär 11 14:26:23 tmptest02 kernel: Freezing user space processes completed (elapsed 0.018 seconds)
Mär 11 14:26:23 tmptest02 kernel: OOM killer disabled.
Mär 11 14:26:23 tmptest02 kernel: Freezing remaining freezable tasks
Mär 11 14:26:23 tmptest02 kernel: Freezing remaining freezable tasks completed (elapsed 0.003 seconds)
Mär 11 14:26:23 tmptest02 kernel: suspending xenstore...
Mär 11 14:26:23 tmptest02 kernel: xen:grant_table: Grant tables using version 1 layout
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=9, pirq=16
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=8, pirq=17
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=12, pirq=18
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=1, pirq=19
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=6, pirq=20
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=4, pirq=21
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=7, pirq=22
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=23, pirq=23
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=28, pirq=24
Mär 11 14:26:23 tmptest02 kernel: usb usb1: root hub lost power or was reset
Mär 11 14:26:23 tmptest02 kernel: ata2: found unknown device (class 0)
Mär 11 14:26:23 tmptest02 kernel: usb 1-2: reset full-speed USB device number 2 using uhci_hcd
Mär 11 14:26:23 tmptest02 kernel: OOM killer enabled.
Mär 11 14:26:23 tmptest02 kernel: Restarting tasks: Starting
Mär 11 14:26:23 tmptest02 kernel: Restarting tasks: Done
Mär 11 14:26:23 tmptest02 kernel: Setting capacity to 125829120

xensource.log excerpt on the target XCP-ng host that the VM got live migrated to:

[14:35 xcpng02 log]# cat xensource.log | grep "Mar 11 14:26" | grep squeeze
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||squeeze] total_range = 20971520 gamma = 1.000000 gamma' = 18.007186
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||squeeze] Total additional memory over dynamic_min = 377638052 KiB; will set gamma = 1.00 (leaving unallocated 356666532 KiB)
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||squeeze] free_memory_range ideal target = 4296680
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||squeeze] change_host_free_memory required_mem = 4305896 KiB target_mem = 9216 KiB free_mem = 371880116 KiB
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||squeeze] change_host_free_memory all VM target meet true
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||memory] reserved 4296680 kib for reservation 6f0f8c43-7ffa-ffbf-7723-a1be3c1a61d1
Mar 11 14:26:06 xcpng02 squeezed: [debug||254 ||squeeze_xen] Xenctrl.domain_setmaxmem domid=53 max=4297704 (was=0)
Mar 11 14:26:13 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- 1
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] Adding watches for domid: 53
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] Removing watches for domid: 52
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/initial-reservation <- 4296680
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/target <- 4196352
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /control/feature-balloon <- None
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- None
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/memory-offset <- None
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/uncooperative <- None
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/dynamic-min <- 4194304
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/dynamic-max <- 10485760
Mar 11 14:26:21 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- 1
Mar 11 14:26:22 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- 1
Mar 11 14:26:25 xcpng02 squeezed: [debug||3 ||squeeze_xen] domid 53 just started a guest agent (but has no balloon driver); calibrating memory-offset = 2024 KiB
Mar 11 14:26:25 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/memory-offset <- 2024
Mar 11 14:26:25 xcpng02 squeezed: [debug||3 ||squeeze_xen] Xenctrl.domain_setmaxmem domid=53 max=4199400 (was=4297704)
Mar 11 14:26:28 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- 1

I hope the filtered xensource.log (filtered for "squeezed") is enough. Or do you need other events as well, @andriy.sultanov?
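
The memory values in the squeezed excerpt above are easier to compare once converted to MiB. The following is a minimal sketch that assumes the xenstore values written by squeezed are in KiB (consistent with the "KiB" suffixes elsewhere in the same log); the function name summarize_squeezed is made up for illustration.

```shell
#!/bin/sh
# Reads xensource.log lines on stdin and prints the target, dynamic-min
# and dynamic-max watch values in MiB.
# Assumption: the logged values are in KiB.
summarize_squeezed() {
  awk '/watch \/memory\/(target|dynamic-min|dynamic-max) <- [0-9]+/ {
    # $(NF-2) is the xenstore key, $NF is the value in KiB.
    printf "%s %d MiB\n", $(NF-2), $NF / 1024
  }'
}
```

Used as "grep squeeze xensource.log | summarize_squeezed". For the excerpt above, this gives a target of 4098 MiB against a dynamic-min of 4096 MiB and a dynamic-max of 10240 MiB, which lines up with the guest staying near dynamic min after the migration.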

MajorP93

Hello @dinhngtu , thanks for your reply.
This time I had to use another VM that also has the issue, since the one used for illustration in my initial post got rebooted and is back to normal again.

Below please find the relevant output:

dmesg:

[ 7085.628802] Freezing user space processes
[ 7085.652324] Freezing user space processes completed (elapsed 0.023 seconds)
[ 7085.652367] OOM killer disabled.
[ 7085.652370] Freezing remaining freezable tasks
[ 7092.127309] Freezing remaining freezable tasks completed (elapsed 6.475 seconds)
[ 7092.178860] suspending xenstore...
[ 7092.197832] xen:grant_table: Grant tables using version 1 layout
[ 7092.198328] xen: --> irq=9, pirq=16
[ 7092.198349] xen: --> irq=8, pirq=17
[ 7092.198370] xen: --> irq=12, pirq=18
[ 7092.198390] xen: --> irq=1, pirq=19
[ 7092.198423] xen: --> irq=6, pirq=20
[ 7092.198479] xen: --> irq=4, pirq=21
[ 7092.198525] xen: --> irq=7, pirq=22
[ 7092.198571] xen: --> irq=23, pirq=23
[ 7092.198616] xen: --> irq=28, pirq=24
[ 7092.218627] usb usb1: root hub lost power or was reset
[ 7092.378841] ata2: found unknown device (class 0)
[ 7092.463544] usb 1-2: reset full-speed USB device number 2 using uhci_hcd
[ 7092.705603] OOM killer enabled.
[ 7092.705610] Restarting tasks: Starting
[ 7092.707866] Restarting tasks: Done
[ 7092.746034] Setting capacity to 125829120

root@docker-xo01:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            4821        1850         683           3        2654        2970
Swap:           7123          31        7092

[12:40 xcpng01 ~]# xe vm-param-get uuid=b010699a-4c63-b44d-b5ac-c7b51e7c0d68 param-name=memory-target
5372903424
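
The memory-target value above is returned in bytes, while free -m reports MiB. A tiny conversion helper makes the two easier to compare; this is only a sketch, and the function name bytes_to_mib is made up for illustration.

```shell
#!/bin/sh
# Converts a byte count (as printed by "xe vm-param-get ... param-name=memory-target")
# to whole MiB using integer division.
bytes_to_mib() {
  echo $(( $1 / 1048576 ))
}
# bytes_to_mib 5372903424   prints 5124
```

So the target for this VM is 5124 MiB, to be compared with the 4821 MiB total reported by free -m inside the guest.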

This is the memory configuration of this particular VM in Xen Orchestra:

@dinhngtu Anything else to check/provide? I could reboot the VM and get output of "xe vm-param-get uuid=... param-name=memory-target" again.

Best regards

MajorP93

Hello XCP-ng community and Vates team,

I have noticed an issue regarding dynamic memory control (DMC) or rather with its memory ballooning feature.
I have DMC configured for quite a few VMs.

Before January's round of patches, everything had been working fine. Since applying those patches, a VM that has DMC enabled gets its RAM shrunk down to "dynamic min" during live migration, but it never gets expanded back to its normal value after the live migration has finished.
It stays at its dynamic min RAM until it gets rebooted.
After cleanly rebooting the VM via Xen Orchestra, RAM is at its normal value (dynamic max) again.

I did not report this issue earlier since I was hoping it would be fixed by the March round of patches, but unfortunately the issue is still there.

I suspect the bug was introduced by this change, which is mentioned in the January update release notes:
"Fix regression on dynamic memory management, in XAPI, during live migration, causing VMs not to balloon down before the migration."
I feel like after this change VMs get ballooned down more aggressively before the migration but, in my case, never get ballooned up again.

I currently have multiple VMs that have this issue. Below please find dmesg output of one of the VMs in question:

[24931.647978] Freezing user space processes
[24931.693168] Freezing user space processes completed (elapsed 0.044 seconds)
[24931.693392] OOM killer disabled.
[24931.693395] Freezing remaining freezable tasks
[24931.704356] Freezing remaining freezable tasks completed (elapsed 0.010 seconds)
[24931.764470] suspending xenstore...
[24931.859456] xen:grant_table: Grant tables using version 1 layout
[24931.859792] xen: --> irq=9, pirq=16
[24931.859818] xen: --> irq=8, pirq=17
[24931.859843] xen: --> irq=12, pirq=18
[24931.859867] xen: --> irq=1, pirq=19
[24931.859891] xen: --> irq=6, pirq=20
[24931.859913] xen: --> irq=4, pirq=21
[24931.859937] xen: --> irq=7, pirq=22
[24931.859960] xen: --> irq=23, pirq=23
[24931.859984] xen: --> irq=28, pirq=24
[24931.889330] usb usb1: root hub lost power or was reset
[24932.179596] usb 1-2: reset full-speed USB device number 2 using uhci_hcd
[24932.436726] OOM killer enabled.
[24932.436732] Restarting tasks ... done.
[24932.492024] Setting capacity to 1048576000
[24932.493343] Setting capacity to 524288000
[24932.494439] Setting capacity to 4273496064

After live migration, before rebooting:

root@mldmzdocker:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            2300        1135         352         109        1185        1164
Swap:            979          65         914

After rebooting:

root@mldmzdocker:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            7932         754        6747           9         679        7178
Swap:            979           0         979
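
The two free -m outputs above can be compared with a couple of small helpers. This is a minimal sketch with made-up function names (mem_total_mib, percent_of); it only reads the "Mem:" line of English-locale free output.

```shell
#!/bin/sh
# Extracts the "total" column (MiB) from "free -m" output on stdin.
mem_total_mib() {
  awk '$1 == "Mem:" { print $2 }'
}

# Prints how many percent (rounded down) the first value is of the second.
percent_of() {
  if [ "$2" -eq 0 ]; then
    echo "expected value must be non-zero" >&2
    return 1
  fi
  echo $(( $1 * 100 / $2 ))
}
# percent_of 2300 7932   prints 28
```

Inside the guest, "free -m | mem_total_mib" gives the current total. With the values above, the VM is left with roughly 28% of the memory it has after a clean reboot.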

Also: this issue seems to only occur on Linux guests. I tested on Debian 12/13 (affected) and Windows Server 2019 (not affected).

Please let me know if I can provide something else that could possibly help to track down the issue.

Thanks in advance and best regards.

MajorP93

@florent Thanks for the reply. I will rebuild XO and test with current master.

MajorP93

After leaving my XO CE VM without a reboot for 9 days, I got hit with increased RAM usage and the "interrupted" backup error again.

@florent is there a branch already that I can test?

MajorP93

@gduperrey After running "yum clean metadata && yum check-update" the package updates are shown. Thanks.

MajorP93

@gduperrey On my pool master there are no updates available in yum repository (checked via "yum check-update").
Will it take some more time?

MajorP93

@acebmxer Okay I see. Yeah that correlates with @florent 's observation that the problem was not visible in a small test environment. Only during high backup load the problem really becomes visible. (In my case 106 VMs, backup jobs running in parallel)

MajorP93

@acebmxer That is very interesting! So according to your tests the XOA backup memory leak is only occurring when Windows VMs are being backed up?

MajorP93

@olivierlambert I understood him to mean that his XO has multiple pools connected to it, hence the suggestion to disconnect the pool that only has this one host as a member.

MajorP93

@Kajetan321 said in Deteching Host is Failing with Error:

Cannot_eject_master

According to your log the host that you are trying to decommission / detach is your pool master.
You have to designate a new master before you can detach that particular host.

You can do that via Xen Orchestra: Home --> Pools --> YourPool --> Advanced --> Master --> click on name of current master and select a new one.

After that you should be able to detach the host in question.
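
For reference, the same two steps can also be done from the command line with xe. The sketch below only prints the commands; the function name master_swap_commands is made up for illustration, and treating "xe pool-eject" as the CLI counterpart of detaching a host in Xen Orchestra is an assumption. As far as I know, ejecting a host reinitializes it, so check the documentation before running it on a host with local data.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints the xe commands to promote
# another pool member to master and then eject the old master.
# It does not call xe itself.
master_swap_commands() {
  new_master_uuid="$1"
  old_master_uuid="$2"
  if [ -z "$new_master_uuid" ] || [ -z "$old_master_uuid" ]; then
    echo "usage: master_swap_commands <NEW-MASTER-UUID> <OLD-MASTER-UUID>" >&2
    return 1
  fi
  # Step 1: designate another member of the pool as the new master.
  printf 'xe pool-designate-new-master host-uuid=%s\n' "$new_master_uuid"
  # Step 2: remove the former master from the pool.
  printf 'xe pool-eject host-uuid=%s\n' "$old_master_uuid"
}
```

Host UUIDs can be looked up with "xe host-list".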

//EDIT: Oh I just read that it is the only host in that pool. In that case you should disconnect the whole pool (the pool that has only 1 host, the 1 host in question).