MajorP93

Hello XCP-ng community!

Since Vates released the new OpenMetrics plugin for Xen Orchestra, we now have an official, built-in exporter for Prometheus metrics!

I previously used xen-exporter to expose the hypervisor's internal RRD database as Prometheus metrics.
I have migrated to the new plugin, which works just fine.

I updated the Grafana dashboard I was using to make it compatible with the official OpenMetrics plugin and thought, "why not share it with other users?"

In case you are interested you can find my dashboard JSON here: https://gist.github.com/MajorP93/3a933a6f03b4c4e673282fb54a68474b

It is based on the xen-exporter dashboard made by MikeDombo: https://grafana.com/grafana/dashboards/16588-xen/

If you also use Prometheus to scrape the Xen Orchestra OpenMetrics plugin in combination with Grafana, you can copy the JSON from my gist, import it, and you are ready to go!

Hope it helps!

Might even be a good idea to include the dashboard as an example in the Xen Orchestra documentation.

Best regards

MajorP93

@MathieuRA I disabled Traefik, reverted to my old XO config (port 443, SSL encryption, HTTP to HTTPS redirection), rebuilt the Docker container using your branch, and tested:

It is working fine on my end now.

Thank you very much!

I did not expect this to get fixed so fast!

MajorP93

@Pilow said in backup mail report says INTERRUPTED but it's not ?:

@MajorP93 you say you have 8GB of RAM on XO, but it gets OOM-killed at 5GB of used RAM.

Did you do those additional steps in your XO config?

You can increase the memory allocated to the XOA VM (from 2GB to 4GB or 8GB).
Note that simply increasing the RAM for the VM is not enough.
You must also edit the service file (/etc/systemd/system/xo-server.service) 
to increase the memory allocated to the xo-server process itself.

You should leave ~512MB for the debian OS itself. Meaning if your VM has 4096MB total RAM, you should use 3584 for the memory value below.

- ExecStart=/usr/local/bin/xo-server
+ ExecStart=/usr/local/bin/node --max-old-space-size=3584 /usr/local/bin/xo-server

The last step is to refresh and restart the service:

$ systemctl daemon-reload
$ systemctl restart xo-server

Interesting!
I did not know that it is recommended to start Node.js with "--max-old-space-size=" set to (total system RAM - 512MB).
I added that, then restarted XO and my backup job.

I will test whether that makes my backup jobs more stable.
Thank you very much for taking the time and recommending the parameter.
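
The sizing rule quoted above (total VM RAM minus roughly 512MB for the OS) is simple arithmetic. Here is a minimal Python sketch of it; the function names and the `reserve_mb` default are illustrative assumptions and not part of Xen Orchestra, while the binary paths are the ones from the quoted service file.

```python
def max_old_space_size(total_ram_mb: int, reserve_mb: int = 512) -> int:
    """Return a value for node's --max-old-space-size flag, in MB.

    Assumption: reserve_mb is the headroom left for the OS
    (~512MB in the quoted documentation).
    """
    if total_ram_mb <= reserve_mb:
        raise ValueError("VM RAM must be larger than the OS reserve")
    return total_ram_mb - reserve_mb


def exec_start_line(total_ram_mb: int) -> str:
    """Build the ExecStart line for xo-server.service (paths from the quote)."""
    size = max_old_space_size(total_ram_mb)
    return (
        "ExecStart=/usr/local/bin/node "
        f"--max-old-space-size={size} /usr/local/bin/xo-server"
    )


if __name__ == "__main__":
    # 4096MB is the example from the quoted documentation, 8192MB is an 8GB VM.
    for ram_mb in (4096, 8192):
        print(exec_start_line(ram_mb))
```

After editing the service file, the change still has to be applied with the two systemctl commands from the quote.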

MajorP93

@Mang0Musztarda said in Xen Orchestra OpenMetrics Plugin - Grafana Dashboard:

@MajorP93 hi, how can I scrape the OpenMetrics endpoint?
I set up the OpenMetrics plugin Prometheus secret, enabled it, and then tried to use curl like this: curl -H "Authorization: Bearer abc123" http://localhost:9004
but the response I got was
{"error":"Query authentication does not match server setting"}
What am I doing wrong?

Hey!
I scrape it like so:

root@prometheus01:~# cat /etc/prometheus/scrape_configs/xen-orchestra-openmetrics.yml 
scrape_configs:
  - job_name: xen-orchestra
    honor_labels: true
    scrape_interval: 30s
    scrape_timeout: 20s
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /etc/prometheus/bearer.token
    metrics_path: /openmetrics/metrics
    static_configs:
    - targets:
      - xen-orchestra.domain.local

The /etc/prometheus/bearer.token file contains the bearer token as configured in the Xen Orchestra OpenMetrics plugin.
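
For a quick manual check outside Prometheus, the same request can be sketched in Python with only the standard library. This is a rough sketch, not an official client: the helper names are made up for illustration, the host, path and token file are taken from my scrape config above, and disabling certificate verification mirrors `insecure_skip_verify: true`.

```python
import ssl
import urllib.request


def build_metrics_request(host: str, token: str,
                          path: str = "/openmetrics/metrics") -> urllib.request.Request:
    """Build an HTTPS GET request carrying the bearer token (hypothetical helper)."""
    url = f"https://{host}{path}"
    headers = {"Authorization": f"Bearer {token.strip()}"}
    return urllib.request.Request(url, headers=headers)


def fetch_metrics(req: urllib.request.Request, verify_tls: bool = False,
                  timeout: float = 20.0) -> str:
    """Send the request and return the response body as text."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Same effect as insecure_skip_verify: true in the scrape config.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    with open("/etc/prometheus/bearer.token") as fh:
        token = fh.read()
    req = build_metrics_request("xen-orchestra.domain.local", token)
    print(fetch_metrics(req)[:500])  # first part of the metrics output
```

Note that this goes to the Xen Orchestra HTTPS address with the /openmetrics/metrics path, exactly like the scrape config does.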

best regards

MajorP93

I can also confirm that I was able to apply this round of patches using the rolling update method without any issues or slowdowns on a pool of 5 hosts.

MajorP93

@rzr Thank you very much!

@michmoor0725 Absolutely! The community is another reason why working with XCP-ng is a lot more fun than working with VMware!

MajorP93

@florent said in [VDDK V2V] Migration of VM that had more than 1 snapshot creates multiple VHDs:

@MajorP93 the sizes differ between the disks; did you modify them since the snapshots?

Would it be possible to take one new snapshot with the same disk structure?

Sorry, it was my bad indeed.
On the VMware side there are 2 VMs that have almost exactly the same name.
When I checked the disk layout to verify whether this was an issue, I looked at the wrong VM.

I checked again and can confirm that the VM in question has 1x 60GiB and 1x 25GiB VMDK.

So this is not an issue. It is working as intended.

Thread can be closed / deleted.
Sorry again and thanks for the replies.

Best regards
MajorP

MajorP93

said in Xen Orchestra Node 24 compatibility:

After moving from Node 22 to Node 24 on my XO instance I started to see more "Error: ENOMEM: not enough memory, close" for my backup jobs even though my XO VM has 8GB of RAM...

I will revert back to Node 22 for now.

I did some further troubleshooting and was able to narrow it down to SMB encryption on Xen Orchestra backup remotes (the "seal" CIFS mount flag).
The "ENOMEM" errors seem to occur only when I enable that option.
This seems to be related to buffering controlled by the Linux kernel's CIFS implementation, which fails when SMB encryption is used.
The CIFS operation gets killed due to buffer exhaustion caused by encryption, and Xen Orchestra shows "ENOMEM".
Somehow this issue is more visible with Node 24 than with Node 22, which is why I thought it was caused by the combination of Node version and XO version. I had switched the Node version at the same time I enabled SMB encryption.
However, this does not seem to be directly related to Xen Orchestra; it looks more like a Node / Linux kernel CIFS implementation issue.
Apparently it is not a Xen Orchestra bug per se.

MajorP93

@dom0 As mentioned previously, XCP-ng Center / XenCenter are third-party products and not officially supported.
It is generally advised to use Xen Orchestra for all administration / management tasks.

If it is a requirement for you to use a thick client (such as XCP-ng Center) you might want to try XenAdminQt: https://github.com/benapetr/XenAdminQt

It is also not officially supported, but it is a very new project that gets updated frequently. Maybe that one works better for you.

MajorP93

Hey,
small update:
while adding the backup section and "diskPerVmConcurrency" option to "/etc/xo-server/config.diskConcurrency.toml" or "~/.config/xo-server/config.diskConcurrency.toml" had no effect for me, I was able to get this working by adding it at the end of my main XO config file at "/etc/xo-server/config.toml".

Best regards

MajorP93

@abudef said:

@MajorP93 Sure I did

Hmm, you previously mentioned a different command, so I was not sure whether you had really followed the documentation correctly.

MajorP93

@abudef said:

Nested virtualization doesn’t work very well in Xen. For example, when I set up a small test playground, I had XCP-ng 8.3 as the primary host and another XCP-ng running on it as a nested host. When I then booted Debian 12 on the nested host, it caused the entire nested host to crash and reboot. On the other hand, Windows VMs on the nested host run quite well.

Later, when I was preparing a test lab, I installed ESXi 8.0U3e on a Dell server and then deployed four virtualized XCP-ng 8.3 hosts on top of it. A number of Linux and Windows VMs run on them without any issues.

Did you follow the official documentation for nested virtualization?

https://docs.xcp-ng.org/guides/xcpng-in-a-vm/#nested-xcp-ng-using-xcp-ng

Most importantly, set the following via the command line on the pool master:

xe vm-param-set uuid=<UUID> platform:exp-nested-hvm=true

and

xe vm-param-set uuid=<UUID> platform:nic_type="e1000"

MajorP93

@dinhngtu Looking forward to that! I'll stick with the Rust guest utilities for the time being. Hope to see that new release soon!

MajorP93

@dinhngtu Also, if it is related to the Rust-based guest agent: do you plan to fix the issue and release a new version soon, or is it advised to revert to the XenServer guest agent for now?

MajorP93

@dinhngtu Yes, I am using the Rust-based guest agent (https://gitlab.com/xen-project/xen-guest-agent) on all of my Linux guests.

Good job tracking the issue down to that!

Is a suspend/resume performed automatically during live migration?

//EDIT: I switched from the Citrix/XenServer guest agent to the Vates Rust one sometime in January, so maybe my assumption that this issue is related to the January round of patches was wrong.
Or maybe I did not see this issue earlier because, according to the changelog, ballooning down to dynamic min was broken before said round of patches.

MajorP93

@Pilow Yeah, true.
Also, during my tests with CBT-enabled backups, live migrations triggered the fallback to a full backup from time to time as well (the VM resided on a different host during the backup run than during the previous one).
But I did those tests ~6 months ago, so maybe some fixes have been applied in that regard.

You are right: in a thick-provisioned SR / block-based scenario, taking a snapshot would allocate the full size of the base VHD again... not really practical.

I really hope that CBT receives some more love, as we plan to move our storage to a vSAN cluster and intend to use iSCSI instead of NFS by then, so using CBT would be our best bet...
CBT also reduces the I/O load on the SR, as it removes the need to constantly coalesce disks when backup jobs create and delete snapshots.

@acebmxer Interesting. During my tests CBT was not really stable: backups fell back to a full quite often. I recall reading somewhere in the documentation that CBT has some quirks, so it seems to be a known issue.

MajorP93

@acebmxer According to your screenshots, it looks like you are using CBT for your backups.
I had the same issue (backup fell back to a full) back when I was using CBT.
After disabling CBT for all backup jobs and virtual disks, and therefore using the classic snapshot approach, everything is working fine. No more fallbacks to full backups.

MajorP93

@andriy.sultanov Hello! I had to use another VM, as the one used previously is a production VM and had to be rebooted in order to get dynamic max again.

This time I used a temporary test VM that I can use for this kind of test.

After live migrating this test VM, the issue is there again.
It is really easy to reproduce.

root@tmptest02:~# free -m
              total        used        free      shared  buff/cache   available
Mem:            3794         344        3511           3         109        3449
Swap:

journalctl -k | cat

Mär 11 14:26:19 tmptest02 kernel: Freezing user space processes
Mär 11 14:26:23 tmptest02 kernel: Freezing user space processes completed (elapsed 0.018 seconds)
Mär 11 14:26:23 tmptest02 kernel: OOM killer disabled.
Mär 11 14:26:23 tmptest02 kernel: Freezing remaining freezable tasks
Mär 11 14:26:23 tmptest02 kernel: Freezing remaining freezable tasks completed (elapsed 0.003 seconds)
Mär 11 14:26:23 tmptest02 kernel: suspending xenstore...
Mär 11 14:26:23 tmptest02 kernel: xen:grant_table: Grant tables using version 1 layout
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=9, pirq=16
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=8, pirq=17
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=12, pirq=18
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=1, pirq=19
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=6, pirq=20
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=4, pirq=21
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=7, pirq=22
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=23, pirq=23
Mär 11 14:26:23 tmptest02 kernel: xen: --> irq=28, pirq=24
Mär 11 14:26:23 tmptest02 kernel: usb usb1: root hub lost power or was reset
Mär 11 14:26:23 tmptest02 kernel: ata2: found unknown device (class 0)
Mär 11 14:26:23 tmptest02 kernel: usb 1-2: reset full-speed USB device number 2 using uhci_hcd
Mär 11 14:26:23 tmptest02 kernel: OOM killer enabled.
Mär 11 14:26:23 tmptest02 kernel: Restarting tasks: Starting
Mär 11 14:26:23 tmptest02 kernel: Restarting tasks: Done
Mär 11 14:26:23 tmptest02 kernel: Setting capacity to 125829120

xensource.log excerpt on the target XCP-ng host that the VM got live migrated to:

[14:35 xcpng02 log]# cat xensource.log | grep "Mar 11 14:26" | grep squeeze
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||squeeze] total_range = 20971520 gamma = 1.000000 gamma' = 18.007186
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||squeeze] Total additional memory over dynamic_min = 377638052 KiB; will set gamma = 1.00 (leaving unallocated 356666532 KiB)
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||squeeze] free_memory_range ideal target = 4296680
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||squeeze] change_host_free_memory required_mem = 4305896 KiB target_mem = 9216 KiB free_mem = 371880116 KiB
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||squeeze] change_host_free_memory all VM target meet true
Mar 11 14:26:06 xcpng02 squeezed: [debug||253 ||memory] reserved 4296680 kib for reservation 6f0f8c43-7ffa-ffbf-7723-a1be3c1a61d1
Mar 11 14:26:06 xcpng02 squeezed: [debug||254 ||squeeze_xen] Xenctrl.domain_setmaxmem domid=53 max=4297704 (was=0)
Mar 11 14:26:13 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- 1
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] Adding watches for domid: 53
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] Removing watches for domid: 52
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/initial-reservation <- 4296680
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/target <- 4196352
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /control/feature-balloon <- None
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- None
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/memory-offset <- None
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/uncooperative <- None
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/dynamic-min <- 4194304
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/dynamic-max <- 10485760
Mar 11 14:26:21 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- 1
Mar 11 14:26:22 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- 1
Mar 11 14:26:25 xcpng02 squeezed: [debug||3 ||squeeze_xen] domid 53 just started a guest agent (but has no balloon driver); calibrating memory-offset = 2024 KiB
Mar 11 14:26:25 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/memory-offset <- 2024
Mar 11 14:26:25 xcpng02 squeezed: [debug||3 ||squeeze_xen] Xenctrl.domain_setmaxmem domid=53 max=4199400 (was=4297704)
Mar 11 14:26:28 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- 1

I hope the filtered xensource.log (filtered for "squeezed") is enough, or do you need other events as well @andriy.sultanov?
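
To make the numbers in the excerpt easier to compare with `free -m`, here is a small Python sketch that pulls the `/memory/...` watch values out of the squeezed log lines and converts them to MiB. It assumes those values are in KiB (other squeezed lines in the same excerpt label their values as KiB); the helper name and the regular expression are only for illustration.

```python
import re

# Matches e.g. "watch /memory/target <- 4196352"; lines with "<- None" are skipped.
WATCH_RE = re.compile(r"watch (/memory/[\w-]+) <- (\d+)")


def memory_watches_mib(log_text: str) -> dict[str, float]:
    """Map each /memory/... key to its last logged value, converted KiB -> MiB."""
    result: dict[str, float] = {}
    for key, value in WATCH_RE.findall(log_text):
        result[key] = int(value) / 1024  # assumption: logged values are KiB
    return result


# Four lines copied from the xensource.log excerpt above.
SAMPLE = """\
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/initial-reservation <- 4296680
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/target <- 4196352
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/dynamic-min <- 4194304
Mar 11 14:26:19 xcpng02 squeezed: [debug||4 ||squeeze_xen] watch /memory/dynamic-max <- 10485760
"""

if __name__ == "__main__":
    for key, mib in memory_watches_mib(SAMPLE).items():
        print(f"{key}: {mib:.0f} MiB")
```

With these sample lines the target works out to 4098 MiB, right next to the dynamic min of 4096 MiB and far below the dynamic max of 10240 MiB, which fits the 3794 MiB total that `free -m` shows inside the guest.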
