Posts made by tmk | XCP-ng and XO forum

tmk

Yes, we monitor observability metrics for guest OSes via other means but being able to see this info via the Xen Orchestra web UI along with the other metrics in the Stats tab definitely has is benefits, especially when performing initial troublehshooting or when dealing with groups in our org that only have access to the XO interface and not the other metrics dashboards.

Don't get me wrong, I definitely appreciate all of the work that you are all doing to get these tools working and can't wait to start using them in production. I just need to make sure that they work as expected.

tmk

@olivierlambert Ok thanks, good to know that there's a patch in the works. I'll wait for that before further testing. The inability to see memory usage in the stats is one of those things that will need to be resolved before we can switch from the XenServer VM Tools.

tmk

I've been testing out the new XCP-ng Windows PV drivers (version 9.0.9137) and I've run across what appears to be a bug.

The tools status shows normally in Xen Orchestra for a VM with the tools installed but if I migrate a VM to a different host Xen Orchestra seems to lose that connection. After migration it shows a status of "Management agent not detected". The only thing that can bring it back to recognizing that the tools are installed is a reboot of the guest. I've tried restarting all 4 of the xcp-ng/xen related Windows services and that doesn't help either, only a reboot.

I've tested this on a Windows Server 2022 VM that had the XenServer VM Tools removed with XenClean and then the new tools installed. I have also tested it with a fresh install of Windows Server 2019 and only the new tools installed. Finally, I have confirmed the same behavior with a Windows 11 VM that had the XenServer VM Tools removed with XenClean and then the new tools installed.

We are running XCP-ng with all of the latest updates installed and Xen Orchestra version 5.112.1

One of the side effects of the tools not being detected after a migration is that it no longer shows memory usage on the Stats tab (until a reboot of the VM). Here are some before/after migration screenshots of one of the servers with some of the changes highlighted.

General tab before migration:

General tab after migration:

Advanced tab before migration:

Advanced tab after migration:

Stats tab before migration:

Stats tab after migration:

Has anyone else seen this behavior?

tmk

I was wondering if anyone else is seeing this behavior from their backups in XOA. We've been experimenting with different settings in our backup jobs and just started enabling retries if a backup fails.

What I've found is that the retry works and the backup is successful but the overall job status still shows as failed and the individual VM backup also shows as failed.

I have confirmed that the backup ultimately succeeded as I can see it as an available restore point for the VM when I go to the restore tab in XO. We're currently running XOA version 5.112.1, not sure if this is a new issue though as we just started experimenting with the retry setting.

Normally if a backup fails, retries, then succeeds on retry I would expect a final status of success. Here's a screenshot of what I'm seeing:

tmk

@florent
Sorry for the late reply, yes, we had previously tried that setting and found that it did not provide any speed increase in our case. One thing to note is that we do have our backup jobs configured to merge backups synchronously, we're starting to test some of our jobs with that setting disabled.

We had originally turned it on because we experienced a lot of backup failures due to locking errors. We've since added additional proxies as we've found that the amount of data that a single proxy can backup in a nightly window was the primary bottleneck in our environment.

Since adding the additional proxies we've started disabling the synchronous backup merge for several of our jobs and so far it has been working pretty well and our backup times have been running faster (obviously since the final merge was the vast majority of the time that we observed in our backup steps).

tmk

Sorry in advance for the long-winded post but I am experiencing some long backup times and trying to understand where the bottleneck lies and if there’s anything that I could change to improve the situation.

We’re currently using backups through the XOA, backing up to remotes on a Dell Data Domain via NFS. The backup is configured to use an XOA backup proxy for this job to keep the load off of our main XOA.

As an example we have a delta backup job configured for a pool and it backs up about 100 VMs. We have our concurrency set to 16, we use NBD and changed block tracking and we merge backups synchronously. The last backup for this job took 15 hours and moved just over 2 TiB of data.

After examining the logs from this backup (downloading the json and converting to excel format for easier analysis) I found that there are 4 distinct phases for each VM backup: an initial clean, a snapshot, a transfer and a final clean. I have also found that the final clean phase takes by far the most amount of time on each backup.

The Initial Clean Duration time for each server was typically somewhere between a couple seconds and 30 seconds.

The Snapshot Duration was somewhere between 2-10 minutes per VM.

The Transfer Duration varied between a few seconds and around 30 minutes.

The Final Clean Duration however was anywhere between 25 minutes on the low end to almost 5 hours on the high end. The amount of time that this phase took was not proportional to the disk size of the vm being backed up or the transfer size for the backup. I found 2 VMs, each with a single 100GB hard disk and both moved around 20GB of changed data. One of them experienced a Final Clean Duration of 30 minutes and the other was 4 hours and 30 minutes in the same backup job.

We also have a large vmware infrastructure and use Dell Power Protect to backup the VMs there to the same Data Domain and we do not see similar issues with backup times in that system. So that got me thinking what the differences were between them and how some of those differences might be affecting the backup job duration.

One of the biggest differences that I could come up with was the fact that Power Protect uses the DDBoost protocol to communicate with the Data Domains whereas we had to create NFS exports from the Data Domain to use as backup remotes in Xen Orchestra.

Since DDBoost uses client side deduplication it significantly cuts down on the amount of data transferred to the Data Domain. But our transfer time wasn’t the bottleneck here, it was the final clean duration time.

This led me to investigate what is actually happening during this phase and please correct me if I’m wrong but it seems like when XO performs coalescing over NFS after the backup:

The coalescing process reads each modified block from the child VHD and writes it back to the parent VHD.

Over NFS, this means:

Read request travels to Data Domain
Data Domain reconstructs the deduplicated block (rehydration)
Full block data travels back to the proxy (or all the way back to the the xcp-ng host, I’m not entirely sure on this one)
xcp-ng processes the block
Full block data travels back to Data Domain
Data Domain deduplicates it again (often finding it's duplicate)

So it seems that the Data Domain must constantly rehydrate (reconstruct) deduplicated data for reads, only to immediately deduplicate the same data again on writes.

With DDBoost, it seems like this cycle doesn't happen because the client already knows what's unique.

So it seems that each write during coalescing potentially triggers:

Deduplication processing
Compression operations
Copy-on-write overhead for already deduplicated blocks

This happens for every block during coalescing, even though most blocks haven't actually changed.

So I guess I have a few questions. Is anyone else using NFS to a Data Domain as a backup target for backups in Xen Orchestra and if so have you seen the same kind of performance?

For others that backup to a target device that doesn’t handle inline dedup and compression do you see the same or better performance from your backup job times?

Does Vates have any plans to incorporate the DDBoost library as an option for the supported protocols when connecting a backup remote?

Is there any expectation that the qcow2 disk format could help with this at all vs vhd format?

tmk

Sounds good thanks!

tmk

Does anyone know if there is a way to specify the cpu topology (the number of sockets and cores per socket) of a new VM when using Terraform for cloning from a VM template? From what I can tell the vatesfr/xenorchestra Terraform provider doesn't allow us to specify anything other than the total number of cpus.

The issue that I ran into is when deploying new VMs, they will inherit the number of cores per socket from the template and increase the number of sockets to match the number of cpus specified. This has caused some weird performance issues in the underlying guest OSes where they see a large number of sockets and a small number of cores per socket. I suspect this has something to do with NUMA and CPU scheduling in the guest but not 100% sure. Either way, reducing the sockets and increasing the cores fixed the performance issues.

As a work-around I tried setting the cpu topology for the template to be 1 socket and the actual number of cores for a single socket that exist on the host (in my case it was 28 cores). This got me closer as the new VM that was deployed from template had only 1 socket and the correct number of VCPUs-at-startup specified but the cpu topology of the new VM still showed 1 socket, 28 cores per socket and 8/28 for CPU limits. This seems to indicate that CPU hotadd was enabled with the VCPUs-max set to 28 for the VM.

I guess my question around this would be does anyone know if there are any performance penalties for configuring different VCPUs-at-startup and VCPUs-max values? I know enabling CPU hotadd in VMWare incurres up to a 10% performance penalty. I also found in this scenario that if I tried to deploy a vm from that same template with more than 28 cores it would fail with the error: Not a divisor of the VMs max CPUs.

It would be nice to just to specify a cpu topology for a new vm during deployment. This issue has turned what was a completely automated deployment into one that now requires manual intervention for each VM that's being deployed.

If there is no other answer does anyone know the best place to submit feature requests for the Xen Orchestra terraform provider?

tmk

@Pilow I was using Powershell 7. If you are using a self-signed certificate on the XOA server then you'll have issues as the -SkipCertificateCheck parameter isn't available for the Invoke-RestMethod command in Powershell 5.

tmk

I recently developed a PowerShell script that fills a need that I couldn't find with the built-in reporting options for Xen Orchestra backups.

The script is available on GitHub, It works with PowerShell 7 (if using self-signed certs and using the SkipCertificateCheck parameter), otherwise it may work with Windows Powershell 5, I just didn't get a chance to test that yet.

Repository: https://github.com/codekeller/XO-PS-Scripts/tree/5ed1ce9915a41b266af6db54e29236e6f4265143/Xen Orchestra Backup Report

The script connects to your Xen Orchestra server via REST API and generates HTML reports that combine:

Complete backup job inventory - Shows all configured VM backups, metadata backups, and replication jobs

Execution status correlation - Matches job definitions with recent execution history to show what's actually running vs. what's configured

Professional reporting - Clean HTML output with a list of all configured backups jobs and their most recently run status within the last 24 hours (configurable)

Automated delivery - Optional email integration for scheduled reporting

Why I Built This

Managing backup jobs across multiple pools and sites, I found myself constantly logging into XO to check backup status and manually correlating job definitions with execution results. This script automates that entire process and provides the kind of professional reports that management actually wants to see.

Sample Output

The reports include a summary showing success/failure counts, job definitions organized by type with current status, and detailed execution logs with timing information. The HTML is optimized for both email delivery and web viewing.

The README includes detailed documentation and real-world usage examples. I've also included sample reports so you can see exactly what output to expect.

Getting Started

Basic usage is straightforward with additional options available:

.\Get-XenOrchestraBackupReport.ps1 -XenOrchestraUrl "https://xo.company.com" -ApiToken "your-token" -OutputPath "backup-report.html"

Would love feedback from the community

tmk

@MK.ultra I'd say make sure that your network-config is formatted properly with the correct YAML formatting (posting the content on this forum strips some of the formatting). Also, verify that you're using the correct network adapter name, "Ethernet 2" was just the name of the NIC I was using in my template.

If that's not an issue the next step would be to check the cloudbase-init logs to see what's reported there. You may need to enable debug logging in the cloudbase config file to get all of the relevant info depending on the issue.

tmk

@jkatz

I ran into the same issue when trying to configure a network adapter with cloudbase-init. The documentation says that the MAC address value is optional but in reality it is required.

In my case I want Xen Orchestra to choose a unique MAC during deployment, and I since I am deploying from a template, the NIC name is a known value. The fix that ended up working for me was to modify the networkconfig.py file in cloudbase-init so that the NIC name is required and the MAC address is optional.

I ended up making some additional changes to allow for the network-config v2 format along with the existing v1 support and some additional logic to aid in setting the dns search domains (I can't recall if this was originally supported or not but I had issues getting it to work with the original networkconfig.py file)

This file needs to replace the existing one that is installed in the C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\Lib\site-packages\cloudbaseinit\plugins\common\ directory. To replace it you should make sure that the cloudbase-init service is stopped, then replace the file.

Once this file is replaced, delete the pycache folder in the same parent folder as networkconfig.py - this will ensure that python recompiles this file on service start. Start the cloudbase-init service and confirm that you see a new pycache get created.

I'm not a python programmer by trade so others may be able to point out areas for improvement but this ended up working for me and I wanted to share in case it could help others needing to deploy new servers without manually specifying a new MAC address. Below is an example network-config v2 format that works with the updated file.

version: 2
ethernets:
Ethernet 2:
dhcp4: false
addresses:
- 10.20.30.10/24
nameservers:
addresses:
- 10.20.5.12
- 10.20.5.13
- 10.20.5.14
search:
- intranet.domain.org
- domain.org
- public-domain.org
routes:
- to: default
via: 10.20.30.1

Updated networkconfig.py file:
networkconfig.py.txt