XCP-ng
    MajorP93

    Posts
    • Xen Orchestra OpenMetrics Plugin - Grafana Dashboard

      Hello XCP-ng community!

      Since Vates released the new OpenMetrics plugin for Xen Orchestra, we now have an official, built-in exporter for Prometheus metrics!

      I was using xen-exporter before to expose the hypervisor's internal RRD database as Prometheus metrics.
      I migrated to the new plugin, which works just fine.

      I updated the Grafana dashboard I was using to be compatible with the official OpenMetrics plugin and thought: why not share it with other users?

      In case you are interested you can find my dashboard JSON here: https://gist.github.com/MajorP93/3a933a6f03b4c4e673282fb54a68474b

      It is based on the xen-exporter dashboard made by MikeDombo: https://grafana.com/grafana/dashboards/16588-xen/

      If you also use Prometheus to scrape the Xen Orchestra OpenMetrics plugin and visualize the data in Grafana, you can copy the JSON from my gist, import it, and you are ready to go!
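      If you prefer to script the import instead of using the Grafana UI, something along these lines should work via the Grafana HTTP API (the hostname, API token and file name are placeholders for your environment, and depending on how the JSON was exported you may still need to adjust the data source after the first import):

      # sketch: import the dashboard JSON via the Grafana HTTP API
      # (URL, token and file name are placeholders)
      curl -X POST "https://grafana.example.local/api/dashboards/db" \
        -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
        -H "Content-Type: application/json" \
        -d "{\"dashboard\": $(cat xen-orchestra-openmetrics-dashboard.json), \"overwrite\": true}"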

      Hope it helps!

      Might even be a good idea to include the dashboard as an example in the Xen Orchestra documentation. 🙂

      Best regards

      posted in Infrastructure as Code
    • RE: XO5 breaks after defaulting to XO6 (from source)

      @MathieuRA I disabled Traefik, reverted to my old XO config (port 443, SSL encryption, HTTP-to-HTTPS redirection), rebuilt the Docker container using your branch, and tested:

      it is working fine on my end now 🙂

      Thank you very much!

      I did not expect this to get fixed so fast!

      posted in Xen Orchestra
    • RE: Xen Orchestra OpenMetrics Plugin - Grafana Dashboard

      @Mang0Musztarda said in Xen Orchestra OpenMetrics Plugin - Grafana Dashboard:

      @MajorP93 hi, how can I scrape the openmetrics endpoint?
      I set up the openmetrics plugin prometheus secret, enabled it, and then tried to use curl like that: curl -H "Authorization: Bearer abc123" http://localhost:9004
      but the response I got was
      {"error":"Query authentication does not match server setting"}
      what am I doing wrong?

      Hey!
      I scrape it like so:

      root@prometheus01:~# cat /etc/prometheus/scrape_configs/xen-orchestra-openmetrics.yml 
      scrape_configs:
        - job_name: xen-orchestra
          honor_labels: true
          scrape_interval: 30s
          scrape_timeout: 20s
          scheme: https
          tls_config:
            insecure_skip_verify: true
          bearer_token_file: /etc/prometheus/bearer.token
          metrics_path: /openmetrics/metrics
          static_configs:
          - targets:
            - xen-orchestra.domain.local
      

      The /etc/prometheus/bearer.token file contains the bearer token as configured in the Xen Orchestra OpenMetrics plugin.
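      If you want to test the endpoint manually first, a request along these lines should return the metrics (it simply mirrors the scrape config above; the hostname is a placeholder and -k corresponds to insecure_skip_verify):

      # sketch: query the XO OpenMetrics endpoint directly with the bearer token
      curl -k \
        -H "Authorization: Bearer $(cat /etc/prometheus/bearer.token)" \
        https://xen-orchestra.domain.local/openmetrics/metrics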

      Best regards

      posted in Infrastructure as Code
    • RE: Remote syslog broken after update/reboot? - Changing it away, then back fixes.

      @rzr Thank you very much!

      @michmoor0725 Absolutely! The community is another aspect of why working with XCP-ng is a lot more fun than working with VMware!

      posted in Compute
    • RE: [VDDK V2V] Migration of VM that had more than 1 snapshot creates multiple VHDs

      @florent said in [VDDK V2V] Migration of VM that had more than 1 snapshot creates multiple VHDs:

      @MajorP93 the sizes are different between the disks, did you modify them since the snapshots?

      would it be possible to take one new snapshot with the same disk structure?

      Sorry, it was my bad indeed.
      On the VMware side there are two VMs with almost exactly the same name.
      When I checked the disk layout to verify whether this was an issue, I looked at the wrong VM. 🤦

      I checked again and can confirm that the VM in question has 1x 60GiB and 1x 25GiB VMDK.

      So this is not an issue. It is working as intended.

      Thread can be closed / deleted.
      Sorry again and thanks for the replies.

      Best regards
      MajorP

      posted in Xen Orchestra
    • RE: Xen Orchestra Node 24 compatibility

      said in Xen Orchestra Node 24 compatibility:

      After moving from Node 22 to Node 24 on my XO instance I started to see more "Error: ENOMEM: not enough memory, close" for my backup jobs even though my XO VM has 8GB of RAM...

      I will revert back to Node 22 for now.

      I did some further troubleshooting and was able to pin the problem down to SMB encryption on Xen Orchestra backup remotes (the "seal" CIFS mount flag).
      The "ENOMEM" errors seem to occur only when I enable that option.
      It appears to be related to buffering in the Linux kernel's CIFS implementation that fails when SMB encryption is used:
      the CIFS operation gets killed due to buffer exhaustion caused by encryption, and Xen Orchestra reports "ENOMEM".
      Somehow the issue is more visible with Node 24 than with Node 22, which is why I initially thought it was caused by the Node + XO version combination; I had switched the Node version at the same time I enabled SMB encryption.
      However, this does not seem to be directly related to Xen Orchestra; it looks more like a Node / Linux kernel CIFS issue.
      Apparently not a Xen Orchestra bug per se.
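      For context, "seal" is the CIFS mount option that requests SMB-layer encryption; a remote mounted with encryption enabled looks roughly like this (server, share, mount point and credentials file are placeholders):

      # sketch: CIFS mount with SMB encryption requested via the "seal" option
      mount -t cifs //fileserver.example.local/xo-backups /mnt/xo-backup \
        -o credentials=/root/.smbcredentials,vers=3.1.1,seal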

      posted in Xen Orchestra
    • RE: Long backup times via NFS to Data Domain from Xen Orchestra

      Hey,
      small update:
      while adding the backup section and "diskPerVmConcurrency" option to "/etc/xo-server/config.diskConcurrency.toml" or "~/.config/xo-server/config.diskConcurrency.toml" had no effect for me, I was able to get this working by adding it at the end of my main XO config file at "/etc/xo-server/config.toml".
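      For reference, the addition looks roughly like this (a sketch; I am assuming the section header is spelled [backups], and 2 is just an example value):

      # sketch only: append the backup concurrency setting to the main xo-server config
      # (assumes the option lives in a [backups] section; 2 is an example value)
      printf '\n[backups]\ndiskPerVmConcurrency = 2\n' >> /etc/xo-server/config.toml

      xo-server only reads its config at startup, so restart it afterwards for the change to take effect.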

      Best regards

      posted in Backup
    • RE: Potential bug with Windows VM backup: "Body Timeout Error"

      I worked around this issue by changing my full backup job to "delta backup" and enabling "force full backup" in the schedule options.

      Delta backup seems more reliable as of now.

      Looking forward to a fix as Zstd compression is an appealing feature of the full backup method.

      posted in Backup
    • RE: Potential bug with Windows VM backup: "Body Timeout Error"

      I can imagine that a fix could be to send "keepalive" packets in addition to the XCP-ng VM export data stream, so that the timeout on the XO side does not occur 🤔

      posted in Backup
    • Restoring folder via backup file restore feature broken for .tar.gz

      Hello XCP-ng community and Vates-Team,

      I just observed a weird behavior of Xen Orchestra during backup file restore.

      Background: I had to restore a directory that got deleted on a small file server Windows VM by accident.

      I used Xen Orchestra's file restore menu to select the VM, the restore point and the path of the directory in question.
      Initially I selected .tar.gz as the export format and started the restore process.
      A new browser tab opened, and after a few minutes it showed "Error proxying request".
      Xen Orchestra then became almost fully unresponsive for about 5 minutes, but started to behave normally again after that.

      I then tried the same thing again: same VM, restore point, path, etc., but this time I opted for the ".zip (slow)" option as the export format.
      That worked without any issues; the download started after about 5 seconds.

      Did somebody else encounter similar issues?
      Just wanted to report this and ask, since maybe the .tar.gz functionality of Xen Orchestra needs investigation.

      Thanks and best regards

      //EDIT: oh, I forgot to mention: I am running a fully patched XCP-ng 8.3 pool and the latest XO CE on a Debian 13 VM. The Node.js version is 24 LTS.

      posted in Backup
    • RE: "NOT_SUPPORTED_DURING_UPGRADE()" error after yesterday's update

      @magicker said in "NOT_SUPPORTED_DURING_UPGRADE()" error after yesterday's update:

      @olivierlambert said in "NOT_SUPPORTED_DURING_UPGRADE()" error after yesterday's update:

      Because doing an update without rebooting doesn't reload the updated main programs, like XAPI. A host is only updated after a full reboot.


      Hi there
      Is it just me, or is this a chicken-and-egg situation?

      You upgrade the master... now the pool is in the NOT_SUPPORTED_DURING_UPGRADE() state. You can't move VMs off the master, so all you can do is shut down VMs.. reboot.. pray.

      Then move to a non-master.. you can't move the VMs off here either, NOT_SUPPORTED_DURING_UPGRADE(). So you have to do the same..

      Needless to say, I hit issues on each reboot which caused 30-60 min delays in getting VMs back up and running.

      Can you warm migrate, or is this dead also (too scared to test)?

      For me this workflow worked every time there were upgrades available:

      - disable HA on pool level
      - disable load balancer plugin
      - upgrade master
      - upgrade all other nodes
      - restart toolstack on master
      - restart toolstack on all other nodes
      - live migrate all VMs running on master to other node(s)
      - reboot master
      - reboot next node (live migrate all VMs running on that particular node away before doing so)
      - repeat until all nodes have been rebooted (one node at a time)
      - re-enable HA on pool level
      - re-enable load balancer plugin

      Never had any issues with that, and no downtime for any of the VMs. A rough xe CLI sketch of these steps is below.
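      A rough sketch of the xe CLI side of this workflow (UUIDs are placeholders; the load balancer plugin itself is toggled in the XO UI, and host updates are installed with yum as usual):

      xe pool-ha-disable                              # disable HA at the pool level
      yum update                                      # install updates on the master, then on every other host
      xe-toolstack-restart                            # restart the toolstack on the master, then on the other hosts

      # then, for the master and afterwards one host at a time:
      xe host-disable uuid=<host-uuid>                # keep new VMs from starting on this host
      xe host-evacuate uuid=<host-uuid>               # live migrate all running VMs away
      xe host-reboot uuid=<host-uuid>                 # reboot the host
      xe host-enable uuid=<host-uuid>                 # re-enable it once it is back up

      xe pool-ha-enable heartbeat-sr-uuids=<sr-uuid>  # re-enable HA when all hosts are done

      The important part is the same as in the list above: evacuate and reboot only one host at a time.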

      posted in Backup
    • RE: Potential bug with Windows VM backup: "Body Timeout Error"

      @andriy.sultanov said in Potential bug with Windows VM backup: "Body Timeout Error":

      xe-toolstack-restart

      Okay, I was able to replicate the issue.
      This is the setup I used that resulted in the "Body Timeout Error" previously discussed in this thread:

      OS: Windows Server 2019 Datacenter
      [screenshots: 1.png, 2.png]

      The versions of the packages in question that were used to replicate the issue (XCP-ng 8.3, fully up to date):

      [11:58 dat-xcpng-test01 ~]# rpm -q xapi-core
      xapi-core-25.27.0-2.2.xcpng8.3.x86_64
      [11:59 dat-xcpng-test01 ~]# rpm -q qcow-stream-tool
      qcow-stream-tool-25.27.0-2.2.xcpng8.3.x86_64
      [11:59 dat-xcpng-test01 ~]# rpm -q vhd-tool
      vhd-tool-25.27.0-2.2.xcpng8.3.x86_64
      

      Result:
      [screenshot: 3.png]
      Backup log:

      {
        "data": {
          "mode": "full",
          "reportWhen": "failure"
        },
        "id": "1764585634255",
        "jobId": "b19ed05e-a34f-4fab-b267-1723a7195f4e",
        "jobName": "Full-Backup-Test",
        "message": "backup",
        "scheduleId": "579d937a-cf57-47b2-8cde-4e8325422b15",
        "start": 1764585634255,
        "status": "failure",
        "infos": [
          {
            "data": {
              "vms": [
                "36c492a8-e321-ef2b-94dc-a14e5757d711"
              ]
            },
            "message": "vms"
          }
        ],
        "tasks": [
          {
            "data": {
              "type": "VM",
              "id": "36c492a8-e321-ef2b-94dc-a14e5757d711",
              "name_label": "Win2019_EN_DC_TEST"
            },
            "id": "1764585635692",
            "message": "backup VM",
            "start": 1764585635692,
            "status": "failure",
            "tasks": [
              {
                "id": "1764585635919",
                "message": "snapshot",
                "start": 1764585635919,
                "status": "success",
                "end": 1764585644161,
                "result": "0f548c1f-ce5c-56e3-0259-9c59b7851a17"
              },
              {
                "data": {
                  "id": "f1bc8d14-10dd-4440-bb1d-409b91f3b550",
                  "type": "remote",
                  "isFull": true
                },
                "id": "1764585644192",
                "message": "export",
                "start": 1764585644192,
                "status": "failure",
                "tasks": [
                  {
                    "id": "1764585644201",
                    "message": "transfer",
                    "start": 1764585644201,
                    "status": "failure",
                    "end": 1764586308921,
                    "result": {
                      "name": "BodyTimeoutError",
                      "code": "UND_ERR_BODY_TIMEOUT",
                      "message": "Body Timeout Error",
                      "stack": "BodyTimeoutError: Body Timeout Error\n    at FastTimer.onParserTimeout [as _onTimeout] (/opt/xo/xo-builds/xen-orchestra-202511080402/node_modules/undici/lib/dispatcher/client-h1.js:646:28)\n    at Timeout.onTick [as _onTimeout] (/opt/xo/xo-builds/xen-orchestra-202511080402/node_modules/undici/lib/util/timers.js:162:13)\n    at listOnTimeout (node:internal/timers:588:17)\n    at process.processTimers (node:internal/timers:523:7)"
                    }
                  }
                ],
                "end": 1764586308922,
                "result": {
                  "name": "BodyTimeoutError",
                  "code": "UND_ERR_BODY_TIMEOUT",
                  "message": "Body Timeout Error",
                  "stack": "BodyTimeoutError: Body Timeout Error\n    at FastTimer.onParserTimeout [as _onTimeout] (/opt/xo/xo-builds/xen-orchestra-202511080402/node_modules/undici/lib/dispatcher/client-h1.js:646:28)\n    at Timeout.onTick [as _onTimeout] (/opt/xo/xo-builds/xen-orchestra-202511080402/node_modules/undici/lib/util/timers.js:162:13)\n    at listOnTimeout (node:internal/timers:588:17)\n    at process.processTimers (node:internal/timers:523:7)"
                }
              },
              {
                "id": "1764586443440",
                "message": "clean-vm",
                "start": 1764586443440,
                "status": "success",
                "end": 1764586443459,
                "result": {
                  "merge": false
                }
              },
              {
                "id": "1764586443624",
                "message": "snapshot",
                "start": 1764586443624,
                "status": "success",
                "end": 1764586451966,
                "result": "c3e9736e-d6eb-3669-c7b8-f603333a83bf"
              },
              {
                "data": {
                  "id": "f1bc8d14-10dd-4440-bb1d-409b91f3b550",
                  "type": "remote",
                  "isFull": true
                },
                "id": "1764586452003",
                "message": "export",
                "start": 1764586452003,
                "status": "success",
                "tasks": [
                  {
                    "id": "1764586452008",
                    "message": "transfer",
                    "start": 1764586452008,
                    "status": "success",
                    "end": 1764586686887,
                    "result": {
                      "size": 10464489322
                    }
                  }
                ],
                "end": 1764586686900
              },
              {
                "id": "1764586690122",
                "message": "clean-vm",
                "start": 1764586690122,
                "status": "success",
                "end": 1764586690140,
                "result": {
                  "merge": false
                }
              }
            ],
            "warnings": [
              {
                "data": {
                  "attempt": 1,
                  "error": "Body Timeout Error"
                },
                "message": "Retry the VM backup due to an error"
              }
            ],
            "end": 1764586690142
          }
        ],
        "end": 1764586690143
      }
      

      I then enabled your test repository and installed the packages that you mentioned:

      [12:01 dat-xcpng-test01 ~]# rpm -q xapi-core
      xapi-core-25.27.0-2.3.0.xvafix.1.xcpng8.3.x86_64
      [12:08 dat-xcpng-test01 ~]# rpm -q vhd-tool
      vhd-tool-25.27.0-2.3.0.xvafix.1.xcpng8.3.x86_64
      [12:08 dat-xcpng-test01 ~]# rpm -q qcow-stream-tool
      qcow-stream-tool-25.27.0-2.3.0.xvafix.1.xcpng8.3.x86_64
      

      I restarted the toolstack and re-ran the backup job.
      Unfortunately it did not solve the issue and made the backup behave very strangely:
      [screenshot: 9c9e9fdc-8385-4df2-9d23-7b0e4ecee0cd-grafik.png]
      The backup job ran for only a few seconds and reported "success", but only 10.83 KiB were transferred. There are 18 GB of used space on this VM, so unfortunately the data was not transferred by the backup job.

      [screenshot: 25deccb4-295e-4ce1-a015-159780536122-grafik.png]

      Here is the backup log:

      {
        "data": {
          "mode": "full",
          "reportWhen": "failure"
        },
        "id": "1764586964999",
        "jobId": "b19ed05e-a34f-4fab-b267-1723a7195f4e",
        "jobName": "Full-Backup-Test",
        "message": "backup",
        "scheduleId": "579d937a-cf57-47b2-8cde-4e8325422b15",
        "start": 1764586964999,
        "status": "success",
        "infos": [
          {
            "data": {
              "vms": [
                "36c492a8-e321-ef2b-94dc-a14e5757d711"
              ]
            },
            "message": "vms"
          }
        ],
        "tasks": [
          {
            "data": {
              "type": "VM",
              "id": "36c492a8-e321-ef2b-94dc-a14e5757d711",
              "name_label": "Win2019_EN_DC_TEST"
            },
            "id": "1764586966983",
            "message": "backup VM",
            "start": 1764586966983,
            "status": "success",
            "tasks": [
              {
                "id": "1764586967194",
                "message": "snapshot",
                "start": 1764586967194,
                "status": "success",
                "end": 1764586975429,
                "result": "ebe5c4e2-5746-9cb3-7df6-701774a679b5"
              },
              {
                "data": {
                  "id": "f1bc8d14-10dd-4440-bb1d-409b91f3b550",
                  "type": "remote",
                  "isFull": true
                },
                "id": "1764586975453",
                "message": "export",
                "start": 1764586975453,
                "status": "success",
                "tasks": [
                  {
                    "id": "1764586975473",
                    "message": "transfer",
                    "start": 1764586975473,
                    "status": "success",
                    "end": 1764586981992,
                    "result": {
                      "size": 11093
                    }
                  }
                ],
                "end": 1764586982054
              },
              {
                "id": "1764586985271",
                "message": "clean-vm",
                "start": 1764586985271,
                "status": "success",
                "end": 1764586985290,
                "result": {
                  "merge": false
                }
              }
            ],
            "end": 1764586985291
          }
        ],
        "end": 1764586985292
      }
      

      If you need me to test something else, or if I should provide some log file from the XCP-ng system, please let me know.

      Best regards

      posted in Backup
    • RE: Potential bug with Windows VM backup: "Body Timeout Error"

      @andriy.sultanov I created a small test setup in our lab: a Windows VM with a lot of free disk space (2 virtual disks, 2.5 TB free space in total). Hopefully that way I will be able to replicate the full backup timeout issue for VMs with a lot of free space that occurred in our production environment.
      The backup job is currently running. I will report back once it has failed and I have had a chance to test whether your fix solves the issue.

      posted in Backup
    • RE: Async.VM.pool_migrate stuck at 57%

      @wmazren I had a similar issue which cost me many hours to troubleshoot.

      I'd advise you to check the "dmesg" output within the VM that cannot be live migrated.

      XCP-ng / Xen behaves differently from VMware regarding live migration.

      XCP-ng interacts with the Linux kernel upon live migration, and the kernel will try to freeze all processes before performing the migration.

      In my case a "fuse" process blocked the graceful freezing of all processes, and my live migration task was also stuck in the task view, similar to your case.

      After solving the fuse process issue, and therefore making the system able to live migrate, the issue was gone.

      All of this can be seen in dmesg, as the kernel tells you what is being done during a live migration via XCP-ng.
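      For example, right after a stuck migration attempt you can look inside the guest for freeze/suspend messages (a quick sketch, nothing XCP-ng specific):

      # inside the guest: check whether the kernel managed to freeze all tasks
      dmesg -T | grep -iE 'freez|suspend'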

      //EDIT: another thing you might want to try is toggling "migration compression" in the pool settings, as well as making sure you have a dedicated connection / VLAN configured for live migration. Those two things also made my live migrations faster and more robust.

      posted in Management
    • RE: Remote syslog broken after update/reboot? - Changing it away, then back fixes.

      @gduperrey Hey,
      I tested it and can confirm that after applying the latest set of patches and rebooting, remote syslog is still working fine.
      It appears to be fixed, good job guys 🙂

      posted in Compute
    • RE: Remote syslog broken after update/reboot? - Changing it away, then back fixes.

      @gduperrey Thanks!
      Will test tomorrow as our internal lab / test environment is currently unavailable.
      I will inform you about the results of my testing here.

      Best regards

      posted in Compute
    • RE: Xen Orchestra Node 24 compatibility

      For everyone hitting the "ENOMEM" error on a Debian 13 system when using SMB/CIFS encryption for transferring backups:
      you might want to try a newer kernel.

      I was able to solve the issue on my end by moving from kernel 6.12 (the default in Debian 13) to kernel 6.17 from Debian backports by executing

      apt install -t trixie-backports linux-image-amd64
      

      You can find a changelog for the SMB/CIFS kernel module here: https://wiki.samba.org/index.php/LinuxCIFSKernel

      In Linux kernel 6.14 they fixed the SMB encryption caching issue that was causing "ENOMEM" in Xen Orchestra on my end.
      Linux kernel 6.17 from Debian backports includes that fix.
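      After rebooting into the backports kernel, a quick check confirms which version is actually running:

      # confirm the running kernel after the reboot
      uname -r    # should report a 6.17.x version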

      Best regards

      posted in Xen Orchestra