XCP-ng

    borivoj-tydlitat

    @borivoj-tydlitat


    Latest posts made by borivoj-tydlitat

    • RE: Failed S3 backup success, should recover

      We see occasional VM backup failures that match the above pattern (ENOENT on a "path" in an S3-bucket remote destination during a VM delta backup). We are backing up a large VM (200 GB disk) to an S3 bucket on a Ceph Object Gateway. Our current XO is commit https://github.com/vatesfr/xen-orchestra/commit/da51b1649c65d7d78a4eb25cc46c488ce2552800 from 2024-03-04.

      The failure is intermittent.
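
      As a sanity check on our side, the object named in the ENOENT can be looked up in the bucket directly; an empty listing confirms the key really is missing. A minimal sketch with the AWS CLI, assuming a Ceph RGW endpoint (the endpoint URL is a placeholder; bucket and key are taken from the log below):

      # Check whether the VHD block object that the merge step complains about
      # actually exists in the bucket (endpoint URL is a placeholder for the RGW).
      aws s3 ls --endpoint-url https://rgw.example.internal \
        "s3://xcp-ng-mama-pool-vm-backup/xo-vm-backups/c020890b-c1a7-79cf-0b55-aa340ba9226b/vdis/49c2634a-f281-46b7-9192-aea8c786f3d4/378753c5-9072-47b7-b879-0488a86e85ba/data/b70bc690-28c7-440a-8147-bf6d61c179de.vhd/blocks/10/52"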

      Example failure log:

      {
        "data": {
          "mode": "delta",
          "reportWhen": "always"
        },
        "id": "1712008868496",
        "jobId": "49c2634a-f281-46b7-9192-aea8c786f3d4",
        "jobName": "ZABBIX VM to ceph02 Delta Backup",
        "message": "backup",
        "scheduleId": "6b1280f0-21e7-475c-9af3-330726108568",
        "start": 1712008868496,
        "status": "failure",
        "infos": [
          {
            "data": {
              "vms": [
                "c020890b-c1a7-79cf-0b55-aa340ba9226b"
              ]
            },
            "message": "vms"
          }
        ],
        "tasks": [
          {
            "data": {
              "type": "VM",
              "id": "c020890b-c1a7-79cf-0b55-aa340ba9226b",
              "name_label": "zabbix"
            },
            "id": "1712008870266",
            "message": "backup VM",
            "start": 1712008870266,
            "status": "failure",
            "tasks": [
              {
                "id": "1712008870283",
                "message": "clean-vm",
                "start": 1712008870283,
                "status": "success",
                "end": 1712008872364,
                "result": {
                  "merge": false
                }
              },
              {
                "id": "1712008872791",
                "message": "snapshot",
                "start": 1712008872791,
                "status": "success",
                "end": 1712008873901,
                "result": "902d65f0-8c9b-2460-dad7-d041ceefebdc"
              },
              {
                "data": {
                  "id": "d8d6a743-8c15-4b9a-baf4-df5d7d379b6e",
                  "isFull": false,
                  "type": "remote"
                },
                "id": "1712008873919",
                "message": "export",
                "start": 1712008873919,
                "status": "failure",
                "tasks": [
                  {
                    "id": "1712008875215",
                    "message": "transfer",
                    "start": 1712008875215,
                    "status": "success",
                    "end": 1712009333076,
                    "result": {
                      "size": 35343973376
                    }
                  },
                  {
                    "id": "1712009335152",
                    "message": "clean-vm",
                    "start": 1712009335152,
                    "status": "failure",
                    "tasks": [
                      {
                        "id": "1712009336153",
                        "message": "merge",
                        "start": 1712009336153,
                        "status": "failure",
                        "end": 1712009585029,
                        "result": {
                          "cause": {
                            "name": "NoSuchKey",
                            "$fault": "client",
                            "$metadata": {
                              "httpStatusCode": 404,
                              "requestId": "tx00000c27bed21222dc684-00660b3170-302aa3c-s3",
                              "attempts": 1,
                              "totalRetryDelay": 0
                            },
                            "Code": "NoSuchKey",
                            "BucketName": "xcp-ng-mama-pool-vm-backup",
                            "RequestId": "tx00000c27bed21222dc684-00660b3170-302aa3c-s3",
                            "HostId": "302aa3c-s3-s3",
                            "message": "UnknownError"
                          },
                          "code": "ENOENT",
                          "path": "/xo-vm-backups/c020890b-c1a7-79cf-0b55-aa340ba9226b/vdis/49c2634a-f281-46b7-9192-aea8c786f3d4/378753c5-9072-47b7-b879-0488a86e85ba/data/b70bc690-28c7-440a-8147-bf6d61c179de.vhd/blocks/10/52",
                          "message": "ENOENT: no such file or directory '/xo-vm-backups/c020890b-c1a7-79cf-0b55-aa340ba9226b/vdis/49c2634a-f281-46b7-9192-aea8c786f3d4/378753c5-9072-47b7-b879-0488a86e85ba/data/b70bc690-28c7-440a-8147-bf6d61c179de.vhd/blocks/10/52'",
                          "name": "Error",
                          "stack": "Error: ENOENT: no such file or directory '/xo-vm-backups/c020890b-c1a7-79cf-0b55-aa340ba9226b/vdis/49c2634a-f281-46b7-9192-aea8c786f3d4/378753c5-9072-47b7-b879-0488a86e85ba/data/b70bc690-28c7-440a-8147-bf6d61c179de.vhd/blocks/10/52'\n    at S3Handler._copy (/opt/xo/xo-builds/xen-orchestra-202403041050/@xen-orchestra/fs/dist/s3.js:157:23)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"
                        }
                      }
                    ],
                    "end": 1712009585029,
                    "result": {
                      "cause": {
                        "name": "NoSuchKey",
                        "$fault": "client",
                        "$metadata": {
                          "httpStatusCode": 404,
                          "requestId": "tx00000c27bed21222dc684-00660b3170-302aa3c-s3",
                          "attempts": 1,
                          "totalRetryDelay": 0
                        },
                        "Code": "NoSuchKey",
                        "BucketName": "xcp-ng-mama-pool-vm-backup",
                        "RequestId": "tx00000c27bed21222dc684-00660b3170-302aa3c-s3",
                        "HostId": "302aa3c-s3-s3",
                        "message": "UnknownError"
                      },
                      "code": "ENOENT",
                      "path": "/xo-vm-backups/c020890b-c1a7-79cf-0b55-aa340ba9226b/vdis/49c2634a-f281-46b7-9192-aea8c786f3d4/378753c5-9072-47b7-b879-0488a86e85ba/data/b70bc690-28c7-440a-8147-bf6d61c179de.vhd/blocks/10/52",
                      "message": "ENOENT: no such file or directory '/xo-vm-backups/c020890b-c1a7-79cf-0b55-aa340ba9226b/vdis/49c2634a-f281-46b7-9192-aea8c786f3d4/378753c5-9072-47b7-b879-0488a86e85ba/data/b70bc690-28c7-440a-8147-bf6d61c179de.vhd/blocks/10/52'",
                      "name": "Error",
                      "stack": "Error: ENOENT: no such file or directory '/xo-vm-backups/c020890b-c1a7-79cf-0b55-aa340ba9226b/vdis/49c2634a-f281-46b7-9192-aea8c786f3d4/378753c5-9072-47b7-b879-0488a86e85ba/data/b70bc690-28c7-440a-8147-bf6d61c179de.vhd/blocks/10/52'\n    at S3Handler._copy (/opt/xo/xo-builds/xen-orchestra-202403041050/@xen-orchestra/fs/dist/s3.js:157:23)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"
                    }
                  }
                ],
                "end": 1712009585031
              }
            ],
            "end": 1712009585032
          }
        ],
        "end": 1712009585032
      }
      

      I know I should be on the current XO build before reporting a problem. I'm not asking for a fix now, just documenting that this defect still exists.

      posted in Xen Orchestra
    • RE: Update strategy for a consistent XCP-ng pool

      @stormi and @olivierlambert, thank you for your advice.

      I did some exploration on my side, too, and I think we have two workable strategies:

      1. Use a reposync mirror of the xcp-ng-base and xcp-ng-updates repos on a shared filesystem visible to all hosts. Sync it, update the master, stop syncing, and gradually update the remaining hosts from the frozen mirror.

      2. Use a variation of the rpm -qa-based approach discussed earlier: update the master, capture its package state with rpm -qa > reference.pkglist, then run yum upgrade-to $(cat reference.pkglist) on each of the remaining hosts. Afterwards, check with yum check-update or yum --assumeno upgrade for irregularities (e.g. packages installed only on some hosts) and resolve those manually. (A rough sketch of both strategies follows below.)
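
      A rough shell sketch of both strategies, as I currently understand them (repo IDs, paths and the shared mount point are illustrative, not tested commands):

      # Strategy 1: freeze the update repos in a reposync mirror on shared storage,
      # then point each host's yum configuration at the mirror and update from it.
      reposync --repoid=xcp-ng-base --repoid=xcp-ng-updates \
               --download_path=/mnt/shared/xcpng-mirror
      createrepo /mnt/shared/xcpng-mirror/xcp-ng-updates
      # (each host's .repo file then gets baseurl=file:///mnt/shared/xcpng-mirror/...)

      # Strategy 2: update the master, record its package set, replay it elsewhere.
      # On the already-updated pool master:
      rpm -qa > /mnt/shared/reference.pkglist
      # On each remaining host:
      yum upgrade-to $(cat /mnt/shared/reference.pkglist)
      # Anything still flagged here (e.g. packages installed only on this host)
      # gets resolved manually:
      yum check-update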

      That's a good point about a pool staying in a heterogeneous state for too long; we will definitely reconsider our maintenance procedures.
      We will try this approach during our upcoming maintenance and report back here on how it went.

      posted in XCP-ng
    • RE: Update strategy for a consistent XCP-ng pool

      Of course, we can also try negotiating a longer window with the business / management so that the entire pool update fits into it. But that may not solve the whole problem, as various other preparation processes need to happen between maintenance windows. Also, breaking a pool apart is inconvenient: it complicates management and reduces our options for moving VMs around (we make limited but essential use of that). I am asking here in the hope that we can find a technical solution within the existing XCP-ng features.

      posted in XCP-ng
    • RE: Update strategy for a consistent XCP-ng pool

      Hi @olivierlambert - the reason is that we typically bundle the physical host reboot with other updates (e.g. host firmware, or software running in the host's VMs). Also, the software stack running in the VMs on the hosts often requires special care when shutting down (for example Kubernetes node VMs running production workloads where some components are a bit fragile, or a Ceph filesystem, which is HA but may take a long time to recover after a node is taken down). In many cases we also cannot use VM migration, especially for VMs using large local storage. So far the procedure has been to schedule a 2-hour maintenance window every week, which typically allows us to update 2-3 hosts. I have read this post https://xcp-ng.org/forum/topic/7200/patching-to-a-specific-version/4, but digging into the behavior of yum update, it looks like it cannot update to a specific version (unlike yum install).
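
      For reference, the kind of version pinning that does work with yum install (the package name and version-release below are purely illustrative):

      # 'yum install' accepts an explicit name-version-release and will install or
      # upgrade to exactly that build; plain 'yum update <pkg>' (no version given)
      # always goes to the newest version available in the enabled repos.
      yum install xcp-ng-somepackage-1.2.3-1.xcpng8.2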

      posted in XCP-ng
    • Update strategy for a consistent XCP-ng pool

      Hello XCP-ng community,

      We are running XCP-ng on 9 hosts in 3 pools. To maintain continuous operation of the cluster, we perform rolling updates for security and other fixes, one host at a time, in a weekly maintenance window. The whole process typically takes 3-4 windows, i.e. spans 2-3 weeks. If a new update is published during that time, version skew can occur between components installed on different hosts, and we have already had a case where such skew disrupted cluster operation: specifically, VM backup via Xen Orchestra stopped working. (And yes, we did follow the documentation and upgraded the pool master first.)

      Is there a good practice for this kind of scenario, to make sure that the update cycle results in consistent versions installed across the cluster? I can imagine recording the package versions installed by yum upgrade on the first host and then scripting the updates on the subsequent hosts to use the same versions, but maybe there is a better way?

      Thank you.

      posted in XCP-ng