
    Backups started to fail again (overall status: failure, but both snapshot and transfer return success)

    • peo

      Got these backup failures again. Usually it's only the "Docker" VM, but now all backups give the status mentioned in the topic title. Below is one of the examples.
      I have not updated Xen Orchestra in a "long" time; I'm on c8f9d81, which was current as of 3 July.
      My hosts are fully updated, as well as the VM running XO.
      The first non-Docker-VM failure appeared before I updated the hosts.
      Anything you want to investigate, or should I just update XO and hope for these errors to stop?

      {
        "data": {
          "mode": "delta",
          "reportWhen": "failure"
        },
        "id": "1753140173983",
        "jobId": "38f0068f-c124-4876-85d3-83f1003db60c",
        "jobName": "HomeAssistant",
        "message": "backup",
        "scheduleId": "dcb1c759-76b8-441b-9dc0-595914e60608",
        "start": 1753140173983,
        "status": "failure",
        "infos": [
          {
            "data": {
              "vms": [
                "ed4758f3-de34-7a7e-a46b-dc007d52f5c3"
              ]
            },
            "message": "vms"
          }
        ],
        "tasks": [
          {
            "data": {
              "type": "VM",
              "id": "ed4758f3-de34-7a7e-a46b-dc007d52f5c3",
              "name_label": "HomeAssistant"
            },
            "id": "1753140251984",
            "message": "backup VM",
            "start": 1753140251984,
            "status": "failure",
            "tasks": [
              {
                "id": "1753140251993",
                "message": "clean-vm",
                "start": 1753140251993,
                "status": "success",
                "end": 1753140258038,
                "result": {
                  "merge": false
                }
              },
              {
                "id": "1753140354122",
                "message": "snapshot",
                "start": 1753140354122,
                "status": "success",
                "end": 1753140356461,
                "result": "fc6d5d87-a2b5-cae9-8c2a-377ffff5febc"
              },
              {
                "data": {
                  "id": "2b919467-704c-4e35-bac9-2d6a43118bda",
                  "isFull": false,
                  "type": "remote"
                },
                "id": "1753140356462",
                "message": "export",
                "start": 1753140356462,
                "status": "failure",
                "tasks": [
                  {
                    "id": "1753140359386",
                    "message": "transfer",
                    "start": 1753140359386,
                    "status": "success",
                    "end": 1753140753378,
                    "result": {
                      "size": 5630853120
                    }
                  },
                  {
                    "id": "1753140761602",
                    "message": "clean-vm",
                    "start": 1753140761602,
                    "status": "failure",
                    "end": 1753140775782,
                    "result": {
                      "name": "InternalError",
                      "$fault": "client",
                      "$metadata": {
                        "httpStatusCode": 500,
                        "requestId": "D98294C01B729C95",
                        "extendedRequestId": "RDk4Mjk0QzAxQjcyOUM5NUQ5ODI5NEMwMUI3MjlDOTVEOTgyOTRDMDFCNzI5Qzk1RDk4Mjk0QzAxQjcyOUM5NQ==",
                        "attempts": 3,
                        "totalRetryDelay": 112
                      },
                      "Code": "InternalError",
                      "message": "Internal Error",
                      "stack": "InternalError: Internal Error\n    at throwDefaultError (/opt/xo/xo-builds/xen-orchestra-202507041243/node_modules/@smithy/smithy-client/dist-cjs/index.js:867:20)\n    at /opt/xo/xo-builds/xen-orchestra-202507041243/node_modules/@smithy/smithy-client/dist-cjs/index.js:876:5\n    at de_CommandError (/opt/xo/xo-builds/xen-orchestra-202507041243/node_modules/@aws-sdk/client-s3/dist-cjs/index.js:4952:14)\n    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n    at async /opt/xo/xo-builds/xen-orchestra-202507041243/node_modules/@smithy/middleware-serde/dist-cjs/index.js:35:20\n    at async /opt/xo/xo-builds/xen-orchestra-202507041243/node_modules/@aws-sdk/middleware-sdk-s3/dist-cjs/index.js:484:18\n    at async /opt/xo/xo-builds/xen-orchestra-202507041243/node_modules/@smithy/middleware-retry/dist-cjs/index.js:320:38\n    at async /opt/xo/xo-builds/xen-orchestra-202507041243/node_modules/@aws-sdk/middleware-sdk-s3/dist-cjs/index.js:110:22\n    at async /opt/xo/xo-builds/xen-orchestra-202507041243/node_modules/@aws-sdk/middleware-sdk-s3/dist-cjs/index.js:137:14\n    at async /opt/xo/xo-builds/xen-orchestra-202507041243/node_modules/@aws-sdk/middleware-logger/dist-cjs/index.js:33:22"
                    }
                  }
                ],
                "end": 1753140775783
              }
            ],
            "end": 1753140775783
          }
        ],
        "end": 1753140775784
      }
      
      • olivierlambert (Vates 🪐 Co-Founder CEO)

        Hi,

        We can't help with outdated XO commits, it's impossible to do so, see https://docs.xen-orchestra.com/community#report-a-bug. Please update to the latest commit and report back if you still have the issue, thanks!

        edit: note that the error is likely not on the XO side here; we got an HTTP 500.
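
        For reference, one way to see whether the remote itself is throwing 5xx errors, independently of XO, is to hit the same bucket with the AWS CLI. The endpoint, bucket and prefix below are placeholders; delta backups normally live under xo-vm-backups/ on the remote, and clean-vm mostly does list/delete calls there:

        ENDPOINT="https://s3.example.net"   # placeholder: the S3-compatible endpoint of the XO remote
        BUCKET="my-backup-bucket"           # placeholder: the bucket configured on the remote

        # read path: list a few backup objects
        aws s3api list-objects-v2 --endpoint-url "$ENDPOINT" --bucket "$BUCKET" \
            --prefix xo-vm-backups/ --max-keys 10

        # write/delete path: clean-vm also deletes and rewrites metadata
        echo probe > /tmp/xo-probe
        aws s3api put-object --endpoint-url "$ENDPOINT" --bucket "$BUCKET" --key xo-probe --body /tmp/xo-probe
        aws s3api delete-object --endpoint-url "$ENDPOINT" --bucket "$BUCKET" --key xo-probe

        If these calls also return InternalError/500 now and then, the problem sits on the storage side (or the path to it), which would match the three retries visible in the log.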

        • peo @olivierlambert

          @olivierlambert Thanks, I will update every machine and XO instance involved in the backup process, and possibly even the individual VMs that fail. The first failure on clean-vm was 15 July, a few days before I patched the hosts (as part of troubleshooting and preventing further failures). Still, these backups will (probably) be fully restorable, as I have tested with the always-failing Docker VM.

          • DustinB @peo

            @peo said in Backups started to fail again (overall status: failure, but both snapshot and transfer return success):

            @olivierlambert Thanks, I will update every machine and XO instance involved in the backup process, and possibly even the individual VMs that fail. The first failure on clean-vm was 15 July, a few days before I patched the hosts (as part of troubleshooting and preventing further failures). Still, these backups will (probably) be fully restorable, as I have tested with the always-failing Docker VM.

            So you patch your hosts, but not the administrative tools for those hosts?

            Seems a little cart before the horse there, no?

            • peo @DustinB

              @DustinB said in Backups started to fail again (overall status: failure, but both snapshot and transfer return success):

              @peo said in Backups started to fail again (overall status: failure, but both snapshot and transfer return success):

              @olivierlambert Thanks, I will update every machine and XO instance involved in the backup process, and possibly even the individual VMs that fail. The first failure on clean-vm was 15 July, a few days before I patched the hosts (as part of troubleshooting and preventing further failures). Still, these backups will (probably) be fully restorable, as I have tested with the always-failing Docker VM.

              So you patch your hosts, but not the administrative tools for those hosts?

              Seems a little cart before the horse there, no?

              That's a fault-finding procedure: don't patch everything at once. (But now I have, after finding out that patching the hosts did not solve the problem.)

              • peo @peo

                Even though I updated 'everything' involved yesterday, the problems remain (last night's backups failed with a similar problem). As I'm again 6 commits behind the current version, I cannot create a useful bug report, so I'll just update and wait for the next scheduled backups to run (nothing runs during the night towards Thursday; the next sequence runs during the night towards Friday).

                • peo @peo

                  Since yesterday, even the replication jobs have started to fail (I'm again 12 commits behind the current version, but other scheduled jobs kept failing even when I was up to date with XO).

                  The replication is set to run from one host and store on the SSD of another. I had a power failure yesterday, but both hosts needed for this job (xcp-ng-1 and xcp-ng-2) were back up and running by the time the job started.

                  {
                    "data": {
                      "mode": "delta",
                      "reportWhen": "failure"
                    },
                    "id": "1753705802804",
                    "jobId": "0bb53ced-4d52-40a9-8b14-7cd1fa2b30fe",
                    "jobName": "Admin Ubuntu 24",
                    "message": "backup",
                    "scheduleId": "69a05a67-c43b-4d23-b1e8-ada77c70ccc4",
                    "start": 1753705802804,
                    "status": "failure",
                    "infos": [
                      {
                        "data": {
                          "vms": [
                            "1728e876-5644-2169-6c62-c764bd8b6bdf"
                          ]
                        },
                        "message": "vms"
                      }
                    ],
                    "tasks": [
                      {
                        "data": {
                          "type": "VM",
                          "id": "1728e876-5644-2169-6c62-c764bd8b6bdf",
                          "name_label": "Admin Ubuntu 24"
                        },
                        "id": "1753705804503",
                        "message": "backup VM",
                        "start": 1753705804503,
                        "status": "failure",
                        "tasks": [
                          {
                            "id": "1753705804984",
                            "message": "snapshot",
                            "start": 1753705804984,
                            "status": "success",
                            "end": 1753712867640,
                            "result": "4afbdcd9-818f-9e3d-555a-ad0943081c3f"
                          },
                          {
                            "data": {
                              "id": "46f9b5ee-c937-ff71-29b1-520ba0546675",
                              "isFull": false,
                              "name_label": "Local h2 SSD",
                              "type": "SR"
                            },
                            "id": "1753712867640:0",
                            "message": "export",
                            "start": 1753712867640,
                            "status": "interrupted"
                          }
                        ],
                        "infos": [
                          {
                            "message": "will delete snapshot data"
                          },
                          {
                            "data": {
                              "vdiRef": "OpaqueRef:c2504c79-d422-3f0a-d292-169d431e5aee"
                            },
                            "message": "Snapshot data has been deleted"
                          }
                        ],
                        "end": 1753717484618,
                        "result": {
                          "name": "BodyTimeoutError",
                          "code": "UND_ERR_BODY_TIMEOUT",
                          "message": "Body Timeout Error",
                          "stack": "BodyTimeoutError: Body Timeout Error\n    at FastTimer.onParserTimeout [as _onTimeout] (/opt/xo/xo-builds/xen-orchestra-202507262229/node_modules/undici/lib/dispatcher/client-h1.js:646:28)\n    at Timeout.onTick [as _onTimeout] (/opt/xo/xo-builds/xen-orchestra-202507262229/node_modules/undici/lib/util/timers.js:162:13)\n    at listOnTimeout (node:internal/timers:588:17)\n    at process.processTimers (node:internal/timers:523:7)"
                        }
                      }
                    ],
                    "end": 1753717484619
                  }
                  

                  Also, the replication job for my Debian XO machine fails with the same 'timeout' problem.
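
                  A side note on the log above: the id / start / end fields are Unix epoch milliseconds, so it is easy to see where the time went (GNU date, values copied straight from the log):

                  date -d @$((1753705804984/1000))   # snapshot start
                  date -d @$((1753712867640/1000))   # snapshot end = export start
                  date -d @$((1753717484618/1000))   # job end (BodyTimeoutError)
                  echo $(( (1753712867640 - 1753705804984) / 1000 ))   # 7062 s, almost two hours in "snapshot"
                  echo $(( (1753717484618 - 1753712867640) / 1000 ))   # 4616 s, about 77 minutes in "export" before the timeout

                  A snapshot that takes two hours, followed by an export that stalls until undici gives up waiting for body data (around 300 s of silence by default, if I read undici right), points more at the source SR or host struggling than at the transfer itself.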

                  • olivierlambert (Vates 🪐 Co-Founder CEO)

                    Should the timeout be raised, @florent?

                    • peo @olivierlambert

                      @olivierlambert I found a "solution" to the problem by just rebooting the two involved hosts, but this might still be an issue somewhere (XO or even xcp-ng):

                      By the time I started up the hosts after the power failure, their dependencies had already been up for a long while (mainly my internet connectivity and the NAS which holds one of the SRs). All three hosts also have a local 2 TB SSD, used for different purposes (faster disk access, temporary storage and replication from other hosts).

                      I actually forgot to reconnect the network cable to the third host (not involved in these recent problems); it was unplugged because I reorganized the cables to the switch at the same time. That host apparently didn't start up properly either (at least I got no video output when I went to check its status after connecting the cable), so I gave it a hard reboot and got it up and running.

                      Machines with their disks on the local SSDs of the two other hosts have worked fine since I powered them up, so what follows (and the replication issue) was not expected at all:

                      Both 'df' and 'ls /run/sr-mount/' locked up:

                      [11:21 xcp-ng-1 ~]# df -h
                      ^C
                      [11:21 xcp-ng-1 ~]# ^C
                      
                      [11:21 xcp-ng-1 ~]# ls /run/sr-mount/
                      ^C
                      [11:22 xcp-ng-1 ~]# ls /run/
                      

                      ('ls /run/' worked fine)

                      According to XO the disks were accessible and their content showed up as usual.
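
                      In case it happens again, here is a way to find which mount is wedged without the shell hanging (nothing XO-specific, just dom0 coreutils; probes against a dead mount simply never report back):

                      # probe every mount point in the background; wedged ones never print OK
                      awk '{print $2}' /proc/mounts | while read -r m; do
                          ( stat -f "$m" >/dev/null 2>&1 && echo "OK: $m" ) &
                      done
                      sleep 5; echo "--- anything not reported OK above is suspect ---"

                      # processes stuck in uninterruptible sleep (state D) usually point at the dead mount too
                      ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

                      (Background probes that hit a dead NFS/SMB mount will linger in D state until the server answers again; that is harmless for a one-off check.)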

                      • olivierlambert (Vates 🪐 Co-Founder CEO)

                        Smells like a blocked SR. Do you have any NFS or SMB SR that's not responsive?
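
                        If it helps, this is quick to check from the pool master with the standard xe CLI (the type filters below are just the usual NFS/SMB SR types):

                        xe sr-list type=nfs params=uuid,name-label,type
                        xe sr-list type=smb params=uuid,name-label,type
                        # and whether every PBD is actually attached on every host
                        xe pbd-list params=sr-name-label,host-name-label,currently-attached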

                        • peo @olivierlambert

                          @olivierlambert No, and all VMs were working right up until I rebooted the two hosts (not the third one, since that one didn't have problems accessing /run/sr-mount/).

                          I understand that 'df' will lock up if an NFS or SMB share does not respond, but 'ls /run/sr-mount/' (without trying to access a subfolder) should have no reason to lock up (unless /run/sr-mount is not an ordinary folder, which it seems to be).
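
                          One likely explanation for the ls hang: on a CentOS-based dom0, ls is normally aliased to 'ls --color=auto', which lstat()s every directory entry to pick a colour, and that stat is what blocks on a dead mountpoint under /run/sr-mount. A readdir-only listing should come back even with a wedged mount below (a small test, assuming the usual alias):

                          type ls                            # shows whether ls is aliased to 'ls --color=auto'
                          \ls --color=never /run/sr-mount/   # prints the names without stat()ing the entries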

                          • olivierlambert (Vates 🪐 Co-Founder CEO)

                            Check dmesg, there's something blocking the mount listing. An old ISO SR maybe?
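
                            For the dmesg side, the usual suspects show up with something like (exact wording depends on the kernel):

                            dmesg -T | grep -iE 'nfs|cifs|hung|blocked'
                            grep -E ' (nfs|nfs4|cifs) ' /proc/mounts   # network mounts (ISO SRs included) still listed on the host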
