    S3 backup not retrying after error

    Xen Orchestra
    • olivierlambert (Vates 🪐 Co-Founder, CEO)

      I don't think we've had many issues reported on AWS; our concern is with "similar" implementations that don't behave the same way.

      • florent (Vates 🪐 XO Team), replying to @julien-f

        @julien-f I didn't find a debug option, but we can plug in a custom logger. I did a quick test with a direct console.log/log.info, and the volume of data logged is huge. I'm trying to make a more reasonable logger.
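
        A minimal sketch of what such a logger could look like, assuming the AWS SDK v3 client config used by @xen-orchestra/fs (the logger option and its debug/info/warn/error methods are the SDK's; the filtering itself is hypothetical, not XO code):

          const { S3Client } = require('@aws-sdk/client-s3')

          // Hypothetical filtering logger: drop the very chatty debug/info
          // output and forward only warnings and errors.
          const quietLogger = {
            debug: () => {},
            info: () => {},
            warn: (...args) => console.warn('[s3]', ...args),
            error: (...args) => console.error('[s3]', ...args),
          }

          const client = new S3Client({
            logger: quietLogger,
            // region, endpoint, credentials, etc. as usual
          })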

        • Andrew (Top contributor), replying to @olivierlambert

          @olivierlambert @florent I chose Wasabi because of cost. I understand that non-AWS S3 vendors may not always work as expected, but Wasabi is not insignificant.... Local testing against MinIO S3 has never given me any problems.

          With tens of thousands of S3 operations, I only want to know about the one that failed so I can yell at Wasabi and get them to fix their issue. I have recently been having more nightly issues with them (the same retry failures).

          So it's a twofold issue: Wasabi is causing problems, and XO needs to be more tolerant of problems.

          • olivierlambert (Vates 🪐 Co-Founder, CEO)

            I'm not saying to use only AWS 😉 I'm just telling you that we have no errors reported by AWS customers on S3. That could be for a lot of reasons.

            We have invested 10k€ in test hardware to install our own "decent" S3 cluster, so we are serious about making this work on other platforms. But as you can see, there's a limit to the number of platforms we can test. Indeed, the solution is to get more debug output, but it's not enjoyable to lose time on potential issues caused by third-party S3 providers 😕

            • florent (Vates 🪐 XO Team), replying to @Andrew

              @Andrew I am sure we'll be able to tune XO to be a little more robust in the near future.

              • florent (Vates 🪐 XO Team), replying to @Andrew

                @Andrew Hi,

                I made a PR that retries 5 times even on a 400 error. Can you test it?
                https://github.com/vatesfr/xen-orchestra/pull/6433

                Regards
                Florent

                fbeauchamp opened pull request #6433 in vatesfr/xen-orchestra: fix(s3): retry upload even on error 400 (closed)
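
                Not the PR's actual diff, just a minimal sketch of how such a policy can be expressed with the SDK: StandardRetryStrategy and defaultRetryDecider are real @aws-sdk/middleware-retry exports, while the wiring below is an assumption:

                  const { S3Client } = require('@aws-sdk/client-s3')
                  const { StandardRetryStrategy, defaultRetryDecider } = require('@aws-sdk/middleware-retry')

                  // Retry up to 5 attempts, and treat HTTP 400 as retryable in
                  // addition to the errors the SDK already considers transient.
                  const retryStrategy = new StandardRetryStrategy(
                    async () => 5, // maxAttempts provider
                    {
                      retryDecider: error =>
                        defaultRetryDecider(error) ||
                        error?.$metadata?.httpStatusCode === 400,
                    }
                  )

                  const client = new S3Client({ retryStrategy })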

                • Andrew (Top contributor), replying to @florent

                  @florent Thanks. I'm running it. I'll report after a few days.

                  • Andrew (Top contributor), replying to @florent

                    @florent Last night's failure (commit 81ae8)...

                        {
                          "data": {
                            "type": "VM",
                            "id": "f80fdf51-65e5-132d-bb2a-936bbd2814fc"
                          },
                          "id": "1663912365483:2",
                          "message": "backup VM",
                          "start": 1663912365483,
                          "status": "failure",
                          "tasks": [
                            {
                              "id": "1663912365570",
                              "message": "clean-vm",
                              "start": 1663912365570,
                              "status": "failure",
                              "end": 1663912403372,
                              "result": {
                                "name": "InternalError",
                                "$fault": "client",
                                "$metadata": {
                                  "httpStatusCode": 500,
                                  "extendedRequestId": "jOYV90/W5XHJFnOq1mlfpaMT/T9EV4/EnSluEni+p9TJQykrtI0cJMntJqFThy/PvX/LN0XX4xXS",
                                  "attempts": 3,
                                  "totalRetryDelay": 369
                                },
                                "Code": "InternalError",
                                "Detail": "None:UnexpectedError",
                                "RequestId": "85780FD1B7DFCB7C",
                                "HostId": "jOYV90/W5XHJFnOq1mlfpaMT/T9EV4/EnSluEni+p9TJQykrtI0cJMntJqFThy/PvX/LN0XX4xXS",
                                "message": "We encountered an internal error.  Please retry the operation again later.",
                                "stack": "InternalError: We encountered an internal error.  Please retry the operation again later.\n    at throwDefaultError (/opt/xo/xo-builds/xen-orchestra-202209221033/node_modules/@aws-sdk/smithy-client/dist-cjs/default-error-handler.js:8:22)\n    at deserializeAws_restXmlGetObjectCommandError (/opt/xo/xo-builds/xen-orchestra-202209221033/node_modules/@aws-sdk/client-s3/dist-cjs/protocols/Aws_restXml.js:4356:51)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async /opt/xo/xo-builds/xen-orchestra-202209221033/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24\n    at async /opt/xo/xo-builds/xen-orchestra-202209221033/node_modules/@aws-sdk/middleware-signing/dist-cjs/middleware.js:11:20\n    at async StandardRetryStrategy.retry (/opt/xo/xo-builds/xen-orchestra-202209221033/node_modules/@aws-sdk/middleware-retry/dist-cjs/StandardRetryStrategy.js:51:46)\n    at async /opt/xo/xo-builds/xen-orchestra-202209221033/node_modules/@aws-sdk/middleware-flexible-checksums/dist-cjs/flexibleChecksumsMiddleware.js:56:20\n    at async /opt/xo/xo-builds/xen-orchestra-202209221033/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:6:22\n    at async S3Handler._createReadStream (/opt/xo/xo-builds/xen-orchestra-202209221033/@xen-orchestra/fs/dist/s3.js:261:15)\n    at async S3Handler.readFile (/opt/xo/xo-builds/xen-orchestra-202209221033/@xen-orchestra/fs/dist/abstract.js:326:18)"
                              }
                            },
                            {
                              "id": "1663912517635",
                              "message": "snapshot",
                              "start": 1663912517635,
                              "status": "success",
                              "end": 1663912520335,
                              "result": "85b00101-5704-c847-8c91-8806195154b4"
                            },
                            {
                              "data": {
                                "id": "db9ad0a8-bce6-4a2b-b9fd-5c4cecf059c4",
                                "isFull": false,
                                "type": "remote"
                              },
                              "id": "1663912520336",
                              "message": "export",
                              "start": 1663912520336,
                              "status": "success",
                              "tasks": [
                                {
                                  "id": "1663912520634",
                                  "message": "transfer",
                                  "start": 1663912520634,
                                  "status": "success",
                                  "end": 1663912549741,
                                  "result": {
                                    "size": 251742720
                                  }
                                },
                                {
                                  "id": "1663912551469",
                                  "message": "clean-vm",
                                  "start": 1663912551469,
                                  "status": "success",
                                  "end": 1663912629752,
                                  "result": {
                                    "merge": false
                                  }
                                }
                              ],
                              "end": 1663912629752
                            }
                          ],
                          "end": 1663912629752
                        }
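
                      Note that this failure is on a read (GetObjectCommand via S3Handler._createReadStream → readFile), not an upload, and $metadata shows attempts: 3, the SDK's default, so the PR's upload retry would not cover it. A hypothetical outer retry around such reads (not XO code) could look like:

                        // Hypothetical helper: retry a call with exponential backoff
                        // when the SDK itself has given up on a transient error.
                        async function withRetry (fn, { attempts = 5, baseDelay = 1000 } = {}) {
                          for (let i = 0; ; i++) {
                            try {
                              return await fn()
                            } catch (error) {
                              const status = error?.$metadata?.httpStatusCode
                              // only retry server-side (5xx) errors, at most `attempts` times
                              if (i + 1 >= attempts || !(status >= 500)) {
                                throw error
                              }
                              await new Promise(resolve => setTimeout(resolve, baseDelay * 2 ** i))
                            }
                          }
                        }

                        // e.g.: await withRetry(() => handler.readFile(path))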
                    
                    • florent (Vates 🪐 XO Team), replying to @Andrew

                      @Andrew Have you got any "retrying writing file" messages in the logs?

                      • Andrew (Top contributor), replying to @florent

                        @florent No, I did not see that in the logs. I did see that this problem is bigger than I thought.

                        It happens more often than just causing a VM backup failure. It also happens during the merge or other checks, which causes the backup process to destroy (remove) parts of other VM backups.

                         Clean VM directory 
                         parent VHD is missing
                         parent VHD is missing
                         parent VHD is missing
                         some VHDs linked to the backup are missing
                         some VHDs linked to the backup are missing
                         some VHDs linked to the backup are missing
                         some VHDs linked to the backup are missing
                        

                        and

                         Clean VM directory 
                         VHD check error
                         some VHDs linked to the backup are missing
                        
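
                        This is the dangerous failure mode described above: if a transient S3 error during a check is read as "VHD missing", cleanup can delete healthy backups. A hypothetical sketch (not XO's actual check) of telling a genuinely missing object apart from a transient failure:

                          const { HeadObjectCommand } = require('@aws-sdk/client-s3')

                          // Hypothetical: only treat a VHD as missing on a definite 404;
                          // propagate transient errors (e.g. 500) instead of deleting anything.
                          async function vhdExists (client, bucket, key) {
                            try {
                              await client.send(new HeadObjectCommand({ Bucket: bucket, Key: key }))
                              return true
                            } catch (error) {
                              if (error?.$metadata?.httpStatusCode === 404) {
                                return false // really gone: safe to flag as missing
                              }
                              throw error // transient: do not conclude the VHD is missing
                            }
                          }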