XCP-ng

    VDI Chain on Deltas

    • olivierlambert (Vates 🪐 Co-Founder CEO)

      Hi,

      Can you be more specific on what's going on exactly?

      • nvoss @olivierlambert

        @olivierlambert Sure I can try.

        I can confirm now that both my full and my delta jobs fail for every single VM with the "Job canceled to protect the VDI chain" error.

        [screenshot of the backup job error]

        If we do a standard restart of the job, it fails the same way. If we use the "force restart" option, it works properly and the backups finish without issue.

        The remote configuration is brand new, using encrypted remotes with the multiple-data-blocks option selected. The backup job itself is not new; it has been in place for about a year. The job uses VM tags to determine which VMs to back up. The full backup runs weekly with 6 retained backups and goes to both the external remote and the local one. The delta goes only to the local Synology and retains 14 backups.

        The storage for the VMs is on a Synology NAS. The VMs live on one of 3 hosts with similar vintage hardware.

        Per the backup troubleshooting article:
        cat /var/log/SMlog | grep -i exception : no results
        cat /var/log/SMlog | grep -i error : no results
        grep -i coales /var/log/SMlog : lots of messages that say "UNDO LEAF-COALESCE"
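
        If it helps dig further, a rough sketch of pulling the context around those UNDO lines (the -B/-A line counts are arbitrary, adjust as needed):

            # show the 20 lines before and 5 after each UNDO LEAF-COALESCE entry
            grep -i -B 20 -A 5 "UNDO LEAF-COALESCE" /var/log/SMlog | less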

        [screenshot of the SMlog output]

        The host I ran those commands on is the one which houses the Xen Orchestra VM (whose backup also fails).

        The Synology backup remote has 10TB assigned to it with 8.7TB free. The VDI disk volume has 5.4TB of 10TB free.

        Patch-wise, the hosts currently show 6 patches needed, though they were up to date last week.

        XO is on commit 9ed55.

        Any other specifics I can provide?

        Thanks!
        Nick

        • dthenot (Vates 🪐 XCP-ng Team) @nvoss

          @nvoss Hello, the "UNDO LEAF-COALESCE" message usually has its cause listed in the error just above it. Could you share that part please? 🙂

          • nvoss @dthenot

            @dthenot When I grep for coalesce I don't see any errors; everything is the UNDO message.

            Looking at the lines labeled 3680769, corresponding to one of those UNDOs, I see lock opens, a variety of what look like successful mounts, and subsequent snapshot activity, then the UNDO at the end. After the UNDO message I see something that's not super helpful.

            Attached is that entire region; an excerpt is below.

            [screenshot of the SMlog excerpt]

            It's definitely confusing why a force restart of the job works when the regular run doesn't.

            [attachment: Errored Coalesce.txt]

            • dthenot (Vates 🪐 XCP-ng Team) @nvoss

              @nvoss Could you try to run vhd-util check -n /var/run/sr-mount/f23aacc2-d566-7dc6-c9b0-bc56c749e056/3a3e915f-c903-4434-a2f0-cfc89bbe96bf.vhd?
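
              If useful, a rough sketch of checking every VHD on that SR in one go (assuming a file-based SR mounted under /var/run/sr-mount/<SR UUID>, as in the path above):

                  # run vhd-util check on each VHD file and show its output
                  cd /var/run/sr-mount/f23aacc2-d566-7dc6-c9b0-bc56c749e056
                  for f in *.vhd; do
                      echo "== $f"
                      vhd-util check -n "$f"
                  done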

              • nvoss @dthenot

                @dthenot sure, here you go!

                [screenshot of the vhd-util check output]

                • dthenot (Vates 🪐 XCP-ng Team) @nvoss

                  @nvoss The VHD is reported corrupted on the batmap. You can try to repair it with vhd-util repair, but it likely won't work.
                  I have seen people recover from this kind of error by doing a vdi-copy.
                  You could try a VM copy or a VDI copy, link the new VDI to the VM again, and see if it's alright.
                  The corrupted VDI is blocking the garbage collector, so the chains are long, and that's the error you see on the XO side.
                  It might be necessary to remove the chain by hand to resolve the issue.
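
                  A rough sketch of what that could look like from the CLI, assuming the VHD filename matches the VDI UUID (typically the case on file-based SRs); the destination SR UUID is a placeholder:

                      # attempt an in-place repair first (it may well not help)
                      vhd-util repair -n /var/run/sr-mount/f23aacc2-d566-7dc6-c9b0-bc56c749e056/3a3e915f-c903-4434-a2f0-cfc89bbe96bf.vhd

                      # otherwise copy the VDI to another SR, then attach the new copy to the VM
                      xe vdi-copy uuid=3a3e915f-c903-4434-a2f0-cfc89bbe96bf sr-uuid=<destination-sr-uuid>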

                  • nvoss @dthenot

                    @dthenot Every one of our VMs reports this same error on a scheduled backup. Does that mean every one of them has this problem?

                    I'm not sure how it would've happened. It seems like the problem started after a rolling update of the 3 hosts about 2 months back.

                    I'm also not super clear on what the batmap is 🙂 -- just a shade out of my depth!

                    Appreciate all the suggestions though. Happy to try stuff. Migrating the VDI to local storage and back to the NAS, etc.?

                    What would make the force restart work when the scheduled regular runs don't?

                    • dthenot (Vates 🪐 XCP-ng Team) @nvoss

                      @nvoss No, only one VDI is corrupted, the one we ran the check on, and that's what is blocking the GC.
                      All the other VDIs are on long chains because they couldn't coalesce.
                      Sorry, the BATMAP is the block allocation table; it's the VHD metadata that tracks which blocks exist locally in the file.
                      Migrating the VDI might indeed work; I can't really be sure.
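
                      If the VM is running, a minimal sketch of migrating that disk from the CLI (the destination SR UUID is a placeholder):

                          # live-migrate the VDI to another SR
                          xe vdi-pool-migrate uuid=3a3e915f-c903-4434-a2f0-cfc89bbe96bf sr-uuid=<destination-sr-uuid>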

                      • dthenot (Vates 🪐 XCP-ng Team) @nvoss

                        @nvoss said in VDI Chain on Deltas:

                        What would make the force restart work when the scheduled regular runs don't?

                        I'm not sure what you mean.
                        The backup needs to take a snapshot to have a reference point to compare against before exporting data.
                        That snapshot creates a new level of VHD that will need to be coalesced, but the job limits the number of VHDs allowed in the chain, so it fails.
                        This is caused by the fact that the garbage collector can't run, because it can't touch the corrupted VDI.
                        Since there is a corrupted VDI, the GC stays stopped so it doesn't create more problems on the VDI chains.
                        Sometimes corruption means, for example, that we don't know whether a VHD has a parent, and in that case we can't know what the chain looks like, i.e. which VHDs belong to which chain in the SR (Storage Repository).

                        VDI: Virtual Disk Image, in this context.
                        VHD is the format of VDI we currently use in XCP-ng.

                        After removing the corrupted VDI, possibly automatically via the migration process (or you may have to do it by hand), you can run an sr-scan on the SR and it will launch the GC again.
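
                        For reference, a minimal sketch of that last step, assuming the directory name under /var/run/sr-mount is the SR UUID:

                            # rescan the SR, which also kicks the garbage collector off again
                            xe sr-scan uuid=f23aacc2-d566-7dc6-c9b0-bc56c749e056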

                        • nvoss @dthenot

                          @dthenot Sorry, I'm not sure how the "force restart" button (the orange one) works for both our full and our delta backups versus the regular scheduled runs, because doing the force restart lets the job run fully every time, regardless of the specific machine that may have the bad/corrupt disk.

                          And I believe a manual snapshot works on all machines too?

                          Is there a smooth way to track that VHD disk GUID back to its machine in the interface?
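
                          For what it's worth, a rough sketch of doing that from the CLI rather than the interface, assuming the VHD filename is the VDI UUID (usually true on file-based SRs):

                              # which VM(s) is this VDI attached to?
                              xe vbd-list vdi-uuid=3a3e915f-c903-4434-a2f0-cfc89bbe96bf params=vm-name-label,vm-uuid

                              # and the VDI's own name label and SR
                              xe vdi-list uuid=3a3e915f-c903-4434-a2f0-cfc89bbe96bf params=name-label,sr-uuid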
