XCP-ng

    Job canceled to protect the VDI chain

    • McHenry

      Yesterday our backup job started failing for all VMs with the message:
      "Job canceled to protect the VDI chain"

      (screenshot attached)

      I have checked the docs regarding VDI chain protection:
      https://docs.xen-orchestra.com/backup_troubleshooting#vdi-chain-protection

      The xcp-ng logs do not show any errors:

      (log screenshots attached)

      I am using TrueNAS as shared storage.
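
      For context, the storage manager writes its garbage-collector / coalesce activity to /var/log/SMlog in dom0, so something like this can be used to check for errors or ongoing work:

      # Recent garbage-collector / coalesce activity on the host (dom0)
      grep SMGC /var/log/SMlog | tail -n 50

      # Or follow it live while a backlog is being processed
      tail -f /var/log/SMlog | grep --line-buffered SMGC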

      • olivierlambert (Vates 🪐 Co-Founder & CEO)

        Hi,

        Maybe you simply have coalesce still running on your storage.
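
        If it doesn't seem to be progressing, rescanning the SR from dom0 is usually enough to wake the garbage collector up again, something like this (the SR UUID below is a placeholder):

        # Find the SR UUID, then trigger a scan, which also kicks off the GC/coalesce
        xe sr-list params=uuid,name-label
        xe sr-scan uuid=<SR_UUID>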

        • McHenry @olivierlambert

          @olivierlambert

          I think you are correct. When I checked the Health page it showed 46 to coalesce, and then the number started dropping toward zero. Now the backups appear to be running again 🙂

          I have never seen this before and I am curious as to why it appeared yesterday.

          My fear was storage corruption, as with shared storage it would impact all VMs. I checked TrueNAS and everything appears to be healthy.

          (TrueNAS screenshots attached)

          • olivierlambert (Vates 🪐 Co-Founder & CEO)

            Hard to tell, but instead of adding even more pressure, XO automatically skipped the job until your chain had coalesced correctly.
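
            For the curious: on a file-based SR like NFS, you can get a rough idea of how deep a disk's VHD chain currently is by walking the parent links. A minimal sketch, with placeholder UUIDs:

            # dom0, NFS SR: inspect a VHD's parent to see the chain still waiting to coalesce
            cd /var/run/sr-mount/<SR_UUID>
            vhd-util query -n <VDI_UUID>.vhd -p    # prints the parent VHD, if any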

            • McHenry @olivierlambert

              @olivierlambert

              Is it XO or xcp-ng that manages the coalescing? Can more resources be applied to assist?

              • McHenry @olivierlambert

                @olivierlambert

                I spoke too soon. The backups started working; however, the problem has returned.
                (screenshot attached)

                I do see 44 items waiting to coalesce. This is new, as previously these would coalesce quickly without causing this issue.
                (screenshot attached)

                Is there a reason the coalesce is taking longer now, or is there a way I can add resources to speed up the process?

                • McHenry @McHenry

                  I have the following entries in the logs, over and over. Not sure if this is a problem:

                  Oct 29 15:25:08 HST106 SMGC: [1009624] Found 1 orphaned vdis
                  Oct 29 15:25:08 HST106 SM: [1009624] lock: tried lock /var/lock/sm/be743b1c-7803-1943-0a70-baf5fcbfeaaf/sr, acquired: True (exists: True)
                  Oct 29 15:25:08 HST106 SMGC: [1009624] Found 1 VDIs for deletion:
                  Oct 29 15:25:08 HST106 SMGC: [1009624]   *d4a17b38(100.000G/21.652G?)
                  Oct 29 15:25:08 HST106 SMGC: [1009624] Deleting unlinked VDI *d4a17b38(100.000G/21.652G?)
                  Oct 29 15:25:08 HST106 SMGC: [1009624] Checking with slave: ('OpaqueRef:16797af5-c5d1-08d5-0e26-e17149c2807b', 'nfs-on-slave', 'check', {'path': '/var/run/sr-mount/be743b1c-7803-1943-0a70-baf5fcbfeaaf/d4a17b38-5a3c-438a-b394-fcbb64784499.vhd'})
                  Oct 29 15:25:08 HST106 SM: [1009624] lock: released /var/lock/sm/be743b1c-7803-1943-0a70-baf5fcbfeaaf/sr
                  Oct 29 15:25:08 HST106 SM: [1009624] lock: released /var/lock/sm/be743b1c-7803-1943-0a70-baf5fcbfeaaf/running
                  Oct 29 15:25:08 HST106 SMGC: [1009624] GC process exiting, no work left
                  Oct 29 15:25:08 HST106 SM: [1009624] lock: released /var/lock/sm/be743b1c-7803-1943-0a70-baf5fcbfeaaf/gc_active
                  Oct 29 15:25:08 HST106 SMGC: [1009624] In cleanup
                  Oct 29 15:25:08 HST106 SMGC: [1009624] SR be74 ('Shared NAS002') (166 VDIs in 27 VHD trees): no changes
                  Oct 29 15:25:08 HST106 SM: [1009624] lock: closed /var/lock/sm/be743b1c-7803-1943-0a70-baf5fcbfeaaf/running
                  Oct 29 15:25:08 HST106 SM: [1009624] lock: closed /var/lock/sm/be743b1c-7803-1943-0a70-baf5fcbfeaaf/gc_active
                  Oct 29 15:25:08 HST106 SM: [1009624] lock: closed /var/lock/sm/be743b1c-7803-1943-0a70-baf5fcbfeaaf/sr
                  
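                  For reference, counting how often that "orphaned vdis" message appears in the current SMlog gives a rough idea of whether it keeps recurring:

                  # dom0: count GC reports of orphaned VDIs in the current SMlog
                  grep -c "orphaned vdis" /var/log/SMlog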
                  • olivierlambert (Vates 🪐 Co-Founder & CEO)

                    Roughly how long before the coalesce is done? Coalesce is a storage task done directly by XCP-ng; XO is just witnessing it.

                    It's normal to have disks to coalesce after snapshot removal.
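
                    If you want to confirm on the host that the GC is actually busy, a rough check (assuming the GC runs as the storage manager's cleanup process, which is what writes the SMGC lines above):

                    # dom0: is the storage manager's garbage collector currently running?
                    ps aux | grep -i "[c]leanup"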

                    • Pilow @olivierlambert

                      @olivierlambert If it's at the XCP-ng level, is there a ratio at which storage access performance should be especially monitored?

                      I guess that 500 VMs being de-snapshotted on 1 host is handled differently than 500 VMs spread over 50 hosts.

                      But if all hosts are on the same shared storage... is there a performance constraint?

                      • olivierlambert (Vates 🪐 Co-Founder & CEO)

                        Coalescing does indeed generate some IOPS, and latency is in general the most visible impact. However, it's managed per storage: regardless of the number of VDIs, they are coalesced one by one.
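
                        To see that impact while a coalesce is running, watching per-device latency from dom0 is usually the quickest check (assuming the sysstat tools are available there):

                        # dom0: extended device statistics (latency, utilisation), refreshed every 5 seconds
                        iostat -x 5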

                        • Pilow @olivierlambert

                          @olivierlambert So for a high number of VMs, there is a point where you have to be careful not to run DR too frequently, to let the coalesce proceed.
                          With 50 VMs in a DR job every hour, if coalesce takes 2 minutes per VM, it won't have finished when the next DR run starts?

                          I'm looking for the edge case.
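
                          To put numbers on that example: 50 VMs × roughly 2 minutes of coalesce each is about 100 minutes of work per run, which is more than the 60-minute interval, so the backlog would keep growing rather than draining.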

                          • olivierlambert (Vates 🪐 Co-Founder & CEO)

                            This is entirely dependent on your setup; there's no universal rule. But thanks to XO skipping the job, you can adapt your configuration to reduce the number of times it happens.
