XCP-ng

    Delta backup fails for specific vm with VDI chain error

    Xen Orchestra
    79 Posts 5 Posters 10.6k Views
    • M Offline
      mbt
      last edited by mbt

      Hi!

      We have a pool with two XCP-ng 8.0.0 hosts, both using a local LVM SR (SSD, 5.2 TB capacity each, 1.7 TB allocated). We use XO 5.47.0 from source. Each host currently hosts 5 to 7 VMs.
      During the daily delta backup, one VM currently gets skipped. The job says "Job cancelled to protect VDI chain".
      There's only one snapshot per VM. We already deleted them and live-migrated the VM in question to the other host and back, but nothing changed.

      According to the documentation we ran some checks:

      [10:22 rigel ~]# cat /var/log/SMlog | grep -i exception
      ...
      Aug 21 10:19:19 rigel SM: [11707] Ignoring exception for LV check: /dev/VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/7ef76d55-683d-430f-91e6-39e5cceb9ec1.cbtlog !
      Aug 21 10:19:21 rigel SM: [12061] Ignoring exception for LV check: /dev/VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/7ef76d55-683d-430f-91e6-39e5cceb9ec1.cbtlog !
      Aug 21 10:19:24 rigel SM: [12591] Ignoring exception for LV check: /dev/VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/7ef76d55-683d-430f-91e6-39e5cceb9ec1.cbtlog !
      Aug 21 10:22:18 rigel SM: [14509] Ignoring exception for LV check: /dev/VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/7ef76d55-683d-430f-91e6-39e5cceb9ec1.cbtlog !
      Aug 21 10:22:19 rigel SM: [14706] Ignoring exception for LV check: /dev/VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/7ef76d55-683d-430f-91e6-39e5cceb9ec1.cbtlog !
      Aug 21 10:22:21 rigel SM: [15243] Ignoring exception for LV check: /dev/VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/7ef76d55-683d-430f-91e6-39e5cceb9ec1.cbtlog !
      

      There are no errors in the SMlog, though.
      It seems to me the coalescing doesn't work as intended:

      [10:31 rigel ~]# grep -i coales /var/log/SMlog
      ...
      Aug 21 10:31:08 rigel SMGC: [628] Coalesced size = 500.949G
      Aug 21 10:31:08 rigel SMGC: [628]   Running VHD coalesce on *e25e01be[VHD](500.000G//152.000M|ao)
      Aug 21 10:31:08 rigel SM: [22349] ['/usr/bin/vhd-util', 'coalesce', '--debug', '-n', '/dev/VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/VHD-e25e01be-4e8a-4c69-b6e5-1e004276c744']
      Aug 21 10:31:10 rigel SM: [628] ['/sbin/lvremove', '-f', '/dev/VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/coalesce_e25e01be-4e8a-4c69-b6e5-1e004276c744_1']
      Aug 21 10:31:10 rigel SM: [628] ['/sbin/dmsetup', 'status', 'VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-coalesce_e25e01be--4e8a--4c69--b6e5--1e004276c744_1']
      Aug 21 10:31:11 rigel SMGC: [628] Snapshot-coalesce did not help, abandoning attempts
      Aug 21 10:31:11 rigel SMGC: [628] Set leaf-coalesce = offline for 7ef76d55[VHD](500.000G/270.645M/500.984G|ao)
      Aug 21 10:31:11 rigel SMGC: [628] Removed leaf-coalesce from 7ef76d55[VHD](500.000G/270.645M/500.984G|ao)
      Aug 21 10:31:13 rigel SMGC: [628] Coalesced size = 500.949G
      Aug 21 10:31:13 rigel SMGC: [628] Leaf-coalesce candidate: 7ef76d55[VHD](500.000G/272.649M/500.984G|ao)
      Aug 21 10:31:13 rigel SMGC: [628] Coalesced size = 500.949G
      Aug 21 10:31:13 rigel SMGC: [628] Leaf-coalesce candidate: 7ef76d55[VHD](500.000G/272.649M/500.984G|ao)
      Aug 21 10:31:13 rigel SMGC: [628] Leaf-coalescing 7ef76d55[VHD](500.000G/272.649M/500.984G|ao) -> *43454904[VHD](500.000G//500.949G|ao)
      Aug 21 10:31:16 rigel SMGC: [628] Coalescing parent *734e9652[VHD](500.000G//276.000M|ao)
      Aug 21 10:31:16 rigel SM: [628] ['/sbin/lvcreate', '-n', 'coalesce_734e9652-b66b-41e1-a6b4-16796a732a69_1', '-L', '4', 'VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c', '--addtag', 'journaler', '-W', 'n']
      
      
      [10:33 rigel ~]# ps axf | grep vhd
      27491 pts/10   S+     0:00          \_ grep --color=auto vhd
      26297 ?        R      2:21  \_ /usr/bin/vhd-util check --debug -n /dev/VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/VHD-43454904-e56b-4375-b2fb-40691ab28e12
      

      As I said, migrating the VM to the other host and back did not help.
      We already had this issue with the same VM a few weeks ago. Migrating back and forth did not help either. What helped back then was updating XO (Ubuntu 16.04 LTS) from 5.40.0/5.40.1 to 5.46.0 and rebooting the XO VM. I updated the XO VM yesterday to 5.47.0 and rebooted it, but this time my "trick" didn't work.

      Any idea on how to solve this issue?

      Regards,
      Alex

      • olivierlambertO Offline
        olivierlambert Vates πŸͺ Co-Founder CEO
        last edited by olivierlambert

        Don't migrate in every direction, it's useless. This is not an error but a protection. You need to really understand what coalesce means.

        First, it happens automatically on the host every time a snapshot is removed, whether manually or by an XO backup job. XO doesn't have any info or control on that process, so it's not an XO "thing". In XO, we decided to control the VDI chain length to avoid having a chain growing faster than your SR can coalesce (hence the protection).

        Coalesce isn't done instantly. You can see the disks needing coalesce in XO, in the SR view, "Advanced" tab. Then you just have to wait.

        Especially in your case, coalesce seems to run. So just be patient and check the "Advanced" tab. As soon as it's empty, it means the chain is clean and ready for the next run.

        In other words, your SR can't keep up with the pace of your backup schedule. Triggering the protection avoids the creation of a new snapshot, and it's not a bad thing: it means the VM will be skipped until next time.
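
        If you want to see the chain from dom0 itself, something like this should print it on a local LVM SR (the volume group name is built from the SR UUID, here the one visible in your SMlog above, so adjust it per SR):

        # Print the VHD tree of the SR; on LVM-based SRs the VG is named after the SR UUID
        vhd-util scan -f -m "VHD-*" -l VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c -p

        Hidden VHDs stacked above the active leaf are the ones still waiting to be coalesced.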

        • M Offline
          mbt
          last edited by

          Hi Olivier,

          thank you for taking the time to explain what's going on under the hood - very interesting stuff! πŸ™‚
          So I basically tried to fix a problem that needed no fixing and maybe thereby prolonged the whole process.

          Is there a way to monitor the progress of a leaf-coalesce, and is there a way to speed it up other than shutting the VM down and doing it offline?
          The delta backup job runs daily and I assumed the servers would be fast enough. They've recently been upgraded from 10k SAS drives to SSDs. It's only SATA and just RAID 5, so I don't expect miracles, but it had been running well for the past couple of years.

          All stats are basically telling me the SR load is low: it rarely jumps to 200-300 IOPS, low single-digit MiB/s, IO wait around 0.x, 37 ms write latency, ...

          Regards,
          Alex

          • olivierlambertO Offline
            olivierlambert Vates πŸͺ Co-Founder CEO
            last edited by

            1. There is no way to track the coalesce speed, there's absolutely no API whatsoever to monitor this 😞 You can get a rough idea by knowing your average time to coalesce (remove a snapshot and watch the Advanced view in XO). The bigger the snapshot (old, or a lot of writes since then), the longer the coalesce.
            2. Coalesce is low priority to avoid competing with needed IOPS. However, IDK how they set this up. It's deep in the middle of OCaml code…

            You can rescan the SR to try to trigger it faster. Once it has started, you'll have a vhd process running, so you can use htop to watch it and see the priority and disk R/W of the process.
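
            From dom0 that looks roughly like this (the SR UUID is a placeholder, take it from xe sr-list):

            # Rescan the SR so the GC re-evaluates the VDI chains
            xe sr-scan uuid=<SR_UUID>

            # Find the running coalesce worker; note the PID and watch it in htop
            ps axf | grep "[v]hd-util"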

            • M Offline
              mbt
              last edited by

              OCaml!? Really!? Brrrr....

              OK, it seems there's a vhd-util process constantly consuming a single CPU core, rarely reading or writing any data at all. Probably going through larger chunks of data. After a couple of minutes the process ends and another one starts.

              Let's assume it's doing its thing.

              • olivierlambertO Offline
                olivierlambert Vates πŸͺ Co-Founder CEO
                last edited by

                Again, you can check the "result" on the "Advanced" tab of your SR in XO πŸ™‚

                • AnonabharA Offline
                  Anonabhar
                  last edited by

                  Hi mbt,

                  Recently I have had this problem a lot.

                  In XOA, bring up "Storages" and locate the SR. Click on Advanced and see if that VHD is showing up as needing to be coalesced. Does the depth oscillate between 1, 2, 1, 2, etc.?

                  If so, shut down the VM and force a rescan of the SR.

                  There will be a 5-minute pause between the time the GC goes quiet and when it goes active again. Let the system do an offline coalesce on the VHD and see if that clears it.
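
                  In CLI terms that's roughly the following (VM and SR UUIDs are placeholders):

                  # Cleanly shut the VM down, then force the rescan that wakes up the GC
                  xe vm-shutdown uuid=<VM_UUID>
                  xe sr-scan uuid=<SR_UUID>

                  Then give the GC its few minutes of quiet and watch the Advanced tab (or the SMlog) for the offline coalesce.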

                  ~Peg

                  • AnonabharA Offline
                    Anonabhar
                    last edited by

                    I also find:

                    tail -f /var/log/SMlog | grep SMGC
                    

                    to be my friend... Worrying... But hypnotic to watch at 3AM 8-)

                    • M Offline
                      mbt
                      last edited by

                      Hi Peg,

                      thanks for the information!
                      The situation stayed the same, so the online coalescing for this particular VM still doesn't work and it therefore gets skipped by the daily delta backup job.
                      I think yesterday I saw the depth oscillate for a split second down to 1 and then back to 2.

                      Unfortunately the VM is accessed 24/7 by users in all kinds of time zones around the world. I'll have to schedule a maintenance window with the customer and hope this gets the job done. Not being able to back up a VM makes me nervous.

                      -alex

                      • olivierlambertO Offline
                        olivierlambert Vates πŸͺ Co-Founder CEO
                        last edited by

                        The only reason I know for this case would be creating more new blocks than the SR can coalesce. You can isolate this VM on another SR to see if the situation improves.
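
                        If you want to try that without downtime, storage live migration of just that VDI should do it, roughly (UUIDs are placeholders):

                        # Live-migrate the problem VDI to another SR while the VM keeps running
                        xe vdi-pool-migrate uuid=<VDI_UUID> sr-uuid=<DESTINATION_SR_UUID>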

                        • M Offline
                          mbt
                          last edited by

                          Now something strange happened:

                           To have at least some sort of recent backup I started a VM copy to the other host. It is currently only 10% through, but while it was working I saw some activity in the SMlog and, one after another, the stalled coalesce jobs disappeared.

                          I'll wait for the copy to finish and see if backup is now working again.

                           I'm with you, Olivier, on "creating more new blocks than the SR can coalesce", but there are only double-digit or low triple-digit IOPS on the SR, barely any throughput, etc. So there's very low load. Whenever some of the other VMs' VDIs get coalesced, or when they are backed up, the stats pick up and we get several thousand IOPS, etc.
                          That's why I'd say that bare metal performance can't be the issue here.

                          • olivierlambertO Offline
                            olivierlambert Vates πŸͺ Co-Founder CEO
                            last edited by

                            It's not bare metal perf, it's how coalesce prio is set with hard limits in the code. Feel free to dig inside it, I never had the time to do it myself πŸ™‚

                            • AnonabharA Offline
                              Anonabhar
                              last edited by

                              Hi mbt,

                              Sadly, when things get that messy for me, I find that the only solution is to export the VM and then import it. That's one of the (longest but most reliable) ways of de-leafing the tree.

                              Now, a trick I use if downtime is an issue is to do a Disaster Recovery backup to the same machine or SR. Once the backup is complete, I copy the MAC addresses of the network cards to the new VM (it will complain, but it's OK), clear the blocked operations on the new VM, shut down the original and then start the new one.

                              If you use the above method, you can clear the blocked operations using the CLI:

                              xe vm-param-clear param-name=blocked-operations uuid=<NEW UUID>
                              

                              If you are using the "enterprise" or open source version, you can reduce the amount of data missing between the start of the DR backup and the startup of the new server: perform a Continuous Replication while the server is still running, then, once it has finished, shut the server down, do a second Continuous Replication to sync the images, and then perform the steps I mentioned above.

                              If you do the above, your downtime is in minutes and not longer.
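
                              For the MAC step, the CLI equivalent is roughly the following (a VIF's MAC can't be changed in place, so it is destroy and recreate; all UUIDs and the MAC are placeholders):

                              # Read device number, MAC and network of each VIF on the original VM
                              xe vif-list vm-uuid=<ORIGINAL_VM_UUID> params=device,MAC,network-uuid

                              # Recreate the matching VIF on the new VM with the old MAC
                              xe vif-destroy uuid=<NEW_VM_VIF_UUID>
                              xe vif-create vm-uuid=<NEW_VM_UUID> network-uuid=<NETWORK_UUID> device=0 mac=<ORIGINAL_MAC>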

                              I hope that helps a bit,
                              Peg

                              • olivierlambertO Offline
                                olivierlambert Vates πŸͺ Co-Founder CEO
                                last edited by

                                 We are not aware of that kind of coalesce issue; what kind of SRs do you have?

                                • AnonabharA Offline
                                  Anonabhar
                                  last edited by

                                  Hi Oliver,

                                   The problem is of my own making and I in no way blame XO for this. It's a perfect storm of events that caused my problem. Basically, I had a schedule that did delta backups to a local NFS repo. Later that night, I have 2 other jobs that do Continuous Replication to 2 other DCs. Lastly, I have a monthly backup that spools to another NFS repo that we back up to DLT tapes. Basically what happened is that some of the jobs got delayed, over-ran and were running on top of each other. The GC got overloaded and started to throw "Exception Timeout", and the situation just kept building and got worse. As a consequence, I started to get weird things happening in the running VMs (long delays, sluggishness, crashing of daemons, etc.). Once I realized what was going on, it was too late, and I am now doing the above to 200+ VMs to get it fixed.

                                   We are using LVMoISCSI on EqualLogic PS6000s as the backend.

                                   If you want a laugh, this is what one of my VDI trees looks like:

                                   vhd=VHD-036594ed-6814-4afa-a063-c702d575316e capacity=21474836480 size=11580473344 hidden=1 parent=none
                                      vhd=VHD-0bf17661-b95f-4b5f-adf6-ef5b2ff0a039 capacity=21474836480 size=8388608 hidden=0 parent=VHD-036594ed-6814-4afa-a063-c702d575316e
                                      vhd=VHD-40df70e3-25b9-4875-a995-0f7497924afd capacity=21474836480 size=901775360 hidden=1 parent=VHD-036594ed-6814-4afa-a063-c702d575316e
                                         vhd=VHD-8b409ba8-77f4-4270-a36d-6d4c0091ed3c capacity=21474836480 size=8388608 hidden=0 parent=VHD-40df70e3-25b9-4875-a995-0f7497924afd
                                         vhd=VHD-5ec35406-34c0-4c9c-b92d-80df45f5e6c1 capacity=21474836480 size=1094713344 hidden=1 parent=VHD-40df70e3-25b9-4875-a995-0f7497924afd
                                            vhd=VHD-e039102b-ac33-487a-b38b-ccd60cf83573 capacity=21474836480 size=8388608 hidden=0 parent=VHD-5ec35406-34c0-4c9c-b92d-80df45f5e6c1
                                            vhd=VHD-ccdfe0c5-6044-4131-8ce5-3b2129f083c2 capacity=21474836480 size=843055104 hidden=1 parent=VHD-5ec35406-34c0-4c9c-b92d-80df45f5e6c1
                                               vhd=VHD-26560aef-677c-498f-81e0-91cd791ec2ee capacity=21474836480 size=8388608 hidden=0 parent=VHD-ccdfe0c5-6044-4131-8ce5-3b2129f083c2
                                               vhd=VHD-91abb814-214b-4fe7-aac6-20f0177348cb capacity=21474836480 size=977272832 hidden=1 parent=VHD-ccdfe0c5-6044-4131-8ce5-3b2129f083c2
                                                  vhd=VHD-78cac789-3a46-4d79-87ff-0e8d28c17916 capacity=21474836480 size=8388608 hidden=0 parent=VHD-91abb814-214b-4fe7-aac6-20f0177348cb
                                                  vhd=VHD-aab84809-2165-4af3-9f47-a1b1ff4044f8 capacity=21474836480 size=1019215872 hidden=1 parent=VHD-91abb814-214b-4fe7-aac6-20f0177348cb
                                                     vhd=VHD-cdb89a9e-a583-461c-adc6-7b68c0e88556 capacity=21474836480 size=8388608 hidden=0 parent=VHD-aab84809-2165-4af3-9f47-a1b1ff4044f8
                                                     vhd=VHD-78d28f42-9327-440c-bd72-46b0716252a8 capacity=21474836480 size=268435456 hidden=1 parent=VHD-aab84809-2165-4af3-9f47-a1b1ff4044f8
                                                        vhd=VHD-76360be2-10b7-4be1-8cb6-b2595ed8f5df capacity=21474836480 size=8388608 hidden=0 parent=VHD-78d28f42-9327-440c-bd72-46b0716252a8
                                                        vhd=VHD-492d0a9f-8a1d-469d-8bbc-9bbdd910a83e capacity=21474836480 size=994050048 hidden=1 parent=VHD-78d28f42-9327-440c-bd72-46b0716252a8
                                                           vhd=VHD-9276a1dc-185b-4299-b958-415dc52ee2f8 capacity=21474836480 size=8388608 hidden=0 parent=VHD-492d0a9f-8a1d-469d-8bbc-9bbdd910a83e
                                                           vhd=VHD-c5017f51-309a-4e49-bf42-ce6a838212d2 capacity=21474836480 size=1111490560 hidden=1 parent=VHD-492d0a9f-8a1d-469d-8bbc-9bbdd910a83e
                                                              vhd=VHD-22a8f4fb-11ca-4caa-8439-ac8082482ca3 capacity=21474836480 size=8388608 hidden=0 parent=VHD-c5017f51-309a-4e49-bf42-ce6a838212d2
                                                              vhd=VHD-b0151f77-e694-437b-8ff2-f352871ad37d capacity=21474836480 size=243269632 hidden=1 parent=VHD-c5017f51-309a-4e49-bf42-ce6a838212d2
                                                                 vhd=VHD-25fffec1-fb4d-4717-8bff-c2050eb49aa2 capacity=21474836480 size=8388608 hidden=0 parent=VHD-b0151f77-e694-437b-8ff2-f352871ad37d
                                                                 vhd=VHD-a0e6091b-3560-40bd-85ec-eb251e47366c capacity=21474836480 size=918552576 hidden=1 parent=VHD-b0151f77-e694-437b-8ff2-f352871ad37d
                                                                    vhd=VHD-c40b1169-7d28-464c-9a6f-5309c7e2382d capacity=21474836480 size=8388608 hidden=0 parent=VHD-a0e6091b-3560-40bd-85ec-eb251e47366c
                                                                    vhd=VHD-1038bb56-d4d9-49ef-b2fb-10319423cbd1 capacity=21474836480 size=264241152 hidden=1 parent=VHD-a0e6091b-3560-40bd-85ec-eb251e47366c
                                                                       vhd=VHD-85b6bd0c-da46-41ab-8483-a831beeb0a0a capacity=21474836480 size=8388608 hidden=0 parent=VHD-1038bb56-d4d9-49ef-b2fb-10319423cbd1
                                                                       vhd=VHD-52b51deb-1c26-4101-a6b0-c2fde959c66b capacity=21474836480 size=830472192 hidden=1 parent=VHD-1038bb56-d4d9-49ef-b2fb-10319423cbd1
                                                                          vhd=VHD-4aa2d7d6-cf34-449d-8ee1-98e9b6a19bf5 capacity=21474836480 size=8388608 hidden=0 parent=VHD-52b51deb-1c26-4101-a6b0-c2fde959c66b
                                                                          vhd=VHD-57304603-4a25-4723-93cf-24564d3e96f3 capacity=21474836480 size=264241152 hidden=1 parent=VHD-52b51deb-1c26-4101-a6b0-c2fde959c66b
                                                                             vhd=VHD-e28412b8-af76-4a72-85af-7780a97c7ae4 capacity=21474836480 size=8388608 hidden=0 parent=VHD-57304603-4a25-4723-93cf-24564d3e96f3
                                                                             vhd=VHD-dedb2e9b-78e3-4067-92c9-42e3266c72cd capacity=21474836480 size=1946157056 hidden=1 parent=VHD-57304603-4a25-4723-93cf-24564d3e96f3
                                                                                vhd=VHD-4b697f4c-2169-4d7c-a41a-718fd427745b capacity=21474836480 size=8388608 hidden=0 parent=VHD-dedb2e9b-78e3-4067-92c9-42e3266c72cd
                                                                                vhd=VHD-9ffc4f7c-c27a-4976-8c93-b1522822fab4 capacity=21474836480 size=21525168128 hidden=0 parent=VHD-dedb2e9b-78e3-4067-92c9-42e3266c72cd
                                  

                                  ~Peg

                                  • olivierlambertO Offline
                                    olivierlambert Vates πŸͺ Co-Founder CEO
                                    last edited by

                                     IIRC, we have a protection against the same job re-running on top of itself, but not between jobs.

                                    I'm a bit surprised that our VDI chain protection didn't prevent this scenario.

                                    • AnonabharA Offline
                                      Anonabhar
                                      last edited by

                                       <looks sheepishly as he admits> It probably would have... The Continuous Replication jobs were running on an open source version (different appliance) while the incremental and full (monthly) exports were being run on an Enterprise version.

                                      Not sure if the VDI protection would pick up the other XOA jobs.

                                      ~Peg

                                      • olivierlambertO Offline
                                        olivierlambert Vates πŸͺ Co-Founder CEO
                                        last edited by

                                         Mixing jobs across various XO instances isn't a good idea in any case.

                                        • M Offline
                                          mbt
                                          last edited by

                                           For the record:

                                           After the VM copy, which somehow seems to have triggered the coalescing, the delta backup job and another weekly full backup job ran without any issues.

                                           Today is going to be interesting. The daily deltas only run on weekdays (Mo-Fr), the full backup on Sundays. The deltas ran fine, and the full backup afterwards too. But now I again see stuck VDI coalesce jobs on the SR's Advanced tab.
                                           It looks the same as last Monday. If it doesn't clear during the day I'll try the same trick from Friday and see if that works again.

                                           Interesting that this has now happened twice in a row right after the full backup job.

                                          • M Offline
                                            mbt
                                            last edited by

                                             Same as Friday: I started a copy of the VM and the coalescing immediately began and finished successfully shortly after.
