XCP-ng
    Delta backup fails for specific vm with VDI chain error

    Xen Orchestra
    79 Posts 5 Posters 23.4k Views
    • mbt @olivierlambert

      @olivierlambert said in Delta backup fails for specific vm with VDI chain error:

      Thanks! So here is the logic: leaf coalesce will (or should šŸ˜› ) merge a base copy and its child ONLY if this base copy has only one child.

      Also, here is a good read: https://support.citrix.com/article/CTX201296

      You can check whether your SR has leaf coalesce enabled; there's no reason not to have it, but it's still a check worth doing.

      With "only one child" do you mean no nested child (aka grandchild)?
      As I understand it, leaf-coalesce can be turned off explicitly and is otherwise on implicitly. It wasn't turned off.
      The only thing I could do (I guess) was turn it on explicitly, just to make sure.

      [15:31 rigel ~]# xe sr-param-get uuid=f951f048-dfcb-8bab-8339-463e9c9b708c param-name=other-config param-key=leaf-coalesce
      true
      

      Nothing has changed so far, so I guess I should go on and see what happens this time if I migrate the VM to the other host?
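For the record, explicitly turning the flag on is a one-liner with xe (a sketch; the SR UUID is the same one used in the check above):

```shell
# Explicitly set the leaf-coalesce key in the SR's other-config.
# Coalesce is considered enabled when the key is absent; setting it
# to "true" just makes the state explicit and checkable.
xe sr-param-set uuid=f951f048-dfcb-8bab-8339-463e9c9b708c \
  other-config:leaf-coalesce=true
```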

      • olivierlambert (Vates 🪐 Co-Founder CEO)

        Okay, so it wasn't disabled, which is as it should be.

        To trigger a coalesce, you need to delete a snapshot. So it's trivial to test: create a snapshot, then remove it. You'll then see a VDI that must be coalesced in Xen Orchestra.

        To answer the question: it doesn't matter if the child has a child too. As long as there is only one direct child, coalesce should be triggered.
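Run from the host CLI, that test is roughly the following (a sketch; the VM name-label is a placeholder):

```shell
# Take a throwaway snapshot (xe prints the new snapshot's UUID)...
SNAP=$(xe vm-snapshot vm=<vm-name-label> new-name-label=coalesce-test)

# ...then delete it: removing the snapshot is what queues the
# coalesce work for the garbage collector.
xe snapshot-uninstall uuid="$SNAP" force=true

# The SMGC lines in SMlog show the GC picking up the coalesce:
tail -f /var/log/SMlog | grep SMGC
```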

        • mbt

          That doesn't seem to have any effect on the behaviour other than a bunch of new messages in the log.

          I'll check in a couple of hours. If the behaviour persists, I'll migrate the VM and we'll see how it behaves on the other host.

          • olivierlambert

            Create a snap and display the chain with xapi-explore-sr. Then remove the snap and check again. Something should have changed šŸ™‚
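xapi-explore-sr is an npm tool from the XO team; installing and invoking it looks roughly like this (the invocation is from memory, so double-check --help):

```shell
# Install the chain explorer globally:
npm install -g xapi-explore-sr

# Print the VDI tree of one SR (prompts for the XAPI password);
# --full prints every VDI with UUIDs instead of the colored tree.
xapi-explore-sr <host> <sr-uuid>
xapi-explore-sr --full <host> <sr-uuid>
```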

            • mbt

              It changed from

              rigel: sr (30 VDIs)
              ā”œā”€ā”¬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
              │ ā”œā”€ā”€ customer server 2017 0 - dcdef81b-ec1a-481f-9c66-ea8a9f46b0c8 - 0.01 Gi
              │ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
              │   └─┬ customer server 2017 0 - d7204256-488d-4283-a991-8a59466e4f62 - 24.54 Gi
              │     └─┬ base copy - 1578f775-4f53-4de4-a775-d94f04fbf701 - 0.05 Gi
              │       ā”œā”€ā”€ customer server 2017 0 - 8bcae3c3-15af-4c66-ad49-d76d516e211c - 0.01 Gi
              │       └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
              

              to

              rigel: sr (29 VDIs)
              ā”œā”€ā”¬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
              │ ā”œā”€ā”€ customer server 2017 0 - dcdef81b-ec1a-481f-9c66-ea8a9f46b0c8 - 0.01 Gi
              │ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
              │   └─┬ customer server 2017 0 - d7204256-488d-4283-a991-8a59466e4f62 - 24.54 Gi
              │     └─┬ base copy - 1578f775-4f53-4de4-a775-d94f04fbf701 - 0.05 Gi
              │       └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
              
              • olivierlambert

                Can you use --full? We can't get the colors in a copy/paste from your terminal šŸ™‚

                • mbt

                  A moment later it changed to

                  rigel: sr (28 VDIs)
                  ā”œā”€ā”¬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                  │ ā”œā”€ā”€ customer server 2017 0 - dcdef81b-ec1a-481f-9c66-ea8a9f46b0c8 - 0.01 Gi
                  │ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
                  │   └─┬ base copy - 1578f775-4f53-4de4-a775-d94f04fbf701 - 0.05 Gi
                  │     └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                  

                  Unfortunately I cannot use --full, as it gives me an error:

                  āœ– Maximum call stack size exceeded
                  RangeError: Maximum call stack size exceeded
                      at assign (/usr/lib/node_modules/xapi-explore-sr/node_modules/human-format/index.js:21:19)
                      at humanFormat (/usr/lib/node_modules/xapi-explore-sr/node_modules/human-format/index.js:221:12)
                      at formatSize (/usr/lib/node_modules/xapi-explore-sr/dist/index.js:66:36)
                      at makeVdiNode (/usr/lib/node_modules/xapi-explore-sr/dist/index.js:230:60)
                      at /usr/lib/node_modules/xapi-explore-sr/dist/index.js:241:26
                      at /usr/lib/node_modules/xapi-explore-sr/dist/index.js:101:27
                      at arrayEach (/usr/lib/node_modules/xapi-explore-sr/node_modules/lodash/_arrayEach.js:15:9)
                      at forEach (/usr/lib/node_modules/xapi-explore-sr/node_modules/lodash/forEach.js:38:10)
                      at mapFilter (/usr/lib/node_modules/xapi-explore-sr/dist/index.js:100:25)
                      at makeVdiNode (/usr/lib/node_modules/xapi-explore-sr/dist/index.js:238:15)
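The trace shows the tree printer recursing through makeVdiNode, so one possible workaround (an untested guess) is to raise Node's call stack limit for a single run:

```shell
# --stack-size is a V8 option that raises the maximum call stack
# depth; the path to the CLI script is resolved with command -v.
node --stack-size=8000 "$(command -v xapi-explore-sr)" --full <host> <sr-uuid>
```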
                  
                  
                  • olivierlambert

                    Hmm, strange. Can you try to remove all snapshots on this VM?

                    • mbt

                      Sure. Did it.

                      The SR's advanced tab now displays a depth of 3.

                      rigel: sr (27 VDIs)
                      ā”œā”€ā”¬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                      │ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
                      │   └─┬ customer server 2017 0 - 1d1efc9f-46e3-4b0d-b66c-163d1f262abb - 0.15 Gi
                      │     └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                      

                      This is something new... we may be on to something:

                      Aug 27 16:23:39 rigel SMGC: [11997] Num combined blocks = 255983
                      Aug 27 16:23:39 rigel SMGC: [11997] Coalesced size = 500.949G
                      Aug 27 16:23:39 rigel SMGC: [11997] Coalesce candidate: *775aa9af[VHD](500.000G//319.473G|ao) (tree height 3)
                      Aug 27 16:23:39 rigel SMGC: [11997] Coalescing *775aa9af[VHD](500.000G//319.473G|ao) -> *43454904[VHD](500.000G//500.949G|ao)
                      

                      And after a while:

                      Aug 27 16:26:26 rigel SMGC: [11997] Removed vhd-blocks from *775aa9af[VHD](500.000G//319.473G|ao)
                      Aug 27 16:26:27 rigel SMGC: [11997] Set vhd-blocks = (omitted output) for *775aa9af[VHD](500.000G//319.473G|ao)
                      Aug 27 16:26:27 rigel SMGC: [11997] Set vhd-blocks = eJztzrENgDAAA8H9p/JooaAiVSQkTOCuc+Uf45RxdXc/bf6f99ulHVCWdsDHpR0ALEs7AF4s7QAAgJvSDoCNpR0AAAAAAAAAAAAAALCptAMAYEHaAQAAAAAA/FLaAQAAAAAAALCBA/4EhgU= for *43454904[VHD](500.000G//500.949G|ao)
                      Aug 27 16:26:27 rigel SMGC: [11997] Num combined blocks = 255983
                      Aug 27 16:26:27 rigel SMGC: [11997] Coalesced size = 500.949G
                      

                      Depth is now down to 2 again.
                      xapi-explore-sr --full now works, but the output looks the same to me:

                      rigel: sr (26 VDIs)
                      ā”œā”€ā”¬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                      │ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
                      │   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                      

                      It's busy coalescing. We'll see how that ends.

                      htop — 182Ɨ51 2019-08-27 16-30-47.png

                      • olivierlambert

                        Yeah, 140 MiB/s for coalesce is really not bad šŸ˜› Let's see!

                        • mbt

                          Hm...

                          rigel: sr (rigel) 2019-08-27 17-23-57.png

                          rigel: sr (26 VDIs)
                          ā”œā”€ā”¬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                          │ └─┬ customer server 2017 0 - 8e779c46-6692-4ed2-a83d-7d8b9833704c - 0.19 Gi
                          │   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                          
                          • olivierlambert

                            Yes, it's logical: 7ef76 is the active disk; it should be merged into 8e77, and then that one should be merged into 4345.
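On an LVM-based SR the same parent/child chain can also be read straight from the VHD headers, independent of XAPI (a sketch following the Citrix support article linked earlier; substitute your SR UUID):

```shell
# Scan all VHD logical volumes in the SR's volume group and print
# the parent/child tree (-p); run on the host with the SR attached.
vhd-util scan -f -m "VHD-*" \
  -l VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c -p
```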

                            • mbt

                              But that never seems to happen. It's always just merging the little VHD in the middle:

                              Aug 28 10:00:22 rigel SMGC: [11997] SR f951 ('rigel: sr') (26 VDIs in 9 VHD trees): showing only VHD trees that changed:
                              Aug 28 10:00:22 rigel SMGC: [11997]         *43454904[VHD](500.000G//500.949G|ao)
                              Aug 28 10:00:22 rigel SMGC: [11997]             *3378a834[VHD](500.000G//1.520G|ao)
                              Aug 28 10:00:22 rigel SMGC: [11997]                 7ef76d55[VHD](500.000G//500.984G|ao)
                              Aug 28 10:00:22 rigel SMGC: [11997]
                              Aug 28 10:00:22 rigel SMGC: [11997] Coalescing parent *3378a834[VHD](500.000G//1.520G|ao)
                              
                              ā”œā”€ā”¬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                              │ └─┬ customer server 2017 0 - 3378a834-77d3-48e7-8532-ec107add3315 - 1.52 Gi
                              │   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                              

                              Right before this timestamp and probably just by chance I got this:

                              ā”œā”€ā”¬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                              │ └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                              

                              But still....

                              rigel: sr (rigel) 2019-08-28 10-03-00.png

                              • olivierlambert

                                That's strange: the child is bigger than the parent. I wonder how that's possible, but I forget how the size is computed on LVM (I mainly use file-based backends).

                                You could try a vhd-util repair on those disks. See https://support.citrix.com/article/CTX217757

                                • mbt

                                  The bigger number is equal to the configured virtual disk size.

                                  The repair seems to work only if a disk is not in use, i.e. offline:

                                  [10:24 rigel ~]# lvchange -ay /dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1 
                                  [10:26 rigel ~]# vhd-util repair -n /dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1 
                                  [10:27 rigel ~]# lvchange -an /dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1 
                                    Logical volume VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/VHD-7ef76d55-683d-430f-91e6-39e5cceb9ec1 in use.
                                  
                                  • olivierlambert

                                    Have you tried:

                                    • a repair on both UUIDs in the chain?
                                    • trying again when the VM is halted?
                                    • mbt

                                      I tried what I did last week: I made a copy.

                                      So I had the VM, with no snapshot, in the state described in my last posts. I triggered a full copy with zstd compression to the other host in XO.

                                      The system created a VM snapshot and is currently in the process of copying.

                                      Meanwhile the GC did some stuff and now says

                                      Aug 28 11:19:27 rigel SMGC: [11997] GC process exiting, no work left
                                      Aug 28 11:19:27 rigel SMGC: [11997] SR f951 ('rigel: sr') (25 VDIs in 9 VHD trees): no changes
                                      

                                      xapi-explore-sr says:

                                      rigel: sr (25 VDIs)
                                      ā”œā”€ā”¬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                                      │ ā”œā”€ā”€ customer server 2017 0 - 16f83ba3-ef58-4ae0-9783-1399bb9dea51 - 0.01 Gi
                                      │ └─┬ customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                                      │   └── customer server 2017 0 - 16f83ba3-ef58-4ae0-9783-1399bb9dea51 - 0.01 Gi
                                      

                                      Is it okay for 16f83ba3 to appear twice?

                                      The SR's advanced tab in XO is empty.
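One way to tell whether 16f83ba3 is a single VDI printed twice or two distinct objects is to query it with plain xe (a sketch):

```shell
# A single record here means the tree printer visited the same node
# twice; two records would mean genuinely duplicated VDI objects.
xe vdi-list uuid=16f83ba3-ef58-4ae0-9783-1399bb9dea51 \
  params=uuid,name-label,is-a-snapshot,managed

# On LVHD SRs the VHD parent is exposed in sm-config (vhd-parent):
xe vdi-param-get uuid=16f83ba3-ef58-4ae0-9783-1399bb9dea51 \
  param-name=sm-config
```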

                                      • olivierlambert

                                        Sounds like the chain is fucked up in a way I've never seen before. But I'm not sure about what we're seeing and what it's doing.

                                        Ideally, can you reproduce this bug on a file level SR?

                                        • mbt

                                          Hm... I could move all VMs to one host, add a couple of SAS disks to the other, set up a file-level SR and see how that behaves. I just don't think I'll get it done this week.

                                          P.S.: 16f83ba3 shows up only once in xapi-explore-sr, but twice in xapi-explore-sr --full

                                          • mbt

                                            FYI, in the meantime the copy has finished, XO deleted the snapshot, and now we're back at the start again:

                                            rigel: sr (rigel) 2019-08-28 13-29-18.png

                                            xapi-explore-sr (--full doesn't work at the moment, failing with a "maximum call stack size exceeded" error):

                                            ā”œā”€ā”¬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                                            │ └─┬ customer server 2017 0 - 57b0bec0-7491-472b-b9fe-e3a66d48e1b0 - 0.2 Gi
                                            │   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                                            

                                            P.S.:

                                            Whilst migrating:

                                            Aug 28 13:45:33 rigel SMGC: [5663] No work, exiting
                                            Aug 28 13:45:33 rigel SMGC: [5663] GC process exiting, no work left
                                            Aug 28 13:45:33 rigel SMGC: [5663] SR f951 ('rigel: sr') (25 VDIs in 9 VHD trees): no changes
                                            

                                            So, yeah, foobar šŸ˜‰

