XCP-ng

    Delta backup fails for specific vm with VDI chain error

    Xen Orchestra
    79 Posts 5 Posters 10.6k Views 3 Watching
    • olivierlambertO Offline
      olivierlambert Vates 🪐 Co-Founder CEO
      last edited by

      Thanks! So here is the logic: leaf coalesce will (or should 😛 ) merge a base copy and its child ONLY if this base copy has only one child.

      Also, here is a good read: https://support.citrix.com/article/CTX201296

      You can check if your SR has leaf coalesce enabled. There's no reason not to have it, but it's still worth checking.
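
      For reference, a quick way to check (and, if needed, explicitly set) the flag from dom0 looks like this; <SR_UUID> is a placeholder and the name-label is just an example:

      # find the SR UUID
      xe sr-list name-label="Local storage" params=uuid

      # check whether leaf-coalesce is explicitly set in other-config
      # (if the key is absent, leaf coalesce is on by default)
      xe sr-param-get uuid=<SR_UUID> param-name=other-config param-key=leaf-coalesce

      # re-enable it explicitly if it was set to false
      xe sr-param-set uuid=<SR_UUID> other-config:leaf-coalesce=true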

      • _danielgurgel_ Offline
        _danielgurgel
        last edited by

        GC is an architectural problem in XenServer / Citrix Hypervisor. I've been fighting with Citrix about this for a long time, and I've never seen this problem actually solved, nor any troubleshooting documentation that really works.

        With Enterprise | Premium support, the procedure is always a FULL COPY of the VM, which is unfeasible in most cases for a problem that recurs so often.

        On CH 8 (fully updated) I have the same problems, and I started having problems in other 7.1 CU2 pools after installing XS71ECU2009. Until then the process had been stable for a while; after installing it I went back to having problems, and unfortunately reinstalling or rolling back is not feasible... We opted to upgrade to CH 8, but the problem remained...

        • olivierlambertO Offline
          olivierlambert Vates 🪐 Co-Founder CEO
          last edited by

           Yeah, that's why we are focusing on SMAPIv3 instead of trying to "fix" something that's probably flawed by design on slow SRs (in general, it works relatively well on SSDs).

          • _danielgurgel_ Offline
            _danielgurgel @olivierlambert
            last edited by

            @olivierlambert said in Delta backup fails for specific vm with VDI chain error:

            SMAPIv3

             But @olivierlambert, as in other posts: even with a full-SSD array (SC5020F) the coalesce process has failed for us... When you talk about SMAPIv3, is this something that will be implemented exclusively in XCP-ng, or will it be inherited from CH 8.x?

            • olivierlambertO Offline
              olivierlambert Vates 🪐 Co-Founder CEO
              last edited by

               @_danielgurgel if coalesce fails even on SSDs, something specific must be causing the issue. The majority of users don't have this problem, so I suppose the thing to do is find what could be causing it.

               SMAPIv3 is done by Citrix, but we are doing work on our side (staying as close to upstream as possible, which got harder since Citrix closed some sources). As soon as we have something that people can test, we'll push it into testing 🙂

              • M Offline
                mbt @olivierlambert
                last edited by

                @olivierlambert said in Delta backup fails for specific vm with VDI chain error:

                Thanks! So here is the logic: leaf coalesce will (or should 😛 ) merge a base copy and its child ONLY if this base copy has only one child.

                Also, here is a good read: https://support.citrix.com/article/CTX201296

                You can check if your SR has leaf coalesce enabled. There's no reason not to have it, but it's still worth checking.

                With "only one child" you mean no nested child (aka grandchild)?
                As I understand leaf-coalesce can be turned off explicitly and otherwise is on implicitely. It wasn't turned off.
                Only thing I could do (I guess) was turn it on explicitely - just to make sure.

                [15:31 rigel ~]# xe sr-param-get uuid=f951f048-dfcb-8bab-8339-463e9c9b708c param-name=other-config param-key=leaf-coalesce
                true
                

                Nothing has changed so far, so I guess I should go on and see what happens this time if I migrate the VM to the other host?
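
                If it comes to that, the migration itself would be something like the line below (VM and host names are just placeholders):

                # live-migrate the VM to another host in the pool
                xe vm-migrate vm="customer server 2017" host=<other-host> live=true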

                • olivierlambertO Offline
                  olivierlambert Vates 🪐 Co-Founder CEO
                  last edited by

                   Okay, so it wasn't disabled, as it should be.

                   To trigger a coalesce, you need to delete a snapshot. So it's trivial to test: create a snapshot, then remove it. Then you'll see a VDI that must be coalesced in Xen Orchestra.

                   To answer the question: it doesn't matter if the child has a child too. As long as there is only one direct child, coalesce should be triggered.
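
                   Something like this from dom0 is enough to test it (the VM name is an example; double-check the snapshot-uninstall arguments on your version):

                   # create a throw-away snapshot (prints the new snapshot's UUID)
                   xe vm-snapshot vm="customer server 2017" new-name-label=coalesce-test

                   # remove it again; the GC should then coalesce the leftover base copy
                   xe snapshot-uninstall snapshot-uuid=<UUID printed above> force=true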

                  • M Offline
                    mbt
                    last edited by

                     That doesn't seem to have any effect on the behaviour other than a bunch of new messages in the log.

                    I'll check in a couple of hours. If the behaviour persists I'll migrate the vm and we'll see how it behaves on the other host.
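
                     To watch those messages live, following the storage manager log on the host is enough:

                     # SMGC is the garbage-collector/coalesce tag in the SM log
                     tail -f /var/log/SMlog | grep -i smgc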

                    • olivierlambertO Offline
                      olivierlambert Vates 🪐 Co-Founder CEO
                      last edited by

                      Create a snap, display the chain with xapi-explore-sr. Then remove the snap, and check again. Something should have changed 🙂
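
                       (If xapi-explore-sr isn't installed yet, it's an npm tool; the invocation is roughly as below, using the SR UUID from earlier. Check --help for the exact argument order on your version.)

                       npm install --global xapi-explore-sr

                       # roughly: xapi-explore-sr [--full] <SR UUID> <pool master> <user> [<password>]
                       xapi-explore-sr --full f951f048-dfcb-8bab-8339-463e9c9b708c <pool-master-address> root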

                      • M Offline
                        mbt
                        last edited by

                        It changed from

                        rigel: sr (30 VDIs)
                        ├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                        │ ├── customer server 2017 0 - dcdef81b-ec1a-481f-9c66-ea8a9f46b0c8 - 0.01 Gi
                        │ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
                        │   └─┬ customer server 2017 0 - d7204256-488d-4283-a991-8a59466e4f62 - 24.54 Gi
                        │     └─┬ base copy - 1578f775-4f53-4de4-a775-d94f04fbf701 - 0.05 Gi
                        │       ├── customer server 2017 0 - 8bcae3c3-15af-4c66-ad49-d76d516e211c - 0.01 Gi
                        │       └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                        

                        to

                        rigel: sr (29 VDIs)
                        ├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                        │ ├── customer server 2017 0 - dcdef81b-ec1a-481f-9c66-ea8a9f46b0c8 - 0.01 Gi
                        │ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
                        │   └─┬ customer server 2017 0 - d7204256-488d-4283-a991-8a59466e4f62 - 24.54 Gi
                        │     └─┬ base copy - 1578f775-4f53-4de4-a775-d94f04fbf701 - 0.05 Gi
                        │       └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                        
                        • olivierlambertO Offline
                          olivierlambert Vates 🪐 Co-Founder CEO
                          last edited by

                           Can you use --full? We can't see the colors in a copy/paste from your terminal 🙂

                          • M Offline
                            mbt
                            last edited by

                            A moment later it changed to

                            rigel: sr (28 VDIs)
                            ├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                            │ ├── customer server 2017 0 - dcdef81b-ec1a-481f-9c66-ea8a9f46b0c8 - 0.01 Gi
                            │ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
                            │   └─┬ base copy - 1578f775-4f53-4de4-a775-d94f04fbf701 - 0.05 Gi
                            │     └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                            

                             Unfortunately I cannot use --full, as it gives me an error:

                            ✖ Maximum call stack size exceeded
                            RangeError: Maximum call stack size exceeded
                                at assign (/usr/lib/node_modules/xapi-explore-sr/node_modules/human-format/index.js:21:19)
                                at humanFormat (/usr/lib/node_modules/xapi-explore-sr/node_modules/human-format/index.js:221:12)
                                at formatSize (/usr/lib/node_modules/xapi-explore-sr/dist/index.js:66:36)
                                at makeVdiNode (/usr/lib/node_modules/xapi-explore-sr/dist/index.js:230:60)
                                at /usr/lib/node_modules/xapi-explore-sr/dist/index.js:241:26
                                at /usr/lib/node_modules/xapi-explore-sr/dist/index.js:101:27
                                at arrayEach (/usr/lib/node_modules/xapi-explore-sr/node_modules/lodash/_arrayEach.js:15:9)
                                at forEach (/usr/lib/node_modules/xapi-explore-sr/node_modules/lodash/forEach.js:38:10)
                                at mapFilter (/usr/lib/node_modules/xapi-explore-sr/dist/index.js:100:25)
                                at makeVdiNode (/usr/lib/node_modules/xapi-explore-sr/dist/index.js:238:15)
                            
                            
                            • olivierlambertO Offline
                              olivierlambert Vates 🪐 Co-Founder CEO
                              last edited by

                              Hmm strange. Can you try to remove all snapshots on this VM?

                              • M Offline
                                mbt
                                last edited by

                                Sure. Did it.

                                 The SR's advanced tab now displays a depth of 3.

                                rigel: sr (27 VDIs)
                                ├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                                │ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
                                │   └─┬ customer server 2017 0 - 1d1efc9f-46e3-4b0d-b66c-163d1f262abb - 0.15 Gi
                                │     └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                                

                                This is something new.. we may be on to something:

                                Aug 27 16:23:39 rigel SMGC: [11997] Num combined blocks = 255983
                                Aug 27 16:23:39 rigel SMGC: [11997] Coalesced size = 500.949G
                                Aug 27 16:23:39 rigel SMGC: [11997] Coalesce candidate: *775aa9af[VHD](500.000G//319.473G|ao) (tree height 3)
                                Aug 27 16:23:39 rigel SMGC: [11997] Coalescing *775aa9af[VHD](500.000G//319.473G|ao) -> *43454904[VHD](500.000G//500.949G|ao)
                                

                                And after a while:

                                Aug 27 16:26:26 rigel SMGC: [11997] Removed vhd-blocks from *775aa9af[VHD](500.000G//319.473G|ao)
                                Aug 27 16:26:27 rigel SMGC: [11997] Set vhd-blocks = (omitted output) for *775aa9af[VHD](500.000G//319.473G|ao)
                                Aug 27 16:26:27 rigel SMGC: [11997] Set vhd-blocks = eJztzrENgDAAA8H9p/JooaAiVSQkTOCuc+Uf45RxdXc/bf6f99ulHVCWdsDHpR0ALEs7AF4s7QAAgJvSDoCNpR0AAAAAAAAAAAAAALCptAMAYEHaAQAAAAAA/FLaAQAAAAAAALCBA/4EhgU= for *43454904[VHD](500.000G//500.949G|ao)
                                Aug 27 16:26:27 rigel SMGC: [11997] Num combined blocks = 255983
                                Aug 27 16:26:27 rigel SMGC: [11997] Coalesced size = 500.949G
                                

                                 Depth is now down to 2 again.
                                 xapi-explore-sr --full now works, but the output looks the same to me:

                                rigel: sr (26 VDIs)
                                ├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                                │ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
                                │   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                                

                                It's busy coalescing. We'll see how that ends.

                                htop — 182×51 2019-08-27 16-30-47.png

                                • olivierlambertO Offline
                                  olivierlambert Vates 🪐 Co-Founder CEO
                                  last edited by

                                  Yeah, 140MiB/s for coalesce is really not bad 😛 Let's see!

                                  • M Offline
                                    mbt
                                    last edited by

                                    Hm...

                                    rigel: sr (rigel) 2019-08-27 17-23-57.png

                                    rigel: sr (26 VDIs)
                                    ├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                                    │ └─┬ customer server 2017 0 - 8e779c46-6692-4ed2-a83d-7d8b9833704c - 0.19 Gi
                                    │   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                                    
                                    • olivierlambertO Offline
                                      olivierlambert Vates 🪐 Co-Founder CEO
                                      last edited by

                                       Yes, it's logical: 7ef76 is the active disk, and it should be merged into 8e77, then that one should be merged into 4345.

                                      • M Offline
                                        mbt
                                        last edited by

                                        But that never seems to happen. It's always just merging the little VHD in the middle:

                                        Aug 28 10:00:22 rigel SMGC: [11997] SR f951 ('rigel: sr') (26 VDIs in 9 VHD trees): showing only VHD trees that changed:
                                        Aug 28 10:00:22 rigel SMGC: [11997]         *43454904[VHD](500.000G//500.949G|ao)
                                        Aug 28 10:00:22 rigel SMGC: [11997]             *3378a834[VHD](500.000G//1.520G|ao)
                                        Aug 28 10:00:22 rigel SMGC: [11997]                 7ef76d55[VHD](500.000G//500.984G|ao)
                                        Aug 28 10:00:22 rigel SMGC: [11997]
                                        Aug 28 10:00:22 rigel SMGC: [11997] Coalescing parent *3378a834[VHD](500.000G//1.520G|ao)
                                        
                                        ├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                                        │ └─┬ customer server 2017 0 - 3378a834-77d3-48e7-8532-ec107add3315 - 1.52 Gi
                                        │   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                                        

                                        Right before this timestamp and probably just by chance I got this:

                                        ├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
                                        │ └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
                                        

                                        But still....

                                        rigel: sr (rigel) 2019-08-28 10-03-00.png

                                        • olivierlambertO Offline
                                          olivierlambert Vates 🪐 Co-Founder CEO
                                          last edited by

                                           That's strange: the child is bigger than the parent. I wonder how that's possible, but I forget how the size is computed on LVM (I'm mainly using the file backend).

                                          You could try to do a vhd-util repair on those disks. See https://support.citrix.com/article/CTX217757
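
                                           Before repairing, something like the commands below can dump the whole VHD tree of the LVM SR and check a single VHD (UUIDs taken from the chain above; flags may vary per version, see vhd-util's built-in help):

                                           # print the VHD parent/child tree of the LVM SR
                                           vhd-util scan -f -m "VHD-*" -l VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c -p

                                           # check a single VHD for metadata problems (its LV must be activated first on an LVM SR)
                                           vhd-util check -n /dev/VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/VHD-7ef76d55-683d-430f-91e6-39e5cceb9ec1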

                                          • M Offline
                                            mbt
                                            last edited by mbt

                                            The bigger number is equal to the configured virtual disk size.

                                             The repair seems to work only if a disk is not in use, i.e. offline:

                                            [10:24 rigel ~]# lvchange -ay /dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1 
                                            [10:26 rigel ~]# vhd-util repair -n /dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1 
                                            [10:27 rigel ~]# lvchange -an /dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1 
                                              Logical volume VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/VHD-7ef76d55-683d-430f-91e6-39e5cceb9ec1 in use.
                                            