Delta backup fails for specific vm with VDI chain error
-
Sure. Did it.
The SR's advanced tab now displays a depth of 3.
rigel: sr (27 VDIs)
├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
│ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
│   └─┬ customer server 2017 0 - 1d1efc9f-46e3-4b0d-b66c-163d1f262abb - 0.15 Gi
│     └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
This is something new... we may be on to something:
Aug 27 16:23:39 rigel SMGC: [11997] Num combined blocks = 255983
Aug 27 16:23:39 rigel SMGC: [11997] Coalesced size = 500.949G
Aug 27 16:23:39 rigel SMGC: [11997] Coalesce candidate: *775aa9af[VHD](500.000G//319.473G|ao) (tree height 3)
Aug 27 16:23:39 rigel SMGC: [11997] Coalescing *775aa9af[VHD](500.000G//319.473G|ao) -> *43454904[VHD](500.000G//500.949G|ao)
And after a while:
Aug 27 16:26:26 rigel SMGC: [11997] Removed vhd-blocks from *775aa9af[VHD](500.000G//319.473G|ao)
Aug 27 16:26:27 rigel SMGC: [11997] Set vhd-blocks = (omitted output) for *775aa9af[VHD](500.000G//319.473G|ao)
Aug 27 16:26:27 rigel SMGC: [11997] Set vhd-blocks = eJztzrENgDAAA8H9p/JooaAiVSQkTOCuc+Uf45RxdXc/bf6f99ulHVCWdsDHpR0ALEs7AF4s7QAAgJvSDoCNpR0AAAAAAAAAAAAAALCptAMAYEHaAQAAAAAA/FLaAQAAAAAAALCBA/4EhgU= for *43454904[VHD](500.000G//500.949G|ao)
Aug 27 16:26:27 rigel SMGC: [11997] Num combined blocks = 255983
Aug 27 16:26:27 rigel SMGC: [11997] Coalesced size = 500.949G
Depth is now down to 2 again.
xapi-explore-sr --full now works, but looks the same to me:

rigel: sr (26 VDIs)
├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
│ └─┬ base copy - 775aa9af-f731-45e0-a649-045ab1983935 - 318.47 Gi
│   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
It's busy coalescing. We'll see how that ends.
-
Yeah, 140 MiB/s for a coalesce is really not bad. Let's see!
-
Hm...
rigel: sr (26 VDIs)
├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
│ └─┬ customer server 2017 0 - 8e779c46-6692-4ed2-a83d-7d8b9833704c - 0.19 Gi
│   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
-
Yes, it's logical: 7ef76 is the active disk, and it should be merged into 8e77, then this last one should be merged into 4345.
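If you want to cross-check that chain from the VHD side, here's a quick sketch (assuming the usual /dev/VG_XenStorage-<SR UUID>/VHD-<VDI UUID> layout of an LVM SR and vhd-util query's -p/-d flags; the VG path is a placeholder to fill in):

# Sketch only: each LV must be activated (lvchange -ay) before querying.
VG=/dev/VG_XenStorage-<your-SR-UUID>
for uuid in 7ef76d55-683d-430f-91e6-39e5cceb9ec1 \
            8e779c46-6692-4ed2-a83d-7d8b9833704c \
            43454904-e56b-4375-b2fb-40691ab28e12; do
    echo "== $uuid"
    vhd-util query -n "$VG/VHD-$uuid" -p   # parent VHD, if any
    vhd-util query -n "$VG/VHD-$uuid" -d   # chain depth
done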
-
But that never seems to happen. It's always just merging the little VHD in the middle:
Aug 28 10:00:22 rigel SMGC: [11997] SR f951 ('rigel: sr') (26 VDIs in 9 VHD trees): showing only VHD trees that changed:
Aug 28 10:00:22 rigel SMGC: [11997]         *43454904[VHD](500.000G//500.949G|ao)
Aug 28 10:00:22 rigel SMGC: [11997]             *3378a834[VHD](500.000G//1.520G|ao)
Aug 28 10:00:22 rigel SMGC: [11997]                 7ef76d55[VHD](500.000G//500.984G|ao)
Aug 28 10:00:22 rigel SMGC: [11997]
Aug 28 10:00:22 rigel SMGC: [11997] Coalescing parent *3378a834[VHD](500.000G//1.520G|ao)
├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
│ └─┬ customer server 2017 0 - 3378a834-77d3-48e7-8532-ec107add3315 - 1.52 Gi
│   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
Right before this timestamp, and probably just by chance, I got this:
├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
│ └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
But still....
-
That's strange: the child is bigger than the parent. I wonder how that's possible, but I forget how the size is computed on LVM (I'm mainly using the file backend).
You could try a vhd-util repair on those disks. See https://support.citrix.com/article/CTX217757
-
The bigger number is equal to the configured virtual disk size.
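For reference, both numbers can be read straight from the VHD itself (sketch only; I'm assuming vhd-util query's -v/-s flags, and the LV has to be activated first, as in the commands below):

LV=/dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1
vhd-util query -n "$LV" -v   # virtual size = the configured disk size
vhd-util query -n "$LV" -s   # physical utilisation = what is actually allocated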
The repair seems to work only if a disk is not in use, e.g. offline:
[10:24 rigel ~]# lvchange -ay /dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1
[10:26 rigel ~]# vhd-util repair -n /dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1
[10:27 rigel ~]# lvchange -an /dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1
  Logical volume VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c/VHD-7ef76d55-683d-430f-91e6-39e5cceb9ec1 in use.
-
Have you tried:
- repair on both UUIDs in the chain?
- trying again when the VM is halted?
A rough sketch of both is below.
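Something along these lines, with the VM shut down so the lvchange -an actually succeeds (sketch only: the two device-mapper paths are built from the UUIDs in your chain, and vhd-util check is just a read-only sanity pass before the repair):

# Sketch: run with the VM halted; swap in whichever two UUIDs are in the chain.
VG=/dev/mapper/VG_XenStorage--f951f048--dfcb--8bab--8339--463e9c9b708c
for lv in "$VG-VHD--7ef76d55--683d--430f--91e6--39e5cceb9ec1" \
          "$VG-VHD--43454904--e56b--4375--b2fb--40691ab28e12"; do
    lvchange -ay "$lv"
    vhd-util check -n "$lv"    # read-only consistency check first
    vhd-util repair -n "$lv"   # then fix the footer if needed
    lvchange -an "$lv"
done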
-
I tried what I did last week: I made a copy.
So I had the VM, with no snapshot, in the state described in my last posts. I triggered a full copy with zstd compression to the other host in XO.
The system created a VM snapshot and is currently in the process of copying.
Meanwhile the GC did some stuff and now says:
Aug 28 11:19:27 rigel SMGC: [11997] GC process exiting, no work left
Aug 28 11:19:27 rigel SMGC: [11997] SR f951 ('rigel: sr') (25 VDIs in 9 VHD trees): no changes
xapi-explore-sr says:
rigel: sr (25 VDIs)
├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
│ ├── customer server 2017 0 - 16f83ba3-ef58-4ae0-9783-1399bb9dea51 - 0.01 Gi
│ └─┬ customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
│   └── customer server 2017 0 - 16f83ba3-ef58-4ae0-9783-1399bb9dea51 - 0.01 Gi
Is it okay for 16f83ba3 to appear twice?
The SR's advanced tab in XO is empty.
-
Sounds like the chain is fucked up in a way I've never seen. But I'm not sure what we're seeing or what it's doing.
Ideally, can you reproduce this bug on a file level SR?
-
Hm... I could move all VMs to one host and add a couple of SAS disks to the other, set up a file-level SR and see how that behaves. I just don't think I'll get it done this week.
P.S.: 16f83ba3 shows up only once in xapi-explore-sr, but twice in xapi-explore-sr --full
-
FYI, in the meantime the copy has finished, XO deleted the snapshot and now we're back at the start again:
xapi-explore-sr (--full doesn't work at the moment, with a "maximum call stack size exceeded" error):
├─┬ customer server 2017 0 - 43454904-e56b-4375-b2fb-40691ab28e12 - 500.95 Gi
│ └─┬ customer server 2017 0 - 57b0bec0-7491-472b-b9fe-e3a66d48e1b0 - 0.2 Gi
│   └── customer server 2017 0 - 7ef76d55-683d-430f-91e6-39e5cceb9ec1 - 500.98 Gi
P.S.:
Whilst migrating:
Aug 28 13:45:33 rigel SMGC: [5663] No work, exiting
Aug 28 13:45:33 rigel SMGC: [5663] GC process exiting, no work left
Aug 28 13:45:33 rigel SMGC: [5663] SR f951 ('rigel: sr') (25 VDIs in 9 VHD trees): no changes
So, yeah, foobar
-
Migrated the VM to the other host, waited and watched the logs, etc.
The behaviour stays the same: it's constantly coalescing but never gets to an end. Depth in the advanced tab stays at 2. So I guess the next step will be to set up an additional ext3 SR.
P.S.:
You said "file level SR", so I could also use NFS, right?
Setting up NFS on the 10GE SSD NAS would indeed be easier than adding drives to a host...
-
mbt,
That is the weirdest thing I have seen (and I think I hold the record for causing storage-related problems 8-) ).
Look, I know this is going to sound weird, but try making a copy of the VM without using the "copy" function: create a disaster recovery backup job to copy the VM instead. The reason I'm suggesting this is that XO appears to create a "stream" and effectively exports and imports the VM at the same time, whereas I believe the copy function is handled very differently in XAPI. This should break all association between the old "borked" VDI and the new one.
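If you want to try the same idea by hand, the rough CLI equivalent of that stream is a plain export followed by an import (sketch only; the file path is a placeholder and the UUIDs need to be filled in; a DR job streams it directly without the intermediate file, but the effect on the VDI association should be similar):

xe vm-export vm=<vm-uuid> filename=/mnt/scratch/customer-server-2017.xva
xe vm-import filename=/mnt/scratch/customer-server-2017.xva sr-uuid=<target-sr-uuid>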
I would be really interested to see if that fixes the problem for you
~Peg
-
Hi Peg,
believe me, I'd rather not be after your record
Disaster recovery does not work for this vm: "Job canceled to protect the VDI chain"
My guess: as long as XO checks the VDI chain for potential problems before each VM's backup, no backup-ng mechanism will back up this VM.
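For what it's worth, the chain that protection reacts to can also be dumped directly in dom0; I believe this is roughly the scan SM itself runs (sketch; -p just pretty-prints the parent tree):

vhd-util scan -f -m "VHD-*" -l VG_XenStorage-f951f048-dfcb-8bab-8339-463e9c9b708c -p
-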
We have an option to force it, but this is a protection, and it just puts your issue with this SR in plain sight.
I still don't know the root cause. I think it will be hard to learn more without remotely accessing the host myself and doing a lot of digging.
-
I have now migrated the VM's disk to NFS.
xapi-explore-sr looks like this (--full not working most of the time):
NASa-sr (3 VDIs)
└─┬ customer server 2017 0 - 6d1b49d2-51e1-4ad4-9a3b-95012e356aa3 - 500.94 Gi
  └─┬ customer server 2017 0 - f2def08c-cf2e-4a85-bed8-f90bd11dd585 - 43.05 Gi
    └── customer server 2017 0 - 9ad7e1c4-b7cb-4a1b-bb74-6395444da2b7 - 43.04 Gi
On the NFS it looks like this:
I'll wait a second for a coalesce job that is currently in progress...
OK, the job has finished. No VDIs to coalesce in the advanced tab.
NFS looks like this:
xapi-explore-sr --full says:
NASa-sr (1 VDIs)
└── Customer server 2017 0 - 9ad7e1c4-b7cb-4a1b-bb74-6395444da2b7 - 43.04 Gi
-
So coalesce is working as expected on the NFS SR. For some reason, your original LVM SR can't make a proper coalesce.
-
But only for this VM; all the others are running (and backing up) fine.
What are the chances that coalesce will work fine again if I migrate the disk back to the local LVM SR?
-
Try again now that you've got a clean chain.
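If you'd rather do the move back from the CLI, a single-VDI storage migration should do it (sketch only; the VDI UUID is the one shown on the NFS SR above, and the SR UUID is inferred from the VG_XenStorage-f951f048-... name earlier in the thread, so double-check both; IIRC the VM has to be running for vdi-pool-migrate, otherwise just use XO's migrate button again):

xe vdi-pool-migrate uuid=9ad7e1c4-b7cb-4a1b-bb74-6395444da2b7 sr-uuid=f951f048-dfcb-8bab-8339-463e9c9b708c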