XCP-ng

    Every VM in a CR backup job creates an "Unhealthy VDI"

      joeymorin

      Greetings,

      I'm experimenting with CR backups in a test environment. I have a nightly CR backup job, currently for 4 VMs, all going to the same SR, '4TB on antoni'. On the first incremental (the second backup after the initial full), an unhealthy VDI is reported under dashboard/health... one for every VM in the job. Each subsequent incremental results in an additional reported unhealthy VDI, again one for each VM.

      For example:
      [screenshot: 8629d627-af26-42c0-b5a4-955aebc686a6-image.png]
      The following VMs each currently have the initial full, and three subsequent incrementals in the CR chain:

      • HR-FS
      • maryjane
      • zuul

      Note that there are three reported unhealthy VDIs for each.

      The remaining VM, exocomp, currently has only 1 incremental after the initial full, and there is one reported unhealthy VDI for that VM.

      Is this normal? If not, what details can I provide that might help get to the bottom of this?

        Andrew (Top contributor) @joeymorin

        @joeymorin That's correct. They need time to coalesce after snapshots change. Length of 1 is normal. They should clear up after a few minutes.
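
        If you want to look at the chain yourself, something along these lines run on the host should print the VHD tree for the target SR (a sketch assuming a file-based SR mounted under /var/run/sr-mount; <SR-UUID> is a placeholder for the UUID of '4TB on antoni'):

        # Print the parent/child VHD tree for the SR; a deep chain that never
        # shrinks suggests coalesce isn't keeping up.
        vhd-util scan -f -m "/var/run/sr-mount/<SR-UUID>/*.vhd" -p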

          joeymorin @Andrew

          @Andrew, they do not clear up. Please read my OP carefully and look at the screenshot. They remain forever. They accumulate, one for each VM for every incremental. Nightly CR, four VMs, four more unhealthy VDIs. Tomorrow night, four more, etc.

            olivierlambert (Vates 🪐 Co-Founder & CEO)

            We made some fixes very recently (yesterday). Can you check that you're on the latest commit (if you're running XO from the sources)?

              joeymorin @olivierlambert

              I rebuild XO nightly at 11:25 UTC.

              Would these fixes stop the accumulation of unhealthy VDIs for existing CR chains that are already manifesting them, or should I purge all of the CR VMs and snapshots?

              As I type, I'm on 2d066, which is the latest. The CR job runs at 02:00 UTC, so it had just run when I posted my OP. All of the unhealthy VDIs reported then are still reported now.
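
              (For reference, I check which commit the build is on with something like this; /opt/xen-orchestra is simply where my XO sources happen to live, adjust for yours:)

              # Short hash of the currently checked-out xen-orchestra commit
              git -C /opt/xen-orchestra rev-parse --short HEAD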

                olivierlambert (Vates 🪐 Co-Founder & CEO)

                You have to check your host SMlog to see if you have a coalesce issue
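
                Something along these lines on each host will show whether the GC hits trouble during or after the job (the path is the XCP-ng default; adjust the pattern as you like):

                # Follow the storage manager log and surface coalesce/GC problems as they happen
                tail -f /var/log/SMlog | grep -i -E 'coalesce|exception|error'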

                  joeymorin @olivierlambert

                  Three separate hosts are involved. HR-FS and zuul are on one, maryjane on the second, exocomp on the third.

                  In total, there are over 17,000 lines in SMlog for the hour covering the CR job. No errors, no corruptions, no exceptions.

                  Actually, there are some reported exceptions and corruptions on farmer, but none that involve these VMs or this CR job. A fifth VM, not part of the job, has a corruption that I'm still investigating, but it's a test VM I don't care about. The VM HR-FS does have a long-standing coalesce issue where two .vhd files always remain, the logs showing:

                  FAILED in util.pread: (rc 22) stdout: '/var/run/sr-mount/7bc12cff- ... -ce096c635e66.vhd not created by xen; resize not supported
                  

                  ... but this long predates the CR job, and seems related to the manner in which the original .vhd file was created on the host. It doesn't seem relevant, since three other VMs with no history of exceptions/errors in SMlog are showing the same unhealthy VDI behaviour, and two of those aren't even on the same host. One is on a separate pool.
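
                  (For what it's worth, I've been poking at that leftover .vhd with something like the following; the path and filename here are illustrative, not the real ones:)

                  # Check the leftover VHD for structural problems, then print its parent
                  vhd-util check -n /var/run/sr-mount/<SR-UUID>/<VDI-UUID>.vhd
                  vhd-util query -n /var/run/sr-mount/<SR-UUID>/<VDI-UUID>.vhd -p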

                  SMlog is thick and somewhat inscrutable to me. Is there a specific message I should be looking for?

                    olivierlambert (Vates 🪐 Co-Founder & CEO)

                    Can you grep for the word "exception"? (with -i to make sure you get them all)

                      joeymorin @olivierlambert

                      [09:24 farmer ~]# zcat /var/log/SMlog.{31..2}.gz | cat - /var/log/SMlog.1 /var/log/SMlog | grep -i "nov 12 21" | grep -i -e exception -e e.x.c.e.p.t.i.o.n
                      
                      Nov 12 21:12:51 farmer SMGC: [17592]          *  E X C E P T I O N  *
                      Nov 12 21:12:51 farmer SMGC: [17592] coalesce: EXCEPTION <class 'util.CommandException'>, Invalid argument
                      Nov 12 21:12:51 farmer SMGC: [17592]     raise CommandException(rc, str(cmdlist), stderr.strip())
                      Nov 12 21:16:52 farmer SMGC: [17592]          *  E X C E P T I O N  *
                      Nov 12 21:16:52 farmer SMGC: [17592] leaf-coalesce: EXCEPTION <class 'util.SMException'>, VHD *6c411334(8.002G/468.930M) corrupted
                      Nov 12 21:16:52 farmer SMGC: [17592]     raise util.SMException("VHD %s corrupted" % self)
                      Nov 12 21:16:54 farmer SMGC: [17592]          *  E X C E P T I O N  *
                      Nov 12 21:16:54 farmer SMGC: [17592] coalesce: EXCEPTION <class 'util.SMException'>, VHD *6c411334(8.002G/468.930M) corrupted
                      Nov 12 21:16:54 farmer SMGC: [17592]     raise util.SMException("VHD %s corrupted" % self)
                      

                      None of these are relevant to the CR job. The one at 21:12:51 local time is related to the 'resize not supported' issue I mentioned above. The two at 21:16:52 and 21:16:54 are related to a fifth VM not in the CR job (the test VM I don't care about but may continue to investigate).

                      SMlog on the other two hosts is clean.

                        acebmxer @joeymorin

                        @joeymorin

                          If it's any help, check out my post: https://xcp-ng.org/forum/topic/11525/unhealthy-vdis/4
