@olivierlambert In 8.2, yes, the linstor sm version is separate; that's not the case in 8.3 anymore.

Posts
-
RE: Unable to add new node to pool using XOSTOR
-
RE: SR Garbage Collection running permanently
@Razor_648 While I was writing my previous message, I was reminded that there are also issues with LVHDoISCSI SR and CBT; you should disable CBT on your backup job and on all VDIs on the SR (a sketch for the SR part is below). It might help with the issue.
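If it helps, a rough sketch of the VDI part from the host CLI (the backup job part is done in XO itself), assuming you substitute your SR UUID:
for VDI in $(xe vdi-list sr-uuid=<SR UUID> --minimal | tr ',' ' '); do xe vdi-disable-cbt uuid=$VDI; done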
-
RE: SR Garbage Collection running permanently
@Razor_648 Hi,
The log you showed only means that it couldn't compare two VDIs using their CBT.
It sometimes happens that a CBT chain becomes disconnected. Disabling leaf-coalesce means it won't run on leaves, so VHD chains will always be at least 2 levels deep.
You migrated 200 VMs; every disk of those VMs had a snapshot taken that then needs to be coalesced, so it can take a while.
Your backup also takes a snapshot each time it runs, which then needs to be coalesced. There is a GC in both XCP-ng 8.2 and 8.3.
The GC runs independently of auto-scan; if you really want to disable it, you can do it temporarily using /opt/xensource/sm/cleanup.py -x -u <SR UUID>
It will stop the GC until you press Enter. I guess you could run it in a tmux session to keep it stopped until the next reboot, for example as below. But it would be better to find the problem, or, if there really is no problem, to let the GC work until it's finished.
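Something like this, as a sketch (the session name is arbitrary, and killing the session releases the pause):
tmux new-session -d -s pause-gc '/opt/xensource/sm/cleanup.py -x -u <SR UUID>'
tmux kill-session -t pause-gc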
It's a bit weird to need 15 minutes to take a snapshot though; that would point to a problem.
Do you have any other errors than the CBT one in your SMlog? -
RE: XCP-ng 8.3 updates announcements and testing
@bufanda You just need to make sure to have sm and blktap QCOW2 versions.
Otherwise, having a normal sm version would drop the QCOW2 VDIs from the XAPI database and you would lose the VBDs to the VMs as well as the names of the VDIs.
So it could be painful depending on how many you have.
But in the case where you would install a non-QCOW2 sm version, you would only lose the QCOW2 VDIs from the DB; they would not be deleted or anything. Reinstalling a QCOW2 version and then rescanning the SR would make them re-appear. But then you would have to identify them again (lost name-label) and relink them to their VMs.
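To check which build you currently have installed, the version strings are enough (the QCOW2 builds carry a qcow2 tag):
rpm -q sm blktap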
We try to keep our QCOW2 version on top of the testing branch of XCP-ng, but we could miss one. -
RE: XCP-ng 8.3 updates announcements and testing
@bufanda Hello,
There are equivalent sm packages in the QCOW2 repo for testing; XAPI will be coming soon.
You can update while enabling the QCOW2 repo to get the sm and blktap QCOW2 versions, and get the XAPI version later if you want. -
RE: XOSTOR hyperconvergence preview
@JeffBerntsen That's what I meant: the installation method written in the first post still works in 8.3, and the script still works as expected too; it basically only creates the VG/LV needed on the hosts before you create the SR. You can double-check what it prepared with the sketch below.
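A quick way to verify on a host before creating the SR, assuming the default group name from the first post (linstor_group; adjust if you used another one):
vgs linstor_group
lvs linstor_group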
-
RE: XOSTOR hyperconvergence preview
@gb.123 Hello,
The instructions in the first post are still the way to go. -
RE: VDI Chain on Deltas
@nvoss said in VDI Chain on Deltas:
What would make the force restart work when the scheduled regular runs dont?
I'm not sure what you mean.
The backup needs to take a snapshot to have a point of comparison before exporting data.
This snapshot creates a new level of VHD that will need to be coalesced, but the number of VHDs allowed in the chain is limited, so the backup fails.
This is caused by the fact that the garbage collector can't run, because it can't edit the corrupted VDI.
Since there is a corrupted VDI, it doesn't run, to avoid creating more problems on the VDI chains.
Sometimes corruption means that we don't know whether a VHD has a parent, for example, and in that case we can't know what the chain looks like, i.e. which VHDs are in which chain in the SR (Storage Repository).
VDI: Virtual Disk Image in this context.
VHD: the format of VDI we use at the moment in XCP-ng.
After removing the corrupted VDI, maybe automatically by the migration process (maybe you'll have to do it by hand), you can run an sr-scan on the SR and it will launch the GC again, as shown below.
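A minimal example of that rescan, assuming you substitute your SR UUID:
xe sr-scan uuid=<SR UUID>
-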
RE: VDI Chain on Deltas
@nvoss No, the GC is blocked because only one VDI is corrupted: the one from the check.
All other VDIs are on a long chain because they couldn't coalesce.
Sorry, BATMAP is the block allocation table; it's the information in the VHD that says which blocks exist locally.
Migrating the VDI might work indeed, but I can't really be sure. -
RE: VDI Chain on Deltas
@nvoss The VHD is reported corrupted on the BATMAP. You can try to repair it with vhd-util repair, but it'll likely not work.
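For reference, a repair attempt would look something like this (adapt the path to your SR mount point and VDI UUID):
vhd-util repair -n /var/run/sr-mount/<SR UUID>/<VDI UUID>.vhd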
I have seen people recover from this kind of error by doing a vdi-copy.
You could try a VM copy, or a VDI copy and then linking the new VDI to the VM again, and see if it's alright.
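A hedged sketch of the VDI copy route (fill in the UUIDs yourself; the new VDI then has to be attached in place of the original one):
xe vdi-copy uuid=<corrupted VDI UUID> sr-uuid=<destination SR UUID>
xe vbd-create vm-uuid=<VM UUID> vdi-uuid=<new VDI UUID> device=<device position of the original disk>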
The corrupted VDI is blocking the garbage collector, so the chains are long, and that's the error you see on the XO side.
It might be necessary to remove the chain by hand to resolve the issue. -
RE: VDI Chain on Deltas
@nvoss Could you try to run
vhd-util check -n /var/run/sr-mount/f23aacc2-d566-7dc6-c9b0-bc56c749e056/3a3e915f-c903-4434-a2f0-cfc89bbe96bf.vhd
? -
RE: VDI Chain on Deltas
@nvoss Hello, the UNDO LEAF-COALESCE usually has a cause that is listed in the error above it in SMlog. Could you share this part please?
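One way to find that part, assuming the message is still in the current SMlog (it rotates):
grep -B 40 "UNDO LEAF-COALESCE" /var/log/SMlog
-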
RE: LargeBlockSR for 4KiB blocksize disks
@yllar Maybe it was indeed because the loop device was not completely created yet.
No error for this GC run. Everything should be OK then.
-
RE: LargeBlockSR for 4KiB blocksize disks
Sorry, I missed the first ping.
May 2 08:31:40 a1 SM: [18985] ['/sbin/vgs', '--readonly', 'VG_XenStorage-07ab18c4-a76f-d1fc-4374-babfe21fd679']
May 2 08:32:24 a1 SM: [18985] pread SUCCESS
May 2 08:32:24 a1 SM: [18985] ***** Long LVM call of 'vgs' took 43.6255850792
That would explain why it took a long time to create: 43 seconds for a call to vgs.
Can you try to do a vgs call yourself on your host? Does it take a long time? (See the example call below.)
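For example, timing the exact call from your log:
time vgs --readonly VG_XenStorage-07ab18c4-a76f-d1fc-4374-babfe21fd679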
This exception is "normal":
May 2 08:32:25 a1 SMGC: [19336] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
May 2 08:32:25 a1 SMGC: [19336] ***********************
May 2 08:32:25 a1 SMGC: [19336] * E X C E P T I O N *
May 2 08:32:25 a1 SMGC: [19336] ***********************
May 2 08:32:25 a1 SMGC: [19336] gc: EXCEPTION <class 'util.SMException'>, SR 42535e39-4c98-22c6-71eb-303caa3fc97b not attached on this host
May 2 08:32:25 a1 SMGC: [19336] File "/opt/xensource/sm/cleanup.py", line 3388, in gc
May 2 08:32:25 a1 SMGC: [19336] _gc(None, srUuid, dryRun)
May 2 08:32:25 a1 SMGC: [19336] File "/opt/xensource/sm/cleanup.py", line 3267, in _gc
May 2 08:32:25 a1 SMGC: [19336] sr = SR.getInstance(srUuid, session)
May 2 08:32:25 a1 SMGC: [19336] File "/opt/xensource/sm/cleanup.py", line 1552, in getInstance
May 2 08:32:25 a1 SMGC: [19336] return FileSR(uuid, xapi, createLock, force)
May 2 08:32:25 a1 SMGC: [19336] File "/opt/xensource/sm/cleanup.py", line 2334, in __init__
May 2 08:32:25 a1 SMGC: [19336] SR.__init__(self, uuid, xapi, createLock, force)
May 2 08:32:25 a1 SMGC: [19336] File "/opt/xensource/sm/cleanup.py", line 1582, in __init__
May 2 08:32:25 a1 SMGC: [19336] raise util.SMException("SR %s not attached on this host" % uuid)
May 2 08:32:25 a1 SMGC: [19336]
May 2 08:32:25 a1 SMGC: [19336] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
May 2 08:32:25 a1 SMGC: [19336] * * * * * SR 42535e39-4c98-22c6-71eb-303caa3fc97b: ERROR
May 2 08:32:25 a1 SMGC: [19336]
It's the garbage collector trying to run on the SR while it is still in the process of attaching.
It's weird though, because it's the call to sr_attach that launched the GC.
Does the GC run normally on this SR on the next attempts? Otherwise, I don't see anything worrying in the logs you shared.
It should be safe to use. -
RE: Matching volume/resource/lvm on disk to VDI/VHD?
@cmd Hello,
It's described here in the documentation https://docs.xcp-ng.org/xostor/#map-linstor-resource-names-to-xapi-vdi-uuids
It might be possible to add a parameter in the sm-config of the VDI to ease this link; I'll put a card in our backlog to see if it's doable. -
RE: non-zero exit, , File "/opt/xensource/sm/EXTSR", line 78 except util.CommandException, inst: ^ SyntaxError: invalid syntax
@FMOTrust Hello,
Good news you found the problem.
Yes, in XCP-ng 8.3 python should point to a 2.7.5 version while python3 will point to 3.6.8 at the moment.
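A quick way to check what each name points to on your host:
python --version
python3 --version
readlink -f /usr/bin/python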
I imagine you are on 8.2.1 though, since the SMAPI runs on Python 3 in 8.3, while on 8.2.1 it is Python 2 only and so expects python to point to the 2.7.5 version. -
RE: non-zero exit, , File "/opt/xensource/sm/EXTSR", line 78 except util.CommandException, inst: ^ SyntaxError: invalid syntax
@FMOTrust Hello,
Could you give us the output of yum info sm, please? -
RE: XCP-ng 8.3 updates announcements and testing
For people testing the QCOW2 preview, please be informed that you need to update with the QCOW2 repo enabled. If you install the new non-QCOW2 version, you risk QCOW2 VDIs being dropped from the XAPI database until you have reinstalled a QCOW2 version and re-scanned the SR.
Being dropped from XAPI means losing the name-label, the description and, worse, the links to a VM for these VDIs.
There should be blktap, sm and sm-fairlock updates of the same version as above in the QCOW2 repo.
If you have correctly added the QCOW2 repo linked here: https://xcp-ng.org/forum/post/90287
You can update like this:
yum clean metadata --enablerepo=xcp-ng-testing,xcp-ng-qcow2
yum update --enablerepo=xcp-ng-testing,xcp-ng-qcow2
reboot
Versions:
blktap: 3.55.4-1.1.0.qcow2.1.xcpng8.3
sm: 3.2.12-3.1.0.qcow2.1.xcpng8.3
-
RE: Issue with SR and coalesce
Hi, this XAPI plugin multi is called on another host but is failing with an IOError.
It's doing a few things on a host related to LVM handling.
It's failing on one of them; you should look at the host having the error to get the full error in its SMlog, for example as below.
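One way to locate it, assuming the failure left an IOError trace in that host's SMlog around the time of the call:
grep -B 5 -A 20 "IOError" /var/log/SMlog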
The plugin itself is located in /etc/xapi.d/plugins/on-slave; it's the function named multi.