VDI_IO_ERROR: Continuous Replication fails on a clean install.
-
Yes, I think I'm dumb: waiting for you to look at logs which I never provided.
I see nothing interesting except one repeated line: "Failed to lock /var/lock/sm/.nil/lvm on first attempt, blocked by PID 25021".
And the task duration before the failure is always 5 minutes. Some hardcoded timer, I think?
Start: Dec 16, 2022, 12:44:50 PM
End: Dec 16, 2022, 12:50:15 PM
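If it helps anyone reproducing this, a quick sketch of how to see what holds that lock while the message repeats (ps and the lock path are standard; the PID is the one from the log line above):

ps -fp 25021                # inspect the process blocking the SM lvm lock
ls -l /var/lock/sm/.nil/    # the lock files used by the SM backend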
part 1 https://pastebin.com/xZAXEiq1
part 2 https://pastebin.com/Lmhermgx -
-
@Tristis-Oris Hey man, were you able to solve this? I'm facing the same issue after a reinstall. Continuous Replication fails exactly at 5 minutes.
Thanks in advance -
@yomono Nope, still investigating. I'm getting SR_NOT_SUPPORTED in the log. -
-
Can you double-check you are using a recent commit on master? -
@olivierlambert In my case, I'm indeed using the latest commit. I played around with older commits yesterday (as old as two or three months) but got the same result. Right now, I'm using the latest (committed an hour ago).
I can share my SMlogs if you want, but I'm also getting the SR_NOT_SUPPORTED error. I tried backing up different VMs on different source servers, and to different destination servers. My next try will be reinstalling XO -
Indeed, try to wipe it entirely, and rebuild.
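A rough sketch of that wipe-and-rebuild from sources, following the usual XO-from-sources steps (paths are the common defaults, adjust to your setup):

# remove the old checkout entirely, then rebuild from master
rm -rf xen-orchestra
git clone -b master https://github.com/vatesfr/xen-orchestra
cd xen-orchestra
yarn && yarn build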
-
@olivierlambert That's an old problem) but yes, usually on the latest. Just repeated all the tests on 3c7d3.
- CR stopped working right after the clean 8.2.1 installation, on the same day.
- Replaced FC with iSCSI (only because I need iSCSI here), created a new LUN > same story.
- 8.2.0 clean install, no updates - CR works.
- Haven't tried regular backups to this SR, don't need them here.
- VM migration works on any setup.
The problem is only with one storage, a Dell EMC PowerVault ME4012. All the others (Huawei, iSCSI) work fine. But I'm not sure I have another clean 8.2.1 pool, maybe only some nodes.
Logs now. I'm doing 2 CR backups.
- Xen 8.2.1, clean host, no VMs at all. 60TB iSCSI LUN, but Xen shows only 50TB.
Jan 16 15:15:30 test SMGC: [10010] SR f3fd ('LUN') (2 VDIs in 2 VHD trees):
Jan 16 15:15:30 test SMGC: [10010] a459d14f[VHD](50.000G//50.105G|ao)
Jan 16 15:15:30 test SMGC: [10010] 1789f7a7[VHD](50.000G//50.105G|ao)
Jan 16 15:15:47 test SMGC: [10338] SR f3fd ('LUN') (1 VDIs in 1 VHD trees):
Jan 16 15:15:47 test SMGC: [10338] a459d14f[VHD](50.000G//50.105G|ao)
Jan 16 15:15:47 test SMGC: [10338]
Here it's 60TB.
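A quick way to compare what XAPI reports for the SR against the array size (the sr-uuid is a placeholder; physical-size and physical-utilisation are standard SR fields, reported in bytes):

xe sr-list uuid=<sr-uuid> params=name-label,physical-size,physical-utilisation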
2 VMs, 2 errors - SR_NOT_SUPPORTED:
Jan 16 15:10:43 test SM: [6596] result: {'params_nbd': 'nbd:unix:/run/blktap-control/nbd/f3fd46f7-5ce4-e5e0-53e9-059ce4775a7b/1789f7a7-05a0-411c-aa80-dcc659f8b45f', 'o_direct_reason': 'SR_NOT_SUPPORTED', 'params': '/dev/sm/backend/f3fd46f7-5ce4-e5e0-53e9-059ce4775a7b/1789f7a7-05a0-411c-aa80-dcc659f8b45f', 'o_direct': True, 'xenstore_data': {'scsi/0x12/0x80': 'AIAAEjE3ODlmN2E3LTA1YTAtNDEgIA==', 'scsi/0x12/0x83': 'AIMAMQIBAC1YRU5TUkMgIDE3ODlmN2E3LTA1YTAtNDExYy1hYTgwLWRjYzY1OWY4YjQ1ZiA=', 'vdi-uuid': '1789f7a7-05a0-411c-aa80-dcc659f8b45f', 'mem-pool': 'f3fd46f7-5ce4-e5e0-53e9-059ce4775a7b'}}
Jan 16 15:10:49 test SM: [6834] result: {'params_nbd': 'nbd:unix:/run/blktap-control/nbd/f3fd46f7-5ce4-e5e0-53e9-059ce4775a7b/a459d14f-ae92-4a77-8574-30442126624b', 'o_direct_reason': 'SR_NOT_SUPPORTED', 'params': '/dev/sm/backend/f3fd46f7-5ce4-e5e0-53e9-059ce4775a7b/a459d14f-ae92-4a77-8574-30442126624b', 'o_direct': True, 'xenstore_data': {'scsi/0x12/0x80': 'AIAAEmE0NTlkMTRmLWFlOTItNGEgIA==', 'scsi/0x12/0x83': 'AIMAMQIBAC1YRU5TUkMgIGE0NTlkMTRmLWFlOTItNGE3Ny04NTc0LTMwNDQyMTI2NjI0YiA=', 'vdi-uuid': 'a459d14f-ae92-4a77-8574-30442126624b', 'mem-pool': 'f3fd46f7-5ce4-e5e0-53e9-059ce4775a7b'}}
and some smaller ones like:
Jan 16 15:14:30 test SM: [9387] Failed to lock /var/lock/sm/.nil/lvm on first attempt, blocked by PID 9357
Jan 16 15:10:29 test SM: [6376] Failed to lock /var/lock/sm/.nil/lvm on first attempt, blocked by PID 6348
Jan 16 15:10:59 test SM: [7141] Failed to lock /var/lock/sm/.nil/lvm on first attempt, blocked by PID 7115
Jan 16 15:11:30 test SM: [7457] Failed to lock /var/lock/sm/.nil/lvm on first attempt, blocked by PID 7428
Jan 16 15:15:43 test SM: [10146] unlink of attach_info failed
Nothing else with error status in the log.
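For anyone replaying these pastebins, a simple filter over the default log location to double-check that (plain grep, nothing assumed beyond the standard SMlog path):

grep -E "SR_NOT_SUPPORTED|Failed to lock|error" /var/log/SMlog | tail -n 50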
-
Because of the weird size, I tried 8.2.1 with a 20TB iSCSI LUN.
Same result: SR_NOT_SUPPORTED. -
Now a clean 8.2.0 install, no updates at all.
It works.
Here both hosts are connected to the same SR.
- 8.2.0 fully updated > 8.2.1 release/yangtze/master/58.
CR still working.
4.1. Unmounted the LUN, mounted it again (a rough sketch of that via xe is below).
Still working.
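A minimal sketch of that remount done from the CLI, assuming you look up the PBD that links this host to the SR first (the uuid values are placeholders):

xe pbd-list sr-uuid=<sr-uuid> params=uuid    # find the PBD for this SR
xe pbd-unplug uuid=<pbd-uuid>                # detach the LUN from the host
xe pbd-plug uuid=<pbd-uuid>                  # reattach it
-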
I have no idea, sorry. So to recap:
- doesn't work on an 8.2.1 fresh install with updates
- works on older 8.2.0
- works on 8.2.0 updated to 8.2.1
It doesn't sound like an XO bug in your case.
-
@olivierlambert Yes. So what should I try next?)
-
Trying to figure out the setup so we can try to reproduce it, and also switching various things until there's a clear pattern.
E.g.: can you try with an NFS share to see if you have the same issue? If it's iSCSI-related, that would help us investigate.
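A minimal sketch of attaching a throwaway NFS SR from the host CLI, assuming a reachable export (server and path are placeholders; the device-config keys are the standard ones for the nfs driver):

xe sr-create type=nfs shared=true content-type=user name-label="NFS test" \
  device-config:server=<nfs-server-ip> device-config:serverpath=/export/test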
-
I can't, this is SAN-only storage.
Any point in testing on the 8.3 alpha? -
You can; it's still another test that might help us pinpoint something
-
@olivierlambert I would like to add that after this recap I realized... I also had to reinstall XCP-ng, so in my case it's also a fresh 8.2.1 install! At least, knowing that, I can do an 8.2.0 + upgrade installation (that's what I used to have). I can also try the 8.3 alpha; it's not like I have anything to lose at this point (that server only hosts XO, there is nothing else on it).
Anyway, the fresh 8.2.1 install is definitely the common point here -
Also with iSCSI storage, right?
-
@olivierlambert Not really. This time it's just local ext storage, SATA drives.
-
In LVM or thin? It might be 2 different problems, so I'm trying to sort this out.
-
@olivierlambert Both! I have both mixed across my servers and I tried both when I did the tests.
-
Just remembered I have one server with a fresh 8.2.1 and NFS backups to TrueNAS. It's working.
Will do the other tests tomorrow. -
@olivierlambert
sr_not_supported is not an error and not the reason. It comes from the default Dell multipath config, which is for the 3xxx series. It's also present on 8.2.0 where CR works, so it's just a warning.
Since we had no problems before, we never investigated this setting. My bad again, yay. Replaced it with the official config for the 4xxx series and the warning is gone. I see that in 8.3 the default is already more universal across generations:
device { vendor "DellEMC" product "ME4" path_grouping_policy "group_by_prio" path_checker "tur" hardware_handler "1 alua" prio "alua" failback immediate path_selector "service-time 0" }
Since there is no default config for Huawei, we have always used the official one:
device { vendor "HUAWEI" product "XSG1" path_grouping_policy multibus path_checker tur prio const path_selector "round-robin 0" failback immediate fast_io_fail_tmo 5 dev_loss_tmo 30 }
-
8.2.1:
CR not working:
- both Huawei and Dell iSCSI, multipath enabled
- both Huawei and Dell iSCSI, multipath disabled -
working:
- NFS VM disk
- local thin/ext
- local thick/LVM -
8.3:
working:
- both Huawei and Dell iSCSI, multipath enabled
- local thick/LVM
And now the interesting part: after I solved this false warning, detached the extra hosts from the pool, and detached all the additional links (trunk, backup) to cut down the traffic and the log itself - no SMlog at all is generated during the backup task.
MP enabled - with a 2nd link for backup: https://pastebin.com/URcnDckR
MP enabled - only the management link, no SMlog generated: https://pastebin.com/RHw40uzg -