VDI_IO_ERROR(Device I/O errors) when you run scheduled backup
-
Check the usual logs (SMlog and
dmesg
) to see if we can spot the root cause. -
@olivierlambert Nothing jumps out from dmesg and nothing useful in SMlog when running the CR, however do get the following exceptions in SMlog upon boot (same error for each disk array):
SMGC: [24044] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~* SMGC: [24044] *********************** SMGC: [24044] * E X C E P T I O N * SMGC: [24044] *********************** SMGC: [24044] gc: EXCEPTION <class 'util.SMException'>, SR 7e772759-f44e-af14-eb45-e839cc67689c not attached on this host SMGC: [24044] File "/opt/xensource/sm/cleanup.py", line 3354, in gc SMGC: [24044] _gc(None, srUuid, dryRun) SMGC: [24044] File "/opt/xensource/sm/cleanup.py", line 3233, in _gc SMGC: [24044] sr = SR.getInstance(srUuid, session) SMGC: [24044] File "/opt/xensource/sm/cleanup.py", line 1552, in getInstance SMGC: [24044] return FileSR(uuid, xapi, createLock, force) SMGC: [24044] File "/opt/xensource/sm/cleanup.py", line 2330, in __init__ SMGC: [24044] SR.__init__(self, uuid, xapi, createLock, force) SMGC: [24044] File "/opt/xensource/sm/cleanup.py", line 1582, in __init__ SMGC: [24044] raise util.SMException("SR %s not attached on this host" % uuid) SMGC: [24044] SMGC: [24044] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~* SMGC: [24044] * * * * * SR 7e772759-f44e-af14-eb45-e839cc67689c: ERROR SMGC: [24044] SM: [24066] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr SM: [24066] sr_update {'sr_uuid': '7e772759-f44e-af14-eb45-e839cc67689c', 'subtask_of': 'DummyRef:|ed453e0f-248d-403f-9499-ee7255fdf429|SR.stat', 'args': [], 'host_ref': 'OpaqueRef:c1b3f3e8-579b-4b35-9bb6-dcad830583c3', 'session_ref': 'OpaqueRef:dafc19b5-65af-4501-a151-1999d8e7f550', 'device_config': {'device': '/dev/disk/by-id/scsi-2ad3a730200d00000-part1', 'SRmaster': 'true'}, 'command': 'sr_update', 'sr_ref': 'OpaqueRef:bfd47d28-1be1-4dd6-8a7b-2c0a986b8d47', 'local_cache_sr': '34bd3a97-3562-75af-a24e-a266caf368e3'} SM: [24088] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr SM: [24088] lock: acquired /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr SM: [24088] sr_scan {'sr_uuid': '7e772759-f44e-af14-eb45-e839cc67689c', 'subtask_of': 'DummyRef:|017c9f00-7289-4983-a942-f91c830fed33|SR.scan', 'args': [], 'host_ref': 'OpaqueRef:c1b3f3e8-579b-4b35-9bb6-dcad830583c3', 'session_ref': 'OpaqueRef:a907696c-b7cf-4ac8-b772-b22833073585', 'device_config': {'device': '/dev/disk/by-id/scsi-2ad3a730200d00000-part1', 'SRmaster': 'true'}, 'command': 'sr_scan', 'sr_ref': 'OpaqueRef:bfd47d28-1be1-4dd6-8a7b-2c0a986b8d47', 'local_cache_sr': '34bd3a97-3562-75af-a24e-a266caf368e3'} SM: [24088] ['/usr/bin/vhd-util', 'scan', '-f', '-m', '/var/run/sr-mount/7e772759-f44e-af14-eb45-e839cc67689c/*.vhd'] SM: [24088] pread SUCCESS SM: [24088] ['ls', '/var/run/sr-mount/7e772759-f44e-af14-eb45-e839cc67689c', '-1', '--color=never'] SM: [24088] pread SUCCESS SM: [24088] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/running SM: [24088] lock: tried lock /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/running, acquired: True (exists: True) SM: [24088] lock: released /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/running SM: [24088] Kicking GC SMGC: [24088] === SR 7e772759-f44e-af14-eb45-e839cc67689c: gc === SMGC: [24104] Will finish as PID [24105] SM: [24105] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/running SMGC: [24088] New PID [24104] SM: [24105] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/gc_active SM: [24088] lock: released /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr SM: [24105] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr SMGC: [24105] Found 0 cache files SM: [24105] lock: tried lock /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/gc_active, acquired: True (exists: True) SM: [24105] lock: tried lock /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr, acquired: True (exists: True) SM: [24105] ['/usr/bin/vhd-util', 'scan', '-f', '-m', '/var/run/sr-mount/7e772759-f44e-af14-eb45-e839cc67689c/*.vhd'] SM: [24105] pread SUCCESS
Subsequent to above, found that get identical error when reattaching SR (after detaching it).
xapi-explore-sr does not show anything interesting
Thanks chap; v kind of you
-
@shorian said in VDI_IO_ERROR(Device I/O errors) when you run scheduled backup:
SR 7e772759-f44e-af14-eb45-e839cc67689c not attached on this host
Weird. Is it a shared SR? If it's the case, the
master
node need to have access to the SR (because coalesce mechanism works on the master). -
@olivierlambert Nope, not shared. Standalone box with two local disk arrays, both reporting the same error on boot / reattaching, yet seem performant for normal operations. We only get the exception error in SMlog when booting and frequent (but not always) VDI_IO_ERRORs when taking backups. VMs themselves seem to be fine. Quite strange
Both SRs are ext4 in RAID1, one array of 2xSSD and one of 2xSATA.
Should I try reinstalling the host?
-
That's weird. Can you double check there's no "ghost" host that have been there bore? Eg with
xe host-list
.Also a
xe sr-param-list uuid=7e772759-f44e-af14-eb45-e839cc67689c
to see if we can spot anything. -
@olivierlambert Afraid nothing interesting:
# xe host-list uuid ( RO) : 3d60f1d4-595e-4428-9c24-6409db9593bc name-label ( RW): xen11 name-description ( RW): ABC
# xe sr-param-list uuid=7e772759-f44e-af14-eb45-e839cc67689c uuid ( RO) : 7e772759-f44e-af14-eb45-e839cc67689c name-label ( RW): Xen11 SSD name-description ( RW): Xen11 SSD host ( RO): xen11 allowed-operations (SRO): VDI.enable_cbt; VDI.list_changed_blocks; unplug; plug; PBD.create; VDI.disable_cbt; update; PBD.destroy; VDI.resize; VDI.clone; VDI.data_destroy; scan; VDI.snapshot; VDI.mirror; VDI.create; VDI.destroy; VDI.set_on_boot current-operations (SRO): VDIs (SRO): PBDs (SRO): 9e967e89-7263-40b5-5706-cd2d449b7192 virtual-allocation ( RO): 0 physical-utilisation ( RO): 75665408 physical-size ( RO): 491294261248 type ( RO): ext content-type ( RO): user shared ( RW): false introduced-by ( RO): <not in database> is-tools-sr ( RO): false other-config (MRW): sm-config (MRO): devserial: scsi-2ad3a730200d00000 blobs ( RO): local-cache-enabled ( RO): false tags (SRW): clustered ( RO): false
-
What about
xe pbd-param-list uuid=9e967e89-7263-40b5-5706-cd2d449b7192
? -
# xe pbd-param-list uuid=9e967e89-7263-40b5-5706-cd2d449b7192 uuid ( RO) : 9e967e89-7263-40b5-5706-cd2d449b7192 host ( RO) [DEPRECATED]: 3d60f1d4-595e-4428-9c24-6409db9593bc host-uuid ( RO): 3d60f1d4-595e-4428-9c24-6409db9593bc host-name-label ( RO): xen11 sr-uuid ( RO): 7e772759-f44e-af14-eb45-e839cc67689c sr-name-label ( RO): Xen11 SSD device-config (MRO): device: /dev/disk/by-id/scsi-2ad3a730200d00000-part1 currently-attached ( RO): true other-config (MRW): storage_driver_domain: OpaqueRef:ce759085-6e0d-4484-9f8a-abf75b822f75
-
Indeed, nothing special here.
-
I'll try a fresh install over the next couple of days and see if it reoccurs. Looking at the other boxes, I have the same error on one of the other hosts, but it's not across all of them despite identical installs and hardware.
Thanks for your efforts and help, shame there wasn't an easy answer but let's see if it reoccurs after a completely fresh install.
-
Please keep us posted
-
Have just run a clean install - identical error in SMlog
Following was the last message in dmesg which seems counter to the SMlog error
EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
Can send full SMlog and anything else you can think of - but it's either a hardware error (in which case why identical error accross two hosts?) or an incorrect setting in the install process. I'm at a loss
-
Have tried reformatting disks after changing the boot record - previously was GPT with ext4, have tried using MBR instead but no difference, even after another fresh install.
I'm going through a comparison with another machine that is identical kit but doesn't have those errors, but can't see any differences whatsoever
-
Do you have a 4K sector disk?
-
The (large) SATA drives are, but the SSDs are not; problem is reported for both.
# parted /dev/sdb print Model: AVAGO MR9363-4i (scsi) Disk /dev/sdb: 6001GB Sector size (logical/physical): 512B/4096B Partition Table: gpt Disk Flags: Number Start End Size File system Name Flags 5 1049kB 4296MB 4295MB ext3 2 4296MB 23.6GB 19.3GB 1 23.6GB 43.0GB 19.3GB ext3 4 43.0GB 43.5GB 537MB fat16 boot 6 43.5GB 44.6GB 1074MB linux-swap(v1) 3 44.6GB 6001GB 5956GB lvm # parted /dev/sda print Model: AVAGO MR9363-4i (scsi) Disk /dev/sda: 500GB Sector size (logical/physical): 512B/512B Partition Table: gpt Disk Flags: Number Start End Size File system Name Flags 1 1049kB 500GB 500GB Linux filesystem
Also contrary to my thoughts above - the other machines do have the same errors in SMlog, only they've not been rebooted for many moons so it was in the archived off logs rather than current. So the problem is consistent across the entire estate.
-
Before refocusing on the original issue, the SMlog error you have on reboot/boot, is it preventing you to use the SR?
-
No, everything seems to work just fine, except for the failure errors when running CR backups, and the messages in the logs when booting / attaching. The VMs themselves are performant, can be migrated on and off the host, and are handling production loads. If it wasn't for the CR errors, I would never have noticed them.
Meanwhile I've dug through the old Xen bug lists and found some similar SR issues but they seem to result in failure to mount but nothing that matches our specific situation.
-
Hmm that's weird. I don't know what's causing this. Can you try with a local LVM SR and see if you have the exact same issue? (trying to see if it's related to the storage driver)
-
Fresh install, LVM on boot device, same error, created second disk as LVM through XO, same error, rebooted, same pair of errors.
Error appears to be agnostic of storage type.
Found a host with same drives, same CPU, same config, different RAID card (Adaptec instead of LSI), same error.
Error appears to be agnostic of storage manufacturer.
It seems to be the gc that causes it - elsewhere in the logs gc takes a lock, checks, decides there's nothing to cleanup, releases lock. Just when it boots, or when it is attached, its as though there is a slight delay where the storage reports as not being attached, fires the exception into SMlog, then gc tries again and connects just fine and gets the lock. We all continue as before, no other issues, no other errors, until either a reboot or a storage re-attach via CLI or XO. The EXCEPTION error does not occur during normal usage, nor during a backup.
Looking into xensource.log, we do get some errors as per below:
xapi: [ warn||1568 INET :::80|event.from D:87552104bbaf|xapi_message] get_since_for_events: no in_memory_cache! xapi: [debug||1569 INET :::80|host.call_plugin R:91601df4a1d6|audit] Host.call_plugin host = 'c7634b1b-79b1-4eb2-840b-ecc401b7743d (xcp-ng-xen11)'; plugin = 'updater.py'; fn = 'check_update' args = [ 'hidden' ] xapi: [ warn||1568 INET :::80|event.from D:87552104bbaf|xapi_message] get_since_for_events: no in_memory_cache! xapi: [ warn||1570 INET :::80|event.from D:b4ea13ab245c|xapi_message] get_since_for_events: no in_memory_cache! xapi: [ warn||1570 INET :::80|event.from D:b4ea13ab245c|xapi_message] get_since_for_events: no in_memory_cache! xapi: [ warn||1571 INET :::80|event.from D:8b9b61c7b0b2|xapi_message] get_since_for_events: no in_memory_cache! xapi: [error||1572 INET :::80||backtrace] host.get_by_uuid D:e4eb5b632522 failed with exception Db_exn.Read_missing_uuid("host", "", "df8fe71d-f748-4df4-bd53-97d8232249e4") xapi: [error||1572 INET :::80||backtrace] Raised Db_exn.Read_missing_uuid("host", "", "df8fe71d-f748-4df4-bd53-97d8232249e4") xapi: [error||1572 INET :::80||backtrace] 1/9 xapi Raised at file ocaml/database/db_cache_impl.ml, line 232 xapi: [error||1572 INET :::80||backtrace] 2/9 xapi Called from file ocaml/xapi/db_actions.ml, line 8179 xapi: [error||1572 INET :::80||backtrace] 3/9 xapi Called from file ocaml/xapi/rbac.ml, line 223 xapi: [error||1572 INET :::80||backtrace] 4/9 xapi Called from file ocaml/xapi/rbac.ml, line 231 xapi: [error||1572 INET :::80||backtrace] 5/9 xapi Called from file ocaml/xapi/server_helpers.ml, line 100 xapi: [error||1572 INET :::80||backtrace] 6/9 xapi Called from file ocaml/xapi/server_helpers.ml, line 121 xapi: [error||1572 INET :::80||backtrace] 7/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24 xapi: [error||1572 INET :::80||backtrace] 8/9 xapi Called from file map.ml, line 135 xapi: [error||1572 INET :::80||backtrace] 9/9 xapi Called from file sexp_conv.ml, line 147 xapi: [error||1572 INET :::80||backtrace] xapi: [ warn||1571 INET :::80|event.from D:8b9b61c7b0b2|xapi_message] get_since_for_events: no in_memory_cache!
However, given this only seems to occur at the point of boot or attaching the SR, I'm not convinced that this error is related to the VDI_IO_ERROR that we're seeing when running CR backups, whereupon xensource has the following:
xapi: [debug||257548 INET :::80|[XO] VDI Export R:2a08e12cc000|taskhelper] the status of R:2a08e12cc000 is failure; cannot set it to failure xapi: [error||257548 INET :::80||backtrace] [XO] VDI Export R:2a08e12cc000 failed with exception Server_error(VDI_IO_ERROR, [ Device I/O errors ]) xapi: [error||257548 INET :::80||backtrace] Raised Server_error(VDI_IO_ERROR, [ Device I/O errors ]) xapi: [error||257548 INET :::80||backtrace] 1/13 xapi Raised at file ocaml/xapi/vhd_tool_wrapper.ml, line 75 xapi: [error||257548 INET :::80||backtrace] 2/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24 xapi: [error||257548 INET :::80||backtrace] 3/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35 xapi: [error||257548 INET :::80||backtrace] 4/13 xapi Called from file ocaml/xapi/export_raw_vdi.ml, line 49 xapi: [error||257548 INET :::80||backtrace] 5/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24 xapi: [error||257548 INET :::80||backtrace] 6/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35 xapi: [error||257548 INET :::80||backtrace] 7/13 xapi Called from file ocaml/xapi/export_raw_vdi.ml, line 86 xapi: [error||257548 INET :::80||backtrace] 8/13 xapi Called from file ocaml/xapi/export_raw_vdi.ml, line 110 xapi: [error||257548 INET :::80||backtrace] 9/13 xapi Called from file ocaml/xapi/server_helpers.ml, line 100 xapi: [error||257548 INET :::80||backtrace] 10/13 xapi Called from file ocaml/xapi/server_helpers.ml, line 121 xapi: [error||257548 INET :::80||backtrace] 11/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24 xapi: [error||257548 INET :::80||backtrace] 12/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35 xapi: [error||257548 INET :::80||backtrace] 13/13 xapi Called from file lib/backtrace.ml, line 177 xapi: [error||257548 INET :::80||backtrace] xapi: [debug||258603 UNIX /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.logout D:f14c1d127f16 created by task D:4493412bc527 xapi: [ info||258603 UNIX /var/lib/xcp/xapi|session.logout D:4a1ab726bfb5|xapi_session] Session.destroy trackid=fc26bb0306ef13c29fa128ddb1338ca7 xapi: [error||257548 INET :::80||backtrace] VDI.export_raw_vdi D:4493412bc527 failed with exception Server_error(VDI_IO_ERROR, [ Device I/O errors ]) xapi: [error||257548 INET :::80||backtrace] Raised Server_error(VDI_IO_ERROR, [ Device I/O errors ]) xapi: [error||257548 INET :::80||backtrace] 1/13 xapi Raised at file lib/debug.ml, line 218 xapi: [error||257548 INET :::80||backtrace] 2/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24 xapi: [error||257548 INET :::80||backtrace] 3/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35 xapi: [error||257548 INET :::80||backtrace] 4/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24 xapi: [error||257548 INET :::80||backtrace] 5/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35 xapi: [error||257548 INET :::80||backtrace] 6/13 xapi Called from file ocaml/xapi/xapi_http.ml, line 193 xapi: [error||257548 INET :::80||backtrace] 7/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24 xapi: [error||257548 INET :::80||backtrace] 8/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35 xapi: [error||257548 INET :::80||backtrace] 9/13 xapi Called from file ocaml/xapi/server_helpers.ml, line 100 xapi: [error||257548 INET :::80||backtrace] 10/13 xapi Called from file ocaml/xapi/server_helpers.ml, line 121 xapi: [error||257548 INET :::80||backtrace] 11/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24 xapi: [error||257548 INET :::80||backtrace] 12/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35 xapi: [error||257548 INET :::80||backtrace] 13/13 xapi Called from file lib/backtrace.ml, line 177 xapi: [error||257548 INET :::80||backtrace] xapi: [error||257548 INET :::80||backtrace] VDI.export_raw_vdi D:2f552cd49c3a failed with exception Server_error(VDI_IO_ERROR, [ Device I/O errors ]) xapi: [error||257548 INET :::80||backtrace] Raised Server_error(VDI_IO_ERROR, [ Device I/O errors ]) xapi: [error||257548 INET :::80||backtrace] 1/1 xapi Raised at file (Thread 257548 has no backtrace table. Was with_backtraces called?, line 0 xapi: [error||257548 INET :::80||backtrace] xapi: [error||257548 INET :::80||xapi_http] Unhandled Api_errors.Server_error(VDI_IO_ERROR, [ Device I/O errors ])
Two separate issues? The VDI_IO_ERROR is one that people would see through the GUI, so likely to be spotted, whereas the EXCEPTION in the SMlog is not going to be noticed unless you go digging - it seems to have no impact.
-
I suppose there's no timing correlation between both? Indeed, the error in xensource seems more meaningful, because it happens during the backup process.