XCP-ng

    VDI_IO_ERROR(Device I/O errors) when you run scheduled backup

    • S Offline
      shorian @shorian
      last edited by shorian

      Forgot to say - the duplicate referred to in the error is the “[Importing...BackupProcessName-timestamp]” halted VM that the backup process itself creates, i.e. the remnants I mentioned at the outset that need to be removed. They don't show in the GUI but can be seen via the CLI using:

      xe vm-list power-state=halted 
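
For anyone who wants to script that check, here is a minimal sketch in Python. It assumes the usual record-style output of `xe vm-list` (one `key ( RW): value` pair per line, records separated by blank lines). The function names and the sample text are made up for illustration; only the `[Importing` prefix comes from the name quoted above.

```python
def parse_xe_records(text):
    """Split record-style xe output into a list of dicts (one per record)."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        key_part, sep, value = line.partition(":")
        if not sep:
            continue
        # Drop the access marker such as "( RO)" or "( RW)" from the key.
        key = key_part.split("(")[0].strip()
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


def leftover_import_vms(text, prefix="[Importing"):
    """Return (uuid, name-label) pairs for VMs whose name starts with prefix."""
    return [
        (rec.get("uuid"), rec.get("name-label"))
        for rec in parse_xe_records(text)
        if rec.get("name-label", "").startswith(prefix)
    ]


# Hypothetical sample; real input would be the output of
# `xe vm-list power-state=halted` on the host.
SAMPLE = """\
uuid ( RO)           : 11111111-1111-1111-1111-111111111111
     name-label ( RW): [Importing...example-job-example-timestamp]
    power-state ( RO): halted

uuid ( RO)           : 22222222-2222-2222-2222-222222222222
     name-label ( RW): web01
    power-state ( RO): halted
"""

for uuid, name in leftover_import_vms(SAMPLE):
    print(uuid, name)
```

This only lists candidates; whether a given halted VM is really a leftover from a failed backup still needs a human check before anything is removed.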
      
      • S Offline
        shorian @shorian
        last edited by shorian

        Have just updated the hosts with the swathe of patches that came out this morning, and tested with a fresh build from master as of today.

        Unfortunately still getting these VDI_IO_ERROR errors when running CR, be it from a trial XOA install or from the latest build from master (rebuilt this morning); nothing particularly helpful jumping out of the logs:

        2021-01-23T15_43_13.502Z - backup NG.log.txt

        Question to the community - are other people seeing the same VDI_IO_ERROR error occasionally, or is this a unique case? Even if you can't provide access to your logs for privacy reasons, it would be interesting to know if others are seeing it. Of course, that relies upon others seeing this message.... 🙂

        Thanks!

        • olivierlambertO Offline
          olivierlambert Vates Co-Founder CEO
          last edited by

          Check the usual logs (SMlog and dmesg) to see if we can spot the root cause.

          • S Offline
            shorian @olivierlambert
            last edited by shorian

            @olivierlambert Nothing jumps out from dmesg, and there is nothing useful in SMlog when running the CR. However, I do get the following exceptions in SMlog upon boot (same error for each disk array):

             SMGC: [24044] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
             SMGC: [24044]          ***********************
             SMGC: [24044]          *  E X C E P T I O N  *
             SMGC: [24044]          ***********************
             SMGC: [24044] gc: EXCEPTION <class 'util.SMException'>, SR 7e772759-f44e-af14-eb45-e839cc67689c not attached on this host
             SMGC: [24044]   File "/opt/xensource/sm/cleanup.py", line 3354, in gc
             SMGC: [24044]     _gc(None, srUuid, dryRun)
             SMGC: [24044]   File "/opt/xensource/sm/cleanup.py", line 3233, in _gc
             SMGC: [24044]     sr = SR.getInstance(srUuid, session)
             SMGC: [24044]   File "/opt/xensource/sm/cleanup.py", line 1552, in getInstance
             SMGC: [24044]     return FileSR(uuid, xapi, createLock, force)
             SMGC: [24044]   File "/opt/xensource/sm/cleanup.py", line 2330, in __init__
             SMGC: [24044]     SR.__init__(self, uuid, xapi, createLock, force)
             SMGC: [24044]   File "/opt/xensource/sm/cleanup.py", line 1582, in __init__
             SMGC: [24044]     raise util.SMException("SR %s not attached on this host" % uuid)
             SMGC: [24044]
             SMGC: [24044] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
             SMGC: [24044] * * * * * SR 7e772759-f44e-af14-eb45-e839cc67689c: ERROR
             SMGC: [24044]
             SM: [24066] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr
             SM: [24066] sr_update {'sr_uuid': '7e772759-f44e-af14-eb45-e839cc67689c', 'subtask_of': 'DummyRef:|ed453e0f-248d-403f-9499-ee7255fdf429|SR.stat', 'args': [], 'host_ref': 'OpaqueRef:c1b3f3e8-579b-4b35-9bb6-dcad830583c3', 'session_ref': 'OpaqueRef:dafc19b5-65af-4501-a151-1999d8e7f550', 'device_config': {'device': '/dev/disk/by-id/scsi-2ad3a730200d00000-part1', 'SRmaster': 'true'}, 'command': 'sr_update', 'sr_ref': 'OpaqueRef:bfd47d28-1be1-4dd6-8a7b-2c0a986b8d47', 'local_cache_sr': '34bd3a97-3562-75af-a24e-a266caf368e3'}
             SM: [24088] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr
             SM: [24088] lock: acquired /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr
             SM: [24088] sr_scan {'sr_uuid': '7e772759-f44e-af14-eb45-e839cc67689c', 'subtask_of': 'DummyRef:|017c9f00-7289-4983-a942-f91c830fed33|SR.scan', 'args': [], 'host_ref': 'OpaqueRef:c1b3f3e8-579b-4b35-9bb6-dcad830583c3', 'session_ref': 'OpaqueRef:a907696c-b7cf-4ac8-b772-b22833073585', 'device_config': {'device': '/dev/disk/by-id/scsi-2ad3a730200d00000-part1', 'SRmaster': 'true'}, 'command': 'sr_scan', 'sr_ref': 'OpaqueRef:bfd47d28-1be1-4dd6-8a7b-2c0a986b8d47', 'local_cache_sr': '34bd3a97-3562-75af-a24e-a266caf368e3'}
             SM: [24088] ['/usr/bin/vhd-util', 'scan', '-f', '-m', '/var/run/sr-mount/7e772759-f44e-af14-eb45-e839cc67689c/*.vhd']
             SM: [24088]   pread SUCCESS
             SM: [24088] ['ls', '/var/run/sr-mount/7e772759-f44e-af14-eb45-e839cc67689c', '-1', '--color=never']
             SM: [24088]   pread SUCCESS
             SM: [24088] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/running
             SM: [24088] lock: tried lock /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/running, acquired: True (exists: True)
             SM: [24088] lock: released /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/running
             SM: [24088] Kicking GC
             SMGC: [24088] === SR 7e772759-f44e-af14-eb45-e839cc67689c: gc ===
             SMGC: [24104] Will finish as PID [24105]
             SM: [24105] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/running
             SMGC: [24088] New PID [24104]
             SM: [24105] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/gc_active
             SM: [24088] lock: released /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr
             SM: [24105] lock: opening lock file /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr
             SMGC: [24105] Found 0 cache files
             SM: [24105] lock: tried lock /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/gc_active, acquired: True (exists: True)
             SM: [24105] lock: tried lock /var/lock/sm/7e772759-f44e-af14-eb45-e839cc67689c/sr, acquired: True (exists: True)
             SM: [24105] ['/usr/bin/vhd-util', 'scan', '-f', '-m', '/var/run/sr-mount/7e772759-f44e-af14-eb45-e839cc67689c/*.vhd']
             SM: [24105]   pread SUCCESS
            

            Subsequently, I found that I get the identical error when reattaching the SR (after detaching it).
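
To pull just the garbage-collector exceptions out of a long SMlog, something like the following sketch might help. It is written against the line format shown in the excerpt above (`SMGC: [pid] gc: EXCEPTION <class '...'>, message`); the regular expression and function name are my own, and real log lines may carry a timestamp and hostname prefix, which `search` tolerates.

```python
import re

# Matches lines such as:
#   SMGC: [24044] gc: EXCEPTION <class 'util.SMException'>, SR ... not attached on this host
EXC_RE = re.compile(r"SMGC: \[(\d+)\] gc: EXCEPTION <class '([^']+)'>, (.*)$")


def find_gc_exceptions(lines):
    """Yield (pid, exception_class, message) for each SMGC exception line."""
    for line in lines:
        match = EXC_RE.search(line.rstrip("\n"))
        if match:
            yield int(match.group(1)), match.group(2), match.group(3).strip()


# Sample lines copied from the excerpt above.
SAMPLE = [
    " SMGC: [24044]          *  E X C E P T I O N  *",
    " SMGC: [24044] gc: EXCEPTION <class 'util.SMException'>, "
    "SR 7e772759-f44e-af14-eb45-e839cc67689c not attached on this host",
    " SM: [24088] Kicking GC",
]

for pid, cls, msg in find_gc_exceptions(SAMPLE):
    print(pid, cls, msg)
```

On a host, the same function could be fed an open file handle for `/var/log/SMlog` instead of the sample list.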

            xapi-explore-sr does not show anything interesting

            Thanks chap; v kind of you

            • olivierlambertO Offline
              olivierlambert Vates Co-Founder CEO
              last edited by

              @shorian said in VDI_IO_ERROR(Device I/O errors) when you run scheduled backup:

              SR 7e772759-f44e-af14-eb45-e839cc67689c not attached on this host

              Weird. Is it a shared SR? If so, the master node needs to have access to the SR (because the coalesce mechanism works on the master).

              • S Offline
                shorian @olivierlambert
                last edited by

                @olivierlambert Nope, not shared. Standalone box with two local disk arrays, both reporting the same error on boot / reattaching, yet both seem performant for normal operations. We only get the exception in SMlog when booting, and frequent (but not constant) VDI_IO_ERRORs when taking backups. VMs themselves seem to be fine. Quite strange.

                Both SRs are ext4 in RAID1, one array of 2xSSD and one of 2xSATA.

                Should I try reinstalling the host?

                • olivierlambertO Offline
                  olivierlambert Vates Co-Founder CEO
                  last edited by

                  That's weird. Can you double check there's no "ghost" host that has been there before? E.g. with xe host-list.

                  Also run xe sr-param-list uuid=7e772759-f44e-af14-eb45-e839cc67689c to see if we can spot anything.

                  • S Offline
                    shorian @olivierlambert
                    last edited by

                    @olivierlambert Afraid nothing interesting:

                    # xe host-list
                    uuid ( RO)                : 3d60f1d4-595e-4428-9c24-6409db9593bc
                              name-label ( RW): xen11
                        name-description ( RW): ABC
                    
                    # xe sr-param-list uuid=7e772759-f44e-af14-eb45-e839cc67689c
                    
                    uuid ( RO)                    : 7e772759-f44e-af14-eb45-e839cc67689c
                                  name-label ( RW): Xen11 SSD
                            name-description ( RW): Xen11 SSD
                                        host ( RO): xen11
                          allowed-operations (SRO): VDI.enable_cbt; VDI.list_changed_blocks; unplug; plug; PBD.create; VDI.disable_cbt; update; PBD.destroy; VDI.resize; VDI.clone; VDI.data_destroy; scan; VDI.snapshot; VDI.mirror; VDI.create; VDI.destroy; VDI.set_on_boot
                          current-operations (SRO):
                                        VDIs (SRO):
                                        PBDs (SRO): 9e967e89-7263-40b5-5706-cd2d449b7192
                          virtual-allocation ( RO): 0
                        physical-utilisation ( RO): 75665408
                               physical-size ( RO): 491294261248
                                        type ( RO): ext
                                content-type ( RO): user
                                      shared ( RW): false
                               introduced-by ( RO): <not in database>
                                 is-tools-sr ( RO): false
                                other-config (MRW):
                                   sm-config (MRO): devserial: scsi-2ad3a730200d00000
                                       blobs ( RO):
                         local-cache-enabled ( RO): false
                                        tags (SRW):
                                   clustered ( RO): false
                    
                    • olivierlambertO Offline
                      olivierlambert Vates Co-Founder CEO
                      last edited by

                      What about xe pbd-param-list uuid=9e967e89-7263-40b5-5706-cd2d449b7192?

                      • S Offline
                        shorian @olivierlambert
                        last edited by

                        😞

                        # xe pbd-param-list uuid=9e967e89-7263-40b5-5706-cd2d449b7192
                        uuid ( RO)                  : 9e967e89-7263-40b5-5706-cd2d449b7192
                             host ( RO) [DEPRECATED]: 3d60f1d4-595e-4428-9c24-6409db9593bc
                                     host-uuid ( RO): 3d60f1d4-595e-4428-9c24-6409db9593bc
                               host-name-label ( RO): xen11
                                       sr-uuid ( RO): 7e772759-f44e-af14-eb45-e839cc67689c
                                 sr-name-label ( RO): Xen11 SSD
                                 device-config (MRO): device: /dev/disk/by-id/scsi-2ad3a730200d00000-part1
                            currently-attached ( RO): true
                                  other-config (MRW): storage_driver_domain: OpaqueRef:ce759085-6e0d-4484-9f8a-abf75b822f75
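
The PBD above reports `currently-attached: true`, while the SMlog exception at boot said the SR was not attached; the two were not necessarily observed at the same moment. A small sketch for checking that field programmatically, assuming the `xe pbd-param-list` layout shown above (the regex and helper names are illustrative, not part of any XCP-ng tooling):

```python
import re

# key, access marker such as "( RO)" or "(MRW)", optional "[DEPRECATED]", then the value.
LINE_RE = re.compile(
    r"^\s*([\w-]+)\s*\(\s*[A-Z]+\)(?:\s*\[DEPRECATED\])?\s*:\s?(.*)$"
)


def parse_param_list(text):
    """Turn `xe ...-param-list` output for a single object into a dict."""
    params = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            params[match.group(1)] = match.group(2).strip()
    return params


def pbd_is_attached(text):
    """True if the PBD record says it is currently attached."""
    return parse_param_list(text).get("currently-attached") == "true"


# Output copied from the post above.
SAMPLE = """\
uuid ( RO)                  : 9e967e89-7263-40b5-5706-cd2d449b7192
     host ( RO) [DEPRECATED]: 3d60f1d4-595e-4428-9c24-6409db9593bc
             host-uuid ( RO): 3d60f1d4-595e-4428-9c24-6409db9593bc
       host-name-label ( RO): xen11
               sr-uuid ( RO): 7e772759-f44e-af14-eb45-e839cc67689c
         sr-name-label ( RO): Xen11 SSD
         device-config (MRO): device: /dev/disk/by-id/scsi-2ad3a730200d00000-part1
    currently-attached ( RO): true
          other-config (MRW): storage_driver_domain: OpaqueRef:ce759085-6e0d-4484-9f8a-abf75b822f75
"""

print(pbd_is_attached(SAMPLE))
```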
                        
                        • olivierlambertO Offline
                          olivierlambert Vates Co-Founder CEO
                          last edited by

                          Indeed, nothing special here.

                          • S Offline
                            shorian @olivierlambert
                            last edited by shorian

                            I'll try a fresh install over the next couple of days and see if it reoccurs. Looking at the other boxes, I have the same error on one of the other hosts, but it's not across all of them despite identical installs and hardware.

                            Thanks for your efforts and help, shame there wasn't an easy answer but let's see if it reoccurs after a completely fresh install.

                            • olivierlambertO Offline
                              olivierlambert Vates Co-Founder CEO
                              last edited by

                              Please keep us posted 🙂

                              • S Offline
                                shorian @olivierlambert
                                last edited by shorian

                                Have just run a clean install - identical error in SMlog 😞

                                Following was the last message in dmesg which seems counter to the SMlog error

                                EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
                                

                                Can send the full SMlog and anything else you can think of, but it's either a hardware error (in which case why the identical error across two hosts?) or an incorrect setting in the install process. I'm at a loss.

                                • S Offline
                                  shorian @shorian
                                  last edited by

                                  Have tried reformatting disks after changing the boot record - previously was GPT with ext4, have tried using MBR instead but no difference, even after another fresh install.

                                  I'm going through a comparison with another machine that is identical kit but doesn't have those errors, but can't see any differences whatsoever 😞

                                  • olivierlambertO Offline
                                    olivierlambert Vates Co-Founder CEO
                                    last edited by

                                    Do you have a 4K sector disk?

                                    • S Offline
                                      shorian @olivierlambert
                                      last edited by shorian

                                      The (large) SATA drives are, but the SSDs are not; problem is reported for both.

                                      # parted /dev/sdb print
                                      
                                      Model: AVAGO MR9363-4i (scsi)
                                      Disk /dev/sdb: 6001GB
                                      Sector size (logical/physical): 512B/4096B
                                      Partition Table: gpt
                                      Disk Flags:
                                      
                                      Number  Start   End     Size    File system     Name  Flags
                                       5      1049kB  4296MB  4295MB  ext3
                                       2      4296MB  23.6GB  19.3GB
                                       1      23.6GB  43.0GB  19.3GB  ext3
                                       4      43.0GB  43.5GB  537MB   fat16                 boot
                                       6      43.5GB  44.6GB  1074MB  linux-swap(v1)
                                       3      44.6GB  6001GB  5956GB                        lvm
                                      
                                      # parted /dev/sda print
                                      
                                      Model: AVAGO MR9363-4i (scsi)
                                      Disk /dev/sda: 500GB
                                      Sector size (logical/physical): 512B/512B
                                      Partition Table: gpt
                                      Disk Flags:
                                      
                                      Number  Start   End    Size   File system  Name              Flags
                                       1      1049kB  500GB  500GB               Linux filesystem
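
For comparing many hosts, the sector-size line from `parted ... print` can be classified automatically. A rough sketch, assuming the `Sector size (logical/physical): 512B/4096B` line format shown above; the labels 512n / 512e / 4Kn are the common shorthand for native 512-byte, 512-byte emulation on 4K physical sectors, and native 4K. Note that behind a RAID controller these values are whatever the controller reports for the virtual drive.

```python
import re

SECTOR_RE = re.compile(r"Sector size \(logical/physical\):\s*(\d+)B/(\d+)B")


def sector_sizes(parted_output):
    """Return (logical, physical) sector sizes in bytes, or None if not found."""
    match = SECTOR_RE.search(parted_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def classify(parted_output):
    """Label a disk as '512n', '512e', '4Kn', or 'unknown'."""
    sizes = sector_sizes(parted_output)
    if sizes is None:
        return "unknown"
    logical, physical = sizes
    if logical == 4096:
        return "4Kn"
    if physical == 4096:
        return "512e"
    if logical == 512 and physical == 512:
        return "512n"
    return "unknown"


# The two sector-size lines from the output above.
print("sdb:", classify("Sector size (logical/physical): 512B/4096B"))
print("sda:", classify("Sector size (logical/physical): 512B/512B"))
```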
                                      

                                      Also, contrary to my thoughts above, the other machines do have the same errors in SMlog; they just haven't been rebooted for many moons, so the errors were in the archived logs rather than the current one. So the problem is consistent across the entire estate.

                                      • olivierlambertO Offline
                                        olivierlambert Vates Co-Founder CEO
                                        last edited by

                                        Before refocusing on the original issue: is the SMlog error you get on boot/reboot preventing you from using the SR?

                                        • S Offline
                                          shorian @olivierlambert
                                          last edited by shorian

                                          No, everything seems to work just fine, except for the failure errors when running CR backups, and the messages in the logs when booting / attaching. The VMs themselves are performant, can be migrated on and off the host, and are handling production loads. If it wasn't for the CR errors, I would never have noticed them.

                                          Meanwhile I've dug through the old Xen bug lists and found some similar SR issues, but those seem to result in a failure to mount; nothing matches our specific situation.

                                          • olivierlambertO Offline
                                            olivierlambert Vates Co-Founder CEO
                                            last edited by

                                            Hmm that's weird. I don't know what's causing this. Can you try with a local LVM SR and see if you have the exact same issue? (trying to see if it's related to the storage driver)
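
For reference, a test LVM SR is normally created with `xe sr-create ... type=lvm`. The sketch below only builds and prints that command line so it can be reviewed before running; it does not execute anything. The device path `/dev/sdX` and the name label are placeholders (assumptions), the host UUID is the one from `xe host-list` earlier in the thread, and creating an SR on a device wipes whatever is on it.

```python
import shlex


def build_lvm_sr_create(host_uuid, device, name_label="Local LVM test"):
    """Build the argv for creating a local LVM SR with the xe CLI."""
    return [
        "xe", "sr-create",
        f"host-uuid={host_uuid}",
        "type=lvm",
        "content-type=user",
        "shared=false",
        f"name-label={name_label}",
        f"device-config:device={device}",
    ]


# /dev/sdX is a placeholder; substitute a spare, empty device.
argv = build_lvm_sr_create("3d60f1d4-595e-4428-9c24-6409db9593bc", "/dev/sdX")

# Print a shell-safe version of the command for manual review.
print(" ".join(shlex.quote(arg) for arg in argv))
```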
