XCP-ng

    VDI_IO_ERROR(Device I/O errors) when you run scheduled backup

    Xen Orchestra
    66 Posts 11 Posters 14.5k Views 7 Watching
    • S Offline
      shorian @olivierlambert
      last edited by shorian

      The (large) SATA drives are, but the SSDs are not; problem is reported for both.

      # parted /dev/sdb print
      
      Model: AVAGO MR9363-4i (scsi)
      Disk /dev/sdb: 6001GB
      Sector size (logical/physical): 512B/4096B
      Partition Table: gpt
      Disk Flags:
      
      Number  Start   End     Size    File system     Name  Flags
       5      1049kB  4296MB  4295MB  ext3
       2      4296MB  23.6GB  19.3GB
       1      23.6GB  43.0GB  19.3GB  ext3
       4      43.0GB  43.5GB  537MB   fat16                 boot
       6      43.5GB  44.6GB  1074MB  linux-swap(v1)
       3      44.6GB  6001GB  5956GB                        lvm
      
      # parted /dev/sda print
      
      Model: AVAGO MR9363-4i (scsi)
      Disk /dev/sda: 500GB
      Sector size (logical/physical): 512B/512B
      Partition Table: gpt
      Disk Flags:
      
      Number  Start   End    Size   File system  Name              Flags
       1      1049kB  500GB  500GB               Linux filesystem
      

      Also, contrary to my thoughts above, the other machines do have the same errors in SMlog; they just haven't been rebooted for many moons, so the errors were sitting in the archived logs rather than the current one. So the problem is consistent across the entire estate.
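
      For a quick cross-check of the logical/physical sector sizes across every disk in one go, something like the following should do it (a sketch; lsblk ships in dom0, and the LOG-SEC / PHY-SEC columns are assumed to be available in the installed version):

      # whole disks only, with model, size and logical/physical sector sizes
      lsblk -d -o NAME,MODEL,SIZE,LOG-SEC,PHY-SEC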

      • olivierlambertO Offline
        olivierlambert Vates 🪐 Co-Founder CEO
        last edited by

        Before refocusing on the original issue: the SMlog error you get on boot/reboot, is it preventing you from using the SR?
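
        One quick way to check is to confirm the SR's PBDs are plugged on every host, roughly like this (the SR UUID is a placeholder):

        # should show currently-attached: true for each host
        xe pbd-list sr-uuid=<sr-uuid> params=host-uuid,currently-attached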

        • S Offline
          shorian @olivierlambert
          last edited by shorian

          No, everything seems to work just fine, except for the failures when running CR backups and the messages in the logs when booting / attaching. The VMs themselves are performant, can be migrated on and off the host, and are handling production loads. If it weren't for the CR errors, I would never have noticed them.

          Meanwhile I've dug through the old Xen bug lists and found some similar SR issues, but those seem to result in a failure to mount - nothing matches our specific situation.

          • olivierlambertO Offline
            olivierlambert Vates 🪐 Co-Founder CEO
            last edited by

            Hmm that's weird. I don't know what's causing this. Can you try with a local LVM SR and see if you have the exact same issue? (trying to see if it's related to the storage driver)
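
            For reference, creating a throwaway local LVM SR for that test would look roughly like this (a sketch; <host-uuid> and /dev/sdX are placeholders for the test host and a spare, wipeable disk):

            # creates a local, non-shared LVM SR on the spare disk
            xe sr-create host-uuid=<host-uuid> type=lvm content-type=user \
              name-label="Local LVM test" device-config:device=/dev/sdX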

            • S Offline
              shorian @olivierlambert
              last edited by shorian

              Fresh install with LVM on the boot device: same error. Created a second disk as LVM through XO: same error. Rebooted: same pair of errors.

              Error appears to be agnostic of storage type.

              Found a host with same drives, same CPU, same config, different RAID card (Adaptec instead of LSI), same error.

              Error appears to be agnostic of storage manufacturer.

              It seems to be the GC that causes it - elsewhere in the logs the GC takes a lock, checks, decides there's nothing to clean up, and releases the lock. Just when it boots, or when the SR is attached, it's as though there is a slight delay where the storage reports as not being attached, the exception is fired into SMlog, then the GC tries again, connects just fine and gets the lock. We all continue as before, no other issues, no other errors, until either a reboot or a storage re-attach via CLI or XO. The EXCEPTION error does not occur during normal usage, nor during a backup.
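
              For reference, the lock / exception sequence around boot or attach can be pulled out of SMlog in one pass with something like this (standard dom0 log path; the pattern is just the strings quoted above):

              # show the most recent GC kicks, lock activity and exceptions
              grep -nE 'EXCEPTION|Kicking GC|gc_active|lock: (acquired|released)' /var/log/SMlog | tail -n 60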

              Looking into xensource.log, we do get some errors as per below:

              xapi: [ warn||1568 INET :::80|event.from D:87552104bbaf|xapi_message] get_since_for_events: no in_memory_cache!
              xapi: [debug||1569 INET :::80|host.call_plugin R:91601df4a1d6|audit] Host.call_plugin host = 'c7634b1b-79b1-4eb2-840b-ecc401b7743d (xcp-ng-xen11)'; plugin = 'updater.py'; fn = 'check_update' args = [ 'hidden' ]
              xapi: [ warn||1568 INET :::80|event.from D:87552104bbaf|xapi_message] get_since_for_events: no in_memory_cache!
              xapi: [ warn||1570 INET :::80|event.from D:b4ea13ab245c|xapi_message] get_since_for_events: no in_memory_cache!
              xapi: [ warn||1570 INET :::80|event.from D:b4ea13ab245c|xapi_message] get_since_for_events: no in_memory_cache!
              xapi: [ warn||1571 INET :::80|event.from D:8b9b61c7b0b2|xapi_message] get_since_for_events: no in_memory_cache!
              xapi: [error||1572 INET :::80||backtrace] host.get_by_uuid D:e4eb5b632522 failed with exception Db_exn.Read_missing_uuid("host", "", "df8fe71d-f748-4df4-bd53-97d8232249e4")
              xapi: [error||1572 INET :::80||backtrace] Raised Db_exn.Read_missing_uuid("host", "", "df8fe71d-f748-4df4-bd53-97d8232249e4")
              xapi: [error||1572 INET :::80||backtrace] 1/9 xapi Raised at file ocaml/database/db_cache_impl.ml, line 232
              xapi: [error||1572 INET :::80||backtrace] 2/9 xapi Called from file ocaml/xapi/db_actions.ml, line 8179
              xapi: [error||1572 INET :::80||backtrace] 3/9 xapi Called from file ocaml/xapi/rbac.ml, line 223
              xapi: [error||1572 INET :::80||backtrace] 4/9 xapi Called from file ocaml/xapi/rbac.ml, line 231
              xapi: [error||1572 INET :::80||backtrace] 5/9 xapi Called from file ocaml/xapi/server_helpers.ml, line 100
              xapi: [error||1572 INET :::80||backtrace] 6/9 xapi Called from file ocaml/xapi/server_helpers.ml, line 121
              xapi: [error||1572 INET :::80||backtrace] 7/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
              xapi: [error||1572 INET :::80||backtrace] 8/9 xapi Called from file map.ml, line 135
              xapi: [error||1572 INET :::80||backtrace] 9/9 xapi Called from file sexp_conv.ml, line 147
              xapi: [error||1572 INET :::80||backtrace]
              xapi: [ warn||1571 INET :::80|event.from D:8b9b61c7b0b2|xapi_message] get_since_for_events: no in_memory_cache!
              

              However, given this only seems to occur at the point of boot or of attaching the SR, I'm not convinced that this error is related to the VDI_IO_ERROR we're seeing when running CR backups, at which point xensource.log has the following:

              xapi: [debug||257548 INET :::80|[XO] VDI Export R:2a08e12cc000|taskhelper] the status of R:2a08e12cc000 is failure; cannot set it to failure
              xapi: [error||257548 INET :::80||backtrace] [XO] VDI Export R:2a08e12cc000 failed with exception Server_error(VDI_IO_ERROR, [ Device I/O errors ])
              xapi: [error||257548 INET :::80||backtrace] Raised Server_error(VDI_IO_ERROR, [ Device I/O errors ])
              xapi: [error||257548 INET :::80||backtrace] 1/13 xapi Raised at file ocaml/xapi/vhd_tool_wrapper.ml, line 75
              xapi: [error||257548 INET :::80||backtrace] 2/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
              xapi: [error||257548 INET :::80||backtrace] 3/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
              xapi: [error||257548 INET :::80||backtrace] 4/13 xapi Called from file ocaml/xapi/export_raw_vdi.ml, line 49
              xapi: [error||257548 INET :::80||backtrace] 5/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
              xapi: [error||257548 INET :::80||backtrace] 6/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
              xapi: [error||257548 INET :::80||backtrace] 7/13 xapi Called from file ocaml/xapi/export_raw_vdi.ml, line 86
              xapi: [error||257548 INET :::80||backtrace] 8/13 xapi Called from file ocaml/xapi/export_raw_vdi.ml, line 110
              xapi: [error||257548 INET :::80||backtrace] 9/13 xapi Called from file ocaml/xapi/server_helpers.ml, line 100
              xapi: [error||257548 INET :::80||backtrace] 10/13 xapi Called from file ocaml/xapi/server_helpers.ml, line 121
              xapi: [error||257548 INET :::80||backtrace] 11/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
              xapi: [error||257548 INET :::80||backtrace] 12/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
              xapi: [error||257548 INET :::80||backtrace] 13/13 xapi Called from file lib/backtrace.ml, line 177
              xapi: [error||257548 INET :::80||backtrace]
              xapi: [debug||258603 UNIX /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.logout D:f14c1d127f16 created by task D:4493412bc527
              xapi: [ info||258603 UNIX /var/lib/xcp/xapi|session.logout D:4a1ab726bfb5|xapi_session] Session.destroy trackid=fc26bb0306ef13c29fa128ddb1338ca7
              xapi: [error||257548 INET :::80||backtrace] VDI.export_raw_vdi D:4493412bc527 failed with exception Server_error(VDI_IO_ERROR, [ Device I/O errors ])
              xapi: [error||257548 INET :::80||backtrace] Raised Server_error(VDI_IO_ERROR, [ Device I/O errors ])
              xapi: [error||257548 INET :::80||backtrace] 1/13 xapi Raised at file lib/debug.ml, line 218
              xapi: [error||257548 INET :::80||backtrace] 2/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
              xapi: [error||257548 INET :::80||backtrace] 3/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
              xapi: [error||257548 INET :::80||backtrace] 4/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
              xapi: [error||257548 INET :::80||backtrace] 5/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
              xapi: [error||257548 INET :::80||backtrace] 6/13 xapi Called from file ocaml/xapi/xapi_http.ml, line 193
              xapi: [error||257548 INET :::80||backtrace] 7/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
              xapi: [error||257548 INET :::80||backtrace] 8/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
              xapi: [error||257548 INET :::80||backtrace] 9/13 xapi Called from file ocaml/xapi/server_helpers.ml, line 100
              xapi: [error||257548 INET :::80||backtrace] 10/13 xapi Called from file ocaml/xapi/server_helpers.ml, line 121
              xapi: [error||257548 INET :::80||backtrace] 11/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
              xapi: [error||257548 INET :::80||backtrace] 12/13 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
              xapi: [error||257548 INET :::80||backtrace] 13/13 xapi Called from file lib/backtrace.ml, line 177
              xapi: [error||257548 INET :::80||backtrace]
              xapi: [error||257548 INET :::80||backtrace] VDI.export_raw_vdi D:2f552cd49c3a failed with exception Server_error(VDI_IO_ERROR, [ Device I/O errors ])
              xapi: [error||257548 INET :::80||backtrace] Raised Server_error(VDI_IO_ERROR, [ Device I/O errors ])
              xapi: [error||257548 INET :::80||backtrace] 1/1 xapi Raised at file (Thread 257548 has no backtrace table. Was with_backtraces called?, line 0
              xapi: [error||257548 INET :::80||backtrace]
              xapi: [error||257548 INET :::80||xapi_http] Unhandled Api_errors.Server_error(VDI_IO_ERROR, [ Device I/O errors ])
              

              Two separate issues? The VDI_IO_ERROR is one that people would see through the GUI, so likely to be spotted, whereas the EXCEPTION in the SMlog is not going to be noticed unless you go digging - it seems to have no impact.

              • olivierlambertO Offline
                olivierlambert Vates 🪐 Co-Founder CEO
                last edited by

                I suppose there's no timing correlation between the two? Indeed, the error in xensource.log seems more meaningful, because it happens during the backup process.
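
                One quick way to eyeball whether the two errors line up in time might be something like this (a sketch, assuming the usual syslog-style timestamp prefix at the start of each line in both logs):

                # timestamps of the CR backup failures (xapi side)
                grep 'VDI_IO_ERROR' /var/log/xensource.log | awk '{print $1, $2, $3}' | sort -u
                # timestamps of the SR exception (SM side)
                grep 'EXCEPTION' /var/log/SMlog | awk '{print $1, $2, $3}' | sort -u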

                • S Offline
                  shorian @olivierlambert
                  last edited by

                  Nothing of interest in SMlog at the point of backup. We only started looking there when hunting for any SR-related issues, but I'm inclined to view it as ancillary. The SR exception doesn't seem to be impacting day-to-day operations, whereas the CR item is a big concern for us, and now that we've dug into it further there doesn't seem to be a correlation between the two.

                  So... I agree with your synopsis and propose putting the SMlog exception to one side so we can concentrate on the original issue - the xensource.log error from my previous post.

                  Tomorrow I'm going to see if I can find a way of simulating a brief access delay and see if that recreates the problem - I've seen something vaguely similar in some old Xen 6 bug reports and I've run out of anything sensible to try. The strangest thing is that it's consistent across all of the boxes and survives a fresh installation - which reminds me, I'll also try it from and to an lvm target.
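
                  One possible way to inject that kind of access delay on a test disk is the device-mapper delay target - strictly a lab-only sketch, and it assumes the dm-delay module is available in dom0; /dev/sdX is a placeholder for a spare, non-production disk:

                  # size of the test disk in 512-byte sectors
                  SECTORS=$(blockdev --getsz /dev/sdX)
                  # map the disk through dm-delay, adding ~100 ms to every I/O
                  dmsetup create delayed-sdX --table "0 ${SECTORS} delay /dev/sdX 0 100"
                  # ...run the test against /dev/mapper/delayed-sdX, then tear it down:
                  dmsetup remove delayed-sdX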

                  • olivierlambertO Offline
                    olivierlambert Vates 🪐 Co-Founder CEO
                    last edited by

                    Thanks. That's a weird issue anyway, eager to know if you have the same behavior with local LVM.

                    • S Offline
                      shorian @olivierlambert
                      last edited by

                      Initial test - looks like it's working fine running CR from EXT4 to LVM. I'm now doing a deep soak to increase confidence in that statement, but so far so good.

                      Good - in that it narrows down where we look. Bad - in that I need the thin provisioning, but pleased we've got some progress. Good shout @olivierlambert 🙂

                      • S Offline
                        shorian @shorian
                        last edited by shorian

                        The soak test running between two local hosts, ext4 -> lvm, has worked faultlessly. I've now run 10 complete backups without a single issue. Previously, doing this ext4 -> ext4 would have thrown errors maybe 25-50% of the time.

                        However, running between two remote hosts, ext4 -> lvm over a 1Gbps connection has hit the VDI_IO_ERROR issue 😞

                        I'll try the reverse, lvm -> ext4. lvm -> lvm might be a challenge, but I'll try to get something organised by rearranging the boxes a little.

                        • S Offline
                          shorian @shorian
                          last edited by

                          @olivierlambert Update after 20+ iterations - writing to lvm (ext4 -> lvm) between remote hosts has failed only 1 time in 20, which for me is more than acceptable, compared with writing ext4 -> ext4, which fails 80% of the time.

                          Given that when it did fail on lvm it was the same error, I think it's less likely that the driver is the root cause and more likely a timing or back-pressure issue. All the more so given that the failure only arose when writing to a remote (outside the firewall) host, while writing to a host inside the firewall has not had a single issue.

                          Also I think it's interesting that the issue was (largely) resolved by making the change at the target end, not the source, despite there not being much in the way of error logs at the point of failure on the target.

                          So... where to look next? Do you have any tools that can simulate the above, so we can enable / disable it and therefore isolate the problem?

                          (Happy to allow you in via XOA if it makes life easier)
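
                          For simulating the WAN behaviour on a test link, one option is the kernel's netem qdisc - a sketch only, to be run on a test host; eth0 is a placeholder for the backup-facing interface:

                          # add ~80 ms (+/- 20 ms) of latency and 0.5% packet loss
                          tc qdisc add dev eth0 root netem delay 80ms 20ms loss 0.5%
                          # remove it again when the test is done
                          tc qdisc del dev eth0 root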

                          • olivierlambertO Offline
                            olivierlambert Vates 🪐 Co-Founder CEO
                            last edited by

                            It's really hard to track those random issues.

                            Sounds like a kind of race condition happening on your hardware/setup.

                            • S Offline
                              shorian @olivierlambert
                              last edited by

                              @olivierlambert So chances of resolving are minimal to none. Ugh. I'll keep digging around, it's proving a right pain in the @£$%

                              • olivierlambertO Offline
                                olivierlambert Vates 🪐 Co-Founder CEO
                                last edited by

                                Well, if you spot something, or are able to reproduce it easily with a procedure other people could follow and reproduce too, that would increase the chances a lot 🙂

                                • S Offline
                                  shorian
                                  last edited by shorian

                                  Some more insight - for the past couple of weeks I've been running a CR backup simultaneously to two SRs on different boxes. One consistently fails (remote to an identical XCP-ng box, but over the web on a 1Gbps link) with the usual error, while the other succeeds (inside the firewall, 1Gbps link, identical XCP-ng box), which would point to the problem being not at the source but in the wide-area transit.

                                  So...here is SMlog at the time of the failure on the target:

                                  SM: [24325] vdi_detach {'sr_uuid': 'cf2dbaa3-21f3-903b-0fd1-fbe68539f897', 'subtask_of': 'DummyRef:|b73e41cc-a667-4ac0-b0f9-545dcd88b08a|VDI.detach', 'vdi_ref': 'OpaqueRef:3c20a324-2c22-4977-a63e-5baee37f9372', 'vdi_on_boot': 'persist', 'args': [], 'o_direct': False, 'vdi_location': '923acf0b-03bf-44ba-a846-f468141910f3', 'host_ref': 'OpaqueRef:d1eeb523-7938-4d80-a465-68ba2f149846', 'session_ref': 'OpaqueRef:a73dd5d5-2115-4bff-9d56-5cec26e267d8', 'device_config': {'device': '/dev/disk/by-id/scsi-3600605b010276170276bb1850660dd94-part3', 'SRmaster': 'true'}, 'command': 'vdi_detach', 'vdi_allow_caching': 'false', 'sr_ref': 'OpaqueRef:0cb18a22-646b-4dab-b08e-21ca36628f62', 'local_cache_sr': 'cf2dbaa3-21f3-903b-0fd1-fbe68539f897', 'vdi_uuid': '923acf0b-03bf-44ba-a846-f468141910f3'}
                                  SM: [24325] lock: opening lock file /var/lock/sm/923acf0b-03bf-44ba-a846-f468141910f3/vdi
                                  SM: [24325] lock: released /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/sr
                                  SM: [24380] lock: opening lock file /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/sr
                                  SM: [24380] lock: acquired /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/sr
                                  SM: [24380] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/923acf0b-03bf-44ba-a846-f468141910f3.vhd']
                                  SM: [24380]   pread SUCCESS
                                  SM: [24380] vdi_delete {'sr_uuid': 'cf2dbaa3-21f3-903b-0fd1-fbe68539f897', 'subtask_of': 'DummyRef:|6236329b-d253-4da6-bc9b-d5498fd02069|VDI.destroy', 'vdi_ref': 'OpaqueRef:3c20a324-2c22-4977-a63e-5baee37f9372', 'vdi_on_boot': 'persist', 'args': [], 'o_direct': False, 'vdi_location': '923acf0b-03bf-44ba-a846-f468141910f3', 'host_ref': 'OpaqueRef:d1eeb523-7938-4d80-a465-68ba2f149846', 'session_ref': 'OpaqueRef:b8fbaaeb-221f-4956-9088-f8565ba34d0a', 'device_config': {'device': '/dev/disk/by-id/scsi-3600605b010276170276bb1850660dd94-part3', 'SRmaster': 'true'}, 'command': 'vdi_delete', 'vdi_allow_caching': 'false', 'sr_ref': 'OpaqueRef:0cb18a22-646b-4dab-b08e-21ca36628f62', 'local_cache_sr': 'cf2dbaa3-21f3-903b-0fd1-fbe68539f897', 'vdi_uuid': '923acf0b-03bf-44ba-a846-f468141910f3'}
                                  SM: [24380] lock: unlinking lock file /var/lock/sm/923acf0b-03bf-44ba-a846-f468141910f3/vdi
                                  SM: [24380] lock: removing lock dir /var/lock/sm/923acf0b-03bf-44ba-a846-f468141910f3
                                  SM: [24380] lock: opening lock file /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/running
                                  SM: [24380] lock: tried lock /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/running, acquired: True (exists: True)
                                  SM: [24380] lock: released /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/running
                                  SM: [24380] Kicking GC
                                  SMGC: [24380] === SR cf2dbaa3-21f3-903b-0fd1-fbe68539f897: gc ===
                                  SMGC: [24405] Will finish as PID [24406]
                                  SMGC: [24380] New PID [24405]
                                  SM: [24406] lock: opening lock file /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/running
                                  SM: [24406] lock: opening lock file /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/gc_active
                                  SM: [24406] lock: opening lock file /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/sr
                                  SM: [24380] lock: released /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/sr
                                  SMGC: [24406] Found 0 cache files
                                  SM: [24406] lock: tried lock /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/gc_active, acquired: True (exists: True)
                                  SM: [24406] lock: tried lock /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/sr, acquired: True (exists: True)
                                  SM: [24406] ['/usr/bin/vhd-util', 'scan', '-f', '-m', '/var/run/sr-mount/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/*.vhd']
                                  SM: [24406]   pread SUCCESS
                                  SMGC: [24406] SR cf2d ('Xen12 SATA') (18 VDIs in 16 VHD trees):
                                  SMGC: [24406]         b1392757(5.000G/1.589G)
                                  SMGC: [24406]         *d1789c22(150.000G/87.188G)
                                  SMGC: [24406]             71e7c935(150.000G/313.500K)
                                  SMGC: [24406]             aae21e12(150.000G/1.218G)
                                  SMGC: [24406]         7b458e2e(10.000G/3.106G)
                                  SMGC: [24406]         0fd9483c(25.000G/15.661G)
                                  SMGC: [24406]         ae6739bf(10.000G/4.730G)
                                  SMGC: [24406]         d25b8134(20.000G/3.759G)
                                  SMGC: [24406]         265ed8d8(150.000G/16.609G)
                                  SMGC: [24406]         32474df8(100.000G/98.434G)
                                  SMGC: [24406]         5ae3d2e6(50.000G/46.990G)
                                  SMGC: [24406]         7fc5f335(250.000G/16.885G)
                                  SMGC: [24406]         ba7c92a3(25.000G/15.661G)
                                  SMGC: [24406]         d6de6348(20.000G/9.978G)
                                  SMGC: [24406]         24c6b7e8(20.000G/9.978G)
                                  SMGC: [24406]         ab433fff(50.000G/49.595G)
                                  SMGC: [24406]         6750bf45(10.000G/3.704G)
                                  SMGC: [24406]         f9e3cd7e(10.000G/5.949G)
                                  SMGC: [24406]
                                  SM: [24406] lock: released /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/sr
                                  SMGC: [24406] Got sm-config for *d1789c22(150.000G/87.188G): {'vhd-blocks': 'eJzt1kESgiAUBmC5QTfgqB2Fq9QJOsKzCzTjqlooFakjkwjiQ5jm/xculHkf0MPUesil0zeNIAiCzOaa0T7rrhKBY4ld76pP3H7LLjr8Z2LI6Yv67cvEDLl9s/7UvjP9+nPx3v4L7PnoNk2x/7TKF13Z++8+mhzvhr7/FEOpSN+//rkQmy+ifK4M65fZfPEoYf35fHEqYv1NZj8wxOQOdYrZ/3/xxz5SrH7s5+Hd65vfX64qqszVPaVvNQopFbv/x9m7/ZSUDp6A5TcU+xZQv7fkdEZh/ups/wLZ//zZW5z//z+lT/4hpbz/1HCHFka7joc0V7sbl+qY58ryJ1FzVmCry7BhY3z9b8+iPVSsEb2/uU5mPzbw4cOHDx8+fPjw4cOHD38vv05U+QVg667e'}
                                  SMGC: [24406] No work, exiting
                                  SMGC: [24406] GC process exiting, no work left
                                  SM: [24406] lock: released /var/lock/sm/cf2dbaa3-21f3-903b-0fd1-fbe68539f897/gc_active
                                  SMGC: [24406] In cleanup
                                  SMGC: [24406] SR cf2d ('Xen12 SATA') (18 VDIs in 16 VHD trees): no changes
                                  

                                  The only thing I can see in there is the garbage collection.

                                  Given that the issue seems to be something to do with the wide area network (same speed as inside the firewall, but obviously more hops), I wonder if this may be a similar problem to the one we're seeing in the S3 backups over the WAN. Could it be that some packets are being delayed in transit, or that there is an internal time limit on a backup chunk completing? That would also explain why backing up smaller VMs has a higher chance of success than larger ones, since the chance of hitting a temporary timeout increases the longer a backup runs.
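
                                  One way to test the sustained-transfer theory would be a long run between the two hosts with iperf3 (a sketch; iperf3 isn't installed in dom0 by default, so this assumes it can be run on or near both ends):

                                  # on the target side
                                  iperf3 -s
                                  # on the source side: a sustained 10-minute run, reporting every 10 seconds
                                  iperf3 -c <target-ip> -t 600 -i 10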

                                  I've kept the setup as-is since (finally) I now have 100% repeatability of the CR backup failure to the remote SR, so if someone wants to come in and have a look around, it can be arranged. Others seem to have the same problem, so hopefully this can help get to the root cause, as this is the first time I've managed to get it to fail consistently. I appreciate that this doesn't sound like a success, but it beats an intermittent failure!

                                  • olivierlambertO Offline
                                    olivierlambert Vates 🪐 Co-Founder CEO
                                    last edited by

                                    Out of curiosity, can you do a: dmesg | grep -i cut -C10 (on your master host)

                                    • S Offline
                                      shorian @olivierlambert
                                      last edited by

                                      @olivierlambert Target or source?

                                      • olivierlambertO Offline
                                        olivierlambert Vates 🪐 Co-Founder CEO
                                        last edited by

                                        Try both.

                                        • S Offline
                                          shorian
                                          last edited by

                                          Source:

                                          # dmesg | grep -i cut -C10
                                          [    0.000000] Xen: [mem 0x0000000079a66000-0x0000000079f6efff] ACPI NVS
                                          [    0.000000] Xen: [mem 0x0000000079f6f000-0x000000008fffffff] reserved
                                          [    0.000000] Xen: [mem 0x00000000c7ffc000-0x00000000c7ffcfff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fbffc000-0x00000000fbffcfff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fec00000-0x00000000fec01fff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fec40000-0x00000000fec40fff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fed1c000-0x00000000fed44fff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
                                          [    0.000000] Xen: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
                                          [    0.000000] Xen: [mem 0x0000000100000000-0x0000000270f0efff] usable
                                          [    0.000000] NX (Execute Disable) protection: active
                                          [    0.000000] SMBIOS 3.0 present.
                                          [    0.000000] DMI: Supermicro X10DRi/X10DRi, BIOS 2.1 09/13/2016
                                          [    0.000000] Hypervisor detected: Xen PV
                                          [    0.047663] tsc: Fast TSC calibration using PIT
                                          [    0.047664] tsc: Detected 2399.968 MHz processor
                                          [    0.047665] tsc: Detected 2400.010 MHz TSC
                                          [    0.049570] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
                                          [    0.049572] e820: remove [mem 0x000a0000-0x000fffff] usable
                                          [    0.049578] last_pfn = 0x270f0f max_arch_pfn = 0x400000000
                                          [    0.049579] Disabled
                                          

                                          Target:

                                          dmesg | grep -i cut -C10
                                          [    0.000000] Xen: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fec18000-0x00000000fec18fff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fec20000-0x00000000fec20fff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fec28000-0x00000000fec28fff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fec30000-0x00000000fec30fff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fec38000-0x00000000fec38fff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed44fff] reserved
                                          [    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
                                          [    0.000000] Xen: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
                                          [    0.000000] Xen: [mem 0x0000000100000000-0x00000004a52a5fff] usable
                                          [    0.000000] NX (Execute Disable) protection: active
                                          [    0.000000] efi: EFI v2.50 by American Megatrends
                                          [    0.000000] efi:  SMBIOS=0x6f0bc000  SMBIOS 3.0=0x6f0bb000  ACPI 2.0=0x6ca22000  ACPI=0x6ca22000  ESRT=0x64ba8f18
                                          [    0.000000] SMBIOS 3.1.1 present.
                                          [    0.000000] DMI: Supermicro SYS-6029P-TR/X11DPi-N, BIOS 2.0b 02/28/2018
                                          [    0.000000] Hypervisor detected: Xen PV
                                          [    0.000480] tsc: Detected 2200.030 MHz processor
                                          [    0.002610] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
                                          [    0.002613] e820: remove [mem 0x000a0000-0x000fffff] usable
                                          [    0.002624] last_pfn = 0x4a52a6 max_arch_pfn = 0x400000000
                                          [    0.002625] Disabled
                                          
                                          • olivierlambertO Offline
                                            olivierlambert Vates 🪐 Co-Founder CEO
                                            last edited by

                                            Hmm nothing obvious here…
