XCP-ng

    Every VM in a CR backup job creates an "Unhealthy VDI"

      joeymorin

      Greetings,

      I'm experimenting with CR backups in a test environment. I have a nightly CR backup job, currently for 4 VMs, all going to the same SR, '4TB on antoni'. On the first incremental (the second backup after the initial full), an unhealthy VDI is reported under dashboard/health... one for every VM in the job. Each subsequent incremental results in an additional reported unhealthy VDI, again one for each VM.

      For example:
      [screenshot: 8629d627-af26-42c0-b5a4-955aebc686a6-image.png]
      The following VMs each currently have the initial full, and three subsequent incrementals in the CR chain:

      • HR-FS
      • maryjane
      • zuul

      Note that there are three reported unhealthy VDIs for each.

      The remaining VM, exocomp, currently has only 1 incremental after the initial full, and there is one reported unhealthy VDI for that VM.

      Is this normal? If not, what details can I provide that might help get to the bottom of this?

        Andrew (Top contributor) @joeymorin

        @joeymorin That's correct. They need time to coalesce after snapshots change. Length of 1 is normal. They should clear up after a few minutes.
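
        If you want to look at the chain yourself, something along these lines run on the host should print the VHD tree for the target SR (a sketch assuming a file-based SR mounted under /var/run/sr-mount; <SR-UUID> is a placeholder for the UUID of '4TB on antoni'):

        # Print the parent/child VHD tree for the SR; a deep chain that never
        # shrinks suggests coalesce isn't keeping up.
        vhd-util scan -f -m "/var/run/sr-mount/<SR-UUID>/*.vhd" -p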

          joeymorin @Andrew

          @Andrew, they do not clear up. Please read my OP carefully and look at the screenshot. They remain forever. They accumulate, one for each VM for every incremental. Nightly CR, four VMs, four more unhealthy VDIs. Tomorrow night, four more, etc.

            olivierlambert (Vates 🪐 Co-Founder & CEO)

            We made some fixes very recently (yesterday). Can you check that you're on the latest commit (if you're running XO from the sources)?

              joeymorin @olivierlambert

              I rebuild XO nightly at 11:25 UTC.

              Would these fixes stop the accumulation of unhealthy VDIs for existing CR chains that are already manifesting them, or should I purge all of the CR VMs and snapshots?

              As I type, I'm on 2d066, which is the latest. The CR job runs at 02:00 UTC, so it had just run when I posted my OP. All of the unhealthy VDIs reported then are still reported now.
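
              (For reference, I check which commit the build is on with something like this; /opt/xen-orchestra is simply where my XO sources happen to live, adjust for yours:)

              # Short hash of the currently checked-out xen-orchestra commit
              git -C /opt/xen-orchestra rev-parse --short HEAD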

                olivierlambert (Vates 🪐 Co-Founder & CEO)

                You have to check your host SMlog to see if you have a coalesce issue
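
                Something along these lines on each host will show whether the GC hits trouble during or after the job (the path is the XCP-ng default; adjust the pattern as you like):

                # Follow the storage manager log and surface coalesce/GC problems as they happen
                tail -f /var/log/SMlog | grep -i -E 'coalesce|exception|error'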

                  joeymorin @olivierlambert

                  Three separate hosts are involved. HR-FS and zuul are on one, maryjane on the second, exocomp on the third.

                  In total, there are over 17,000 lines in SMlog for the hour covering the CR job. No errors, no corruptions, no exceptions.

                  Actually, there are some reported exceptions and corruptions on farmer, but none that involve these VMs or this CR job. A fifth VM, not part of the job, has a corruption that I'm still investigating, but it's a test VM I don't care about. The VM HR-FS does have a long-standing coalesce issue where two .vhd files always remain, the logs showing:

                  FAILED in util.pread: (rc 22) stdout: '/var/run/sr-mount/7bc12cff- ... -ce096c635e66.vhd not created by xen; resize not supported
                  

                  ... but this long predates the CR job, and seems related to the manner in which the original .vhd file was created on the host. It doesn't seem relevant, since three other VMs with no history of exceptions/errors in SMlog are showing the same unhealthy VDI behaviour, and two of those aren't even on the same host. One is on a separate pool.
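
                  (For what it's worth, I've been poking at that leftover .vhd with something like the following; the path and filename here are illustrative, not the real ones:)

                  # Check the leftover VHD for structural problems, then print its parent
                  vhd-util check -n /var/run/sr-mount/<SR-UUID>/<VDI-UUID>.vhd
                  vhd-util query -n /var/run/sr-mount/<SR-UUID>/<VDI-UUID>.vhd -p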

                  SMlog is thick and somewhat inscrutable to me. Is there a specific message I should be looking for?

                    olivierlambert (Vates 🪐 Co-Founder & CEO)

                    Can you grep for the word "exception"? (with -i to make sure you get them all)

                      joeymorin @olivierlambert

                      [09:24 farmer ~]# zcat /var/log/SMlog.{31..2}.gz | cat - /var/log/SMlog.1 /var/log/SMlog | grep -i "nov 12 21" | grep -i -e exception -e e.x.c.e.p.t.i.o.n
                      
                      Nov 12 21:12:51 farmer SMGC: [17592]          *  E X C E P T I O N  *
                      Nov 12 21:12:51 farmer SMGC: [17592] coalesce: EXCEPTION <class 'util.CommandException'>, Invalid argument
                      Nov 12 21:12:51 farmer SMGC: [17592]     raise CommandException(rc, str(cmdlist), stderr.strip())
                      Nov 12 21:16:52 farmer SMGC: [17592]          *  E X C E P T I O N  *
                      Nov 12 21:16:52 farmer SMGC: [17592] leaf-coalesce: EXCEPTION <class 'util.SMException'>, VHD *6c411334(8.002G/468.930M) corrupted
                      Nov 12 21:16:52 farmer SMGC: [17592]     raise util.SMException("VHD %s corrupted" % self)
                      Nov 12 21:16:54 farmer SMGC: [17592]          *  E X C E P T I O N  *
                      Nov 12 21:16:54 farmer SMGC: [17592] coalesce: EXCEPTION <class 'util.SMException'>, VHD *6c411334(8.002G/468.930M) corrupted
                      Nov 12 21:16:54 farmer SMGC: [17592]     raise util.SMException("VHD %s corrupted" % self)
                      

                      None of these are relevant to the CR job. The one at 21:12:51 local time is related to the 'resize not supported' issue I mentioned above. The two at 21:16:52 and 21:16:54 are related to a fifth VM not in the CR job (the test VM I don't care about but may continue to investigate).

                      SMlog on the other two hosts is clean.

                        acebmxer @joeymorin

                        @joeymorin

                          If it's any help, check out my post: https://xcp-ng.org/forum/topic/11525/unhealthy-vdis/4
