[Solved] VM Disk Missing

stevezemlicka

I was doing some maintenance and noticed that the disk section of a running VM was empty. I have verified that somehow the primary vhd for this VM has the "is-a-snapshot" attribute set as true with no parent. It also seems that it is listed as a snapshot to itself:

xe vdi-param-list uuid=766d1995-19ba-420f-95e4-30e42dcbc698
uuid ( RO)                    : 766d1995-19ba-420f-95e4-30e42dcbc698
              name-label ( RW): rss01 0
        name-description ( RW): 
           is-a-snapshot ( RO): true
             snapshot-of ( RO): 766d1995-19ba-420f-95e4-30e42dcbc698
               snapshots ( RO): 6714273a-444b-4a21-ad58-b72abb85d6a7; 766d1995-19ba-420f-95e4-30e42dcbc698; 7b3e6fe8-a9f5-4cc1-8f13-d52f474bf7ab
           snapshot-time ( RO): 20241228T06:08:27Z
      allowed-operations (SRO): snapshot; clone
      current-operations (SRO): 
                 sr-uuid ( RO): e5eda81e-540b-029b-f180-20124f81163e
           sr-name-label ( RO): HDD RAID1
               vbd-uuids (SRO): 47e78473-706f-8e36-a017-c0983fdf2560
         crashdump-uuids (SRO): 
            virtual-size ( RO): 214748364800
    physical-utilisation ( RO): 426496
                location ( RO): 766d1995-19ba-420f-95e4-30e42dcbc698
                    type ( RO): System
                sharable ( RO): false
               read-only ( RO): false
            storage-lock ( RO): false
                 managed ( RO): true
     parent ( RO) [DEPRECATED]: <not in database>
                 missing ( RO): false
            is-tools-iso ( RO): false
            other-config (MRW): 
           xenstore-data (MRO): 
               sm-config (MRO): vhd-parent: 03b82421-c7a0-4c13-8d02-52aae2831674; read-caching-enabled-on-0250c976-1a99-4ee3-8b4b-27840941d478: true; host_OpaqueRef:24473335-4516-4675-aba9-ece2b4a46fef: RW
                 on-boot ( RW): persist
           allow-caching ( RW): false
         metadata-latest ( RO): false
        metadata-of-pool ( RO): <not in database>
                    tags (SRW): 
             cbt-enabled ( RO): false

I found a couple of related posts. The first seems to have no resolution and it seems in the second, the solution was to export and import the VM:
https://xcp-ng.org/forum/topic/6981/vmguest-disk-is-missing-in-xen-orchestra/11
https://xcp-ng.org/forum/topic/6336/vm-missing-disk/26

I checked all SRs and none seem to have any coalesce locks so I'm not sure I have any current problems related to coalescence.

I'd like to resolve this in place if possible. Is there a way to set the "is-a-snapshot" flag to false? Is there something else I should check? Or is it best to export and import?

kagbasi-ngc

@stevezemlicka I don't necessarily have a solution, however, I was experiencing similar symptoms in a lab environment (running 3 hosts at v8.3.0) - where the primary disk of several VMs were disappearing at a fairly regular interval. Working with the Vates Support team, we narrowed it down to the Garbage Collection task's interval. Essentially, what I was seeing when looking at the Disks tab of the SR in question, was a set of VDIs (with no name or description) disappearing then re-appearing. Strangely enough, when a disk disappeared from a VM, the VM continued to run as if nothing had happened. Even stranger, during the time when the VDI had disappeared, I could still access the filesystem and manipulate files & folders from inside the VM (which I thought was absolutely magical).

Anyway, while troubleshooting another issue with migrating VDIs to another SR, @olivierlambert asked me to list the contents of the SR in question with the following command: ls -latrh /run/sr-mount/<UUID-of-SR>. When I submitted the output, he noticed a VDI with the text "OLD_" prepended to its name. He explained that perhaps this was an indication of a failed coalescing, and asked me to move (not delete) that file and rescan the SR. I did that, but when I didn't see my issue get resolved, I went ahead and restarted the Xen Toolstack on the master host. As soon as I did this, I noticed my issue got resolved, and as a plus, I observed that the "magical" disappearing of the VDIs was no longer happening. It's been several days now, and I haven't seen any VDIs disappear from any VMs.

So, like I said, not sure if this is a solution for you but perhaps could point you in the right direction.

olivierlambert

Adding @ronan-a and @dthenot as it's maybe related to an issue we are fixing in the future patch available upstream in few days

stevezemlicka

Awesome, great info! I don't recall seeing any atypical VDI names but I don't think I was looking specifically for that. I will re-check the SR and also restart the toolstack. The toolstack restart definitely holds some promise since this is a single server and, as a result, does not get restarted/patched often.

I should get a second server to be able to move critical workloads to (without downtime) and develop a more healthy patch/restart practice even if this is just a homelab. Thanks for that info!

@olivierlambert, if it sounds like this may be related to an issue being worked on and I can provide any helpful info, I'd be happy to gather anything that you think might be relevant.

stevezemlicka

No "OLD_" VHDs in any SR found. Toolstack restart also did not resolve the issue. I will fully patch the server as the next step (probably should have done this earlier).

stevezemlicka

No change after full xcp (8.2.1) patch and reboot. The VM in question started without issue as well.

For reference, I'm using XO commit d044d (currently 11 commits behind) on Master commit 6d34c.

I will leave as is since everything is functional to see if the future commit helps. I should be able to revisit at the end of the week to rebuild on the latest XO commit and test.

stevezemlicka

Just rebuilt XO up to commit a4986. The disk seems to be in the same state. Of course, if this commit includes the fix but the fix specifically prevents this issue, then I'll likely need to take additional steps to correct it to get back into a good state. Or, maybe that fix isn't yet in the commit (or maybe I have a situation that isn't addressed by the fix).

With regards to correcting the issue, I think there are a few options:

Export and import VM
Create 2nd disk, clone the disk from within the VM to the new disk, and boot to the good disk.
Rebuild the VM
Manually flip the disk's "is-a-snapshot" flag to false. Not sure of the implications of doing this.

stevezemlicka

I rebuilt XO to build 7579b with no change. I performed an export and import of the VM and the imported VM functioned normally and displayed the disk in the UI appropriately.