Migrating VM fails with DUPLICATE_VM error part2

olivierlambert

Okay, and on the host in question, can you post the output of xe host-param-list?

AndreS

uuid ( RO)                                  : 03688598-ec14-4492-ab46-5424dcee8e9f
                            name-label ( RW): xen2
                      name-description ( RW): Dell T20
                    allowed-operations (SRO): VM.migrate; provision; VM.resume; evacuate; VM.start
                    current-operations (SRO): 
                               enabled ( RO): true
                               display ( RO): enabled
                     API-version-major ( RO): 2
                     API-version-minor ( RO): 16
                    API-version-vendor ( RO): XenSource
     API-version-vendor-implementation (MRO): 
                               logging (MRW): 
                 suspend-image-sr-uuid ( RW): a610de27-9c64-ee06-a8fd-4e1d1c7768ab
                    crash-dump-sr-uuid ( RW): a610de27-9c64-ee06-a8fd-4e1d1c7768ab
                      software-version (MRO): product_version: 8.2.0; product_version_text: 8.2; product_version_text_short: 8.2; platform_name: XCP; platform_version: 3.2.0; product_brand: XCP-ng; build_number: release/stockholm/master/7; hostname: localhost; date: 2021-05-20; dbv: 0.0.1; xapi: 1.20; xen: 4.13.1-9.12.1; linux: 4.19.0+1; xencenter_min: 2.16; xencenter_max: 2.16; network_backend: openvswitch; db_schema: 5.602
                          capabilities (SRO): xen-3.0-x86_64; xen-3.0-x86_32p; hvm-3.0-x86_32; hvm-3.0-x86_32p; hvm-3.0-x86_64; 
                          other-config (MRW): agent_start_time: 1636223713.; boot_time: 1632048708.; rpm_patch_installation_time: 1632048285.199; iscsi_iqn: iqn.2020-07.com.example:67db4a3c
                              cpu_info (MRO): cpu_count: 4; socket_count: 1; vendor: GenuineIntel; speed: 3192.841; modelname: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz; family: 6; model: 60; stepping: 3; flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid pn
-00000000-00000000-00000000-00000000; features_hvm: 1fcbfbff-f7fa3223-2d93fbff-00000423-00000001-000007ab-00000000-00000000-00001000-9c000400-00000000-00000000-00000000-00000000-00000000; features_hvm_host: 1fcbfbff-f7fa3223-2d93fbff-00000423-00000001-000007ab-00000000-00000000-00001000-9c000400-00000000-00000000-00000000-00000000-00000000; features_pv_host: 1fc9cbf5-f6f83203-2991cbf5-00000023-00000001-00000329-00000000-00000000-00001000-8c000400-00000000-00000000-00000000-00000000-00000000
                          chipset-info (MRO): iommu: true
                              hostname ( RO): xen2
                               address ( RO): 172.25.10.12
                 supported-bootloaders (SRO): pygrub; eliloader
                                 blobs ( RO): 
                       memory-overhead ( RO): 621060096
                          memory-total ( RO): 25673416704
                           memory-free ( RO): 16197537792
                  memory-free-computed ( RO): 3970736128
                     host-metrics-live ( RO): true
                  patches (SRO) [DEPRECATED]: 
                               updates (SRO): 
                         ha-statefiles ( RO): 
                      ha-network-peers ( RO): 
                    external-auth-type ( RO): 
            external-auth-service-name ( RO): 
           external-auth-configuration (MRO): 
                               edition ( RO): xcp-ng
                        license-server (MRO): address: localhost; port: 27000
                         power-on-mode ( RO): 
                       power-on-config (MRO): 
                        local-cache-sr ( RO): <not in database>
                                  tags (SRW): 
                            ssl-legacy ( RW): false
                    guest_VCPUs_params (MRW): 
    virtual-hardware-platform-versions (SRO): 0; 1; 2
                   control-domain-uuid ( RO): 66c5258b-6429-4658-b31c-06ccd0f1896d
                          resident-vms (SRO): 90ee64f7-9a07-fd83-033a-10183b98a9a6; c318a4e3-a14d-01ca-b2c4-df11c1f9d9b8; 66c5258b-6429-4658-b31c-06ccd0f1896d; 49c86f89-eabb-bd68-dc79-387434bdb899
              updates-requiring-reboot (SRO): 
                              features (SRO): 
                             iscsi_iqn ( RW): iqn.2020-07.com.example:67db4a3c
                          multipathing ( RW): false

olivierlambert

That's weird: there is a big difference between memory-free and memory-free-computed. I.e. you don't have enough (computed) free memory to boot the VM right now.

edit: maybe you have dynamic memory for some VMs, using more than you think.
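
To put numbers on that gap, here is a minimal Python sketch that parses the relevant lines of the pasted xe host-param-list output and compares both figures with the 6 GiB the VM is configured for. The helper names (parse_host_params, memory_report) are made up for illustration. The "vm_fits" check is a simplification: it assumes the VM's configured memory is compared against memory-free-computed, and it ignores per-VM overhead.

```python
import re

GIB = 1024 ** 3

# Matches lines such as "memory-free ( RO): 16197537792".
PARAM_LINE = re.compile(r"^\s*([\w-]+)\s*\(\s*[A-Z]+\)[^:]*:\s?(.*)$")


def parse_host_params(text):
    """Turn `xe host-param-list` output into a {name: value} dict."""
    params = {}
    for line in text.splitlines():
        match = PARAM_LINE.match(line)
        if match:
            params[match.group(1)] = match.group(2).strip()
    return params


def memory_report(params, vm_bytes):
    """Compare both free-memory figures against a VM's memory need.

    Assumption: the VM fits only if its memory is <= memory-free-computed.
    """
    free = int(params["memory-free"])
    computed = int(params["memory-free-computed"])
    return {
        "free_gib": free / GIB,
        "computed_gib": computed / GIB,
        "gap_gib": (free - computed) / GIB,
        "vm_fits": vm_bytes <= computed,
    }


# Values copied from the host-param-list output posted above.
sample = """
                       memory-total ( RO): 25673416704
                        memory-free ( RO): 16197537792
               memory-free-computed ( RO): 3970736128
"""

report = memory_report(parse_host_params(sample), vm_bytes=6 * GIB)
for name, value in report.items():
    print(name, value if isinstance(value, bool) else round(value, 2))
```

With the figures from this host, memory-free-computed is well below 6 GiB even though memory-free is far above it. That matches the symptom: the VM only started once the other VMs were stopped.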

AndreS

@olivierlambert

Dynamic, yes, but the total is still very low. Everywhere I look, memory usage is just under 9 GB, including 2.2 GB for xcp-ng. The host has 24 GB. Should I just reboot to have a clean start?

olivierlambert

Let's try a reboot, with no VMs up, except the one you want to boot.

AndreS

@olivierlambert
I stopped all other VMs and then I was able to start the Win10vm again.

Memory settings for this VM:
Static: 6 GiB/6 GiB
Dynamic: 6 GiB/6 GiB

I'll do a reboot anyway because I have never had this before; normally I have 4-5 VMs running and a few GB of memory free, even with the dynamic memory settings. I can only guess that the migrations used memory that was not freed up.

I will reboot both hosts, just to be sure; check the mac_seed and set it to something different and try again.

olivierlambert

Thanks for your feedback

AndreS

@olivierlambert Rebooted both hosts, everything came up as expected.
Stopped the Win10vm vm, changed the mac_seed and started the migration.
Initially it looked ok but it finished with an error:

xe vm-migrate remote-master=172.25.10.11 remote-username=root remote-password=xxxxxx vif:f4b175c2-0082-212c-b9d9-bd616cd83d2c=a014b230-2db6-adb4-ba4f-0b1cc07fdcae vm=Win10vm
Performing a Storage XenMotion migration. Your VM's VDIs will be migrated with the VM.
Will migrate to remote host: xen1, using remote network: Pool-wide network associated with eth0. Here is the VDI mapping:
VDI 4d4a809d-6801-4462-8e52-811882106821 -> SR 270f8f4a-a24c-ced6-99c7-9bc2ba5f5008
VDI 6a30ca10-a386-4e00-91aa-89c3e5bd43de -> SR 270f8f4a-a24c-ced6-99c7-9bc2ba5f5008
The VDI copy action has failed
<extra>: End_of_file

Via XO

vm.migrate
{
  "vm": "afe623be-5451-fd48-3f24-60120e53f5ab",
  "mapVifsNetworks": {
    "f4b175c2-0082-212c-b9d9-bd616cd83d2c": "a014b230-2db6-adb4-ba4f-0b1cc07fdcae"
  },
  "migrationNetwork": "a014b230-2db6-adb4-ba4f-0b1cc07fdcae",
  "sr": "270f8f4a-a24c-ced6-99c7-9bc2ba5f5008",
  "targetHost": "3b57d90b-983f-46bb-8f52-4319025d1182"
}
{
  "code": 21,
  "data": {
    "objectId": "afe623be-5451-fd48-3f24-60120e53f5ab",
    "code": "VDI_COPY_FAILED"
  },
  "message": "operation failed",
  "name": "XoError",
  "stack": "XoError: operation failed
    at operationFailed (/opt/xo/xo-builds/xen-orchestra-202111061638/packages/xo-common/src/api-errors.js:21:32)
    at file:///opt/xo/xo-builds/xen-orchestra-202111061638/packages/xo-server/src/api/vm.mjs:482:15
    at Object.migrate (file:///opt/xo/xo-builds/xen-orchestra-202111061638/packages/xo-server/src/api/vm.mjs:469:3)
    at Api.callApiMethod (file:///opt/xo/xo-builds/xen-orchestra-202111061638/packages/xo-server/src/xo-mixins/api.mjs:304:20)"
}

I have, however, successfully migrated the 'other' VM that was imported (and had the duplicate mac_seed). So both are now running on the same host.

So besides the mac_seed, there seems to be something wrong with the imported Windows VM; it does indeed start horribly slowly.

AndreS

OK, the underlying disk has problems. I found the kernel.log and it is filling up with read errors. That part is clear now.

So what led me to try and migrate the VM (slow boot/response) uncovered a mac_seed duplication, and we fixed that. Not sure if this was already fixed in the code last year.

On to this problem: could I have seen the disk issue somewhere in the XO interface? In the logs I only see the higher-level issue of failed migrations. It's a learning experience anyway; maybe a mirrored disk could be an answer. Need to investigate.

olivierlambert

Interesting to discover that mac_seed could cause this cryptic issue

XO doesn't have any knowledge of the "state" of the virtual disk. There's no API to expose that (if we can even imagine a way to know that the virtual disk has a problem in the first place).

AndreS

@olivierlambert Indeed. Learned a lot this week. Thanks for walking me through this.

Should I report anything for the mac_seed duplication, which is obviously caused by the export from esxi/import into xcp-ng? The MAC address duplication issue was fixed earlier, if I am not mistaken.

olivierlambert

I'd like to see if we can reproduce this. If yes, then we need to be careful with this parameter on OVA import, indeed. Ping @Darkbeldin so we can discuss how to reproduce the problem first.

Darkbeldin

@olivierlambert Will take a look at it tomorrow morning

AndreS

@darkbeldin Thanks! If I can test anything, let me know. Do note that I have removed the degraded disk from the host and said goodbye to three VMs that I could not copy/migrate/clone anymore. Not a big loss, this is a home lab, but I could not even copy them away to preserve them for later. I do still have the other VM (I imported two last year from esxi).

Darkbeldin

@andres Hi Andres,

Still working on it at the moment; I'll keep you updated when I have more info.

Darkbeldin

@andres OK, after discussing this with the XAPI team, the behavior is apparently intended: you should not have VMs with the same MAC_SEED on the same host.

The dev team will take a deeper look at this to see if the error could at least be reported clearly.
To go further, could you please tell us how you migrated your VMs from VMware?
Did you do it manually? Did you export from VMware and then import into XCP-ng?
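
Since two VMs with the same MAC_SEED should not end up on the same host, it can help to check for duplicates before migrating. Below is a minimal Python sketch of such a check. It assumes the seeds have already been collected into a dict, for example by running xe vm-param-get uuid=<vm-uuid> param-name=other-config param-key=mac_seed for each VM (this assumes the seed is stored under the VM's other-config map). The function name and the inventory values are hypothetical.

```python
from collections import defaultdict


def find_duplicate_mac_seeds(seeds_by_vm):
    """Return {mac_seed: [vm names]} for every seed used by more than one VM."""
    vms_by_seed = defaultdict(list)
    for vm, seed in seeds_by_vm.items():
        if seed:  # skip VMs that have no mac_seed set
            vms_by_seed[seed.strip().lower()].append(vm)
    return {seed: sorted(vms) for seed, vms in vms_by_seed.items() if len(vms) > 1}


# Hypothetical inventory: names and seeds are made up for illustration.
inventory = {
    "imported-vm-1": "aaaaaaaa-0000-0000-0000-000000000001",
    "imported-vm-2": "aaaaaaaa-0000-0000-0000-000000000001",
    "native-vm": "bbbbbbbb-0000-0000-0000-000000000002",
}

for seed, vms in find_duplicate_mac_seeds(inventory).items():
    print(f"mac_seed {seed} is shared by: {', '.join(vms)}")
```

Any VM listed here would need a different mac_seed before it can be moved onto the host where its twin is running.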

AndreS

@darkbeldin We are talking 15-18 months ago ... What I remember is a fairly standard export from esxi into (I believe) an OVA format and I imported that directly into xcp-ng. I don't remember anything complex other than having to redo the export a few times to get properly rid of vmware tools and some drivers.

Edit: I checked a few articles and it was probably the OVF format.

Darkbeldin

@andres Thanks, I will do some testing on my side.

AndreS

@darkbeldin For the record, both VMs were running on the same esxi host when exported. This may or may not be a factor (I suspect it is).