XCP-ng

    Short VM freeze when migrating to another host

    • arc1

      Hi,
      We have XCP-ng 8.2.1 hosts with the latest Xen Orchestra.
      When we migrate VMs (mostly Rocky Linux 9 hosts, some CentOS 7 too), the VM freezes briefly. VMs running databases, etcd or other sensitive software report an error for a short moment. The VMs continue to work without any issue, but is there any way to avoid that freeze?
      Kind regards!

      Rocky Linux 9 log (on CentOS 7 we get a similar error):

      Aug 08 13:46:38 rocky9linux kernel: Freezing user space processes ... (elapsed 0.003 seconds) done.
      Aug 08 13:46:38 rocky9linux kernel: OOM killer disabled.
      Aug 08 13:46:38 rocky9linux kernel: Freezing remaining freezable tasks ... (elapsed 0.006 seconds) done.
      Aug 08 13:46:38 rocky9linux kernel: ------------[ cut here ]------------
      Aug 08 13:46:38 rocky9linux kernel: WARNING: CPU: 1 PID: 2176896 at kernel/workqueue.c:3162 __flush_work.isra.0+0x212/0x230
      Aug 08 13:46:38 rocky9linux kernel: Modules linked in: tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink vfat fat ppdev joydev pcspkr bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt parport_pc parport i2c_piix4 drm fuse xfs libcrc32c sr_mod cdrom sg ata_generic ata_piix libata xen_netfront xen_blkfront crc32c_intel serio_raw dm_mirror dm_region_hash dm_log dm_mod
      Aug 08 13:46:38 rocky9linux kernel: CPU: 1 PID: 2176896 Comm: kworker/u128:4 Kdump: loaded Tainted: G        W         -------  ---  5.14.0-362.8.1.el9_3.x86_64 #1
      Aug 08 13:46:38 rocky9linux kernel: Hardware name: Xen HVM domU, BIOS 4.13 01/31/2024
      Aug 08 13:46:38 rocky9linux kernel: Workqueue: events_unbound async_run_entry_fn
      Aug 08 13:46:38 rocky9linux kernel: RIP: 0010:__flush_work.isra.0+0x212/0x230
      Aug 08 13:46:38 rocky9linux kernel: Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 25 ff ff ff 0f 0b e9 4e ff ff ff <0f> 0b 45 31 ed e9 44 ff ff ff e8 df 89 b2 00 66 66 2e 0f 1f 84 00
      Aug 08 13:46:38 rocky9linux kernel: RSP: 0018:ffffa2f1850afcb8 EFLAGS: 00010246
      Aug 08 13:46:38 rocky9linux kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffa9b929b7
      Aug 08 13:46:38 rocky9linux kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8d6487f2cb30
      Aug 08 13:46:38 rocky9linux kernel: RBP: ffff8d6487f2cb30 R08: 0000000000000000 R09: ffff8d638e1021f4
      Aug 08 13:46:38 rocky9linux kernel: R10: 000000000000000f R11: 000000000000000f R12: ffff8d6487f2cb30
      Aug 08 13:46:38 rocky9linux kernel: R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
      Aug 08 13:46:38 rocky9linux kernel: FS:  0000000000000000(0000) GS:ffff8d648a640000(0000) knlGS:0000000000000000
      Aug 08 13:46:38 rocky9linux kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Aug 08 13:46:38 rocky9linux kernel: CR2: 00007fd03d4be2a2 CR3: 000000000302c006 CR4: 00000000000206e0
      Aug 08 13:46:38 rocky9linux kernel: Call Trace:
      Aug 08 13:46:38 rocky9linux kernel:  <TASK>
      Aug 08 13:46:38 rocky9linux kernel:  ? show_trace_log_lvl+0x1c4/0x2df
      Aug 08 13:46:38 rocky9linux kernel:  ? show_trace_log_lvl+0x1c4/0x2df
      Aug 08 13:46:38 rocky9linux kernel:  ? __cancel_work_timer+0x103/0x190
      Aug 08 13:46:38 rocky9linux kernel:  ? __flush_work.isra.0+0x212/0x230
      Aug 08 13:46:38 rocky9linux kernel:  ? __warn+0x81/0x110
      Aug 08 13:46:38 rocky9linux kernel:  ? __flush_work.isra.0+0x212/0x230
      Aug 08 13:46:38 rocky9linux kernel:  ? report_bug+0x10a/0x140
      Aug 08 13:46:38 rocky9linux kernel:  ? handle_bug+0x3c/0x70
      Aug 08 13:46:38 rocky9linux kernel:  ? exc_invalid_op+0x14/0x70
      Aug 08 13:46:38 rocky9linux kernel:  ? asm_exc_invalid_op+0x16/0x20
      Aug 08 13:46:38 rocky9linux kernel:  ? __flush_work.isra.0+0x212/0x230
      Aug 08 13:46:38 rocky9linux kernel:  __cancel_work_timer+0x103/0x190
      Aug 08 13:46:38 rocky9linux kernel:  ? set_next_entity+0xda/0x150
      Aug 08 13:46:38 rocky9linux kernel:  drm_kms_helper_poll_disable+0x1e/0x40 [drm_kms_helper]
      Aug 08 13:46:38 rocky9linux kernel:  drm_mode_config_helper_suspend+0x1c/0x80 [drm_kms_helper]
      Aug 08 13:46:38 rocky9linux kernel:  pci_pm_freeze+0x53/0xc0
      Aug 08 13:46:38 rocky9linux kernel:  ? __pfx_pci_pm_freeze+0x10/0x10
      Aug 08 13:46:38 rocky9linux kernel:  dpm_run_callback+0x4c/0x140
      Aug 08 13:46:38 rocky9linux kernel:  __device_suspend+0x112/0x470
      Aug 08 13:46:38 rocky9linux kernel:  async_suspend+0x1b/0x90
      Aug 08 13:46:38 rocky9linux kernel:  async_run_entry_fn+0x30/0x130
      Aug 08 13:46:38 rocky9linux kernel:  process_one_work+0x1e5/0x3b0
      Aug 08 13:46:38 rocky9linux kernel:  worker_thread+0x50/0x3a0
      Aug 08 13:46:38 rocky9linux kernel:  ? __pfx_worker_thread+0x10/0x10
      Aug 08 13:46:38 rocky9linux kernel:  kthread+0xe0/0x100
      Aug 08 13:46:38 rocky9linux kernel:  ? __pfx_kthread+0x10/0x10
      Aug 08 13:46:38 rocky9linux kernel:  ret_from_fork+0x2c/0x50
      Aug 08 13:46:38 rocky9linux kernel:  </TASK>
      Aug 08 13:46:38 rocky9linux kernel: ---[ end trace 18c4db6d6eef5f95 ]---
      Aug 08 13:46:38 rocky9linux kernel: suspending xenstore...
      Aug 08 13:46:38 rocky9linux kernel: xen:grant_table: Grant tables using version 1 layout
      Aug 08 13:46:38 rocky9linux kernel: xen: --> irq=9, pirq=16
      Aug 08 13:46:38 rocky9linux kernel: xen: --> irq=8, pirq=17
      Aug 08 13:46:38 rocky9linux kernel: xen: --> irq=12, pirq=18
      Aug 08 13:46:38 rocky9linux kernel: xen: --> irq=1, pirq=19
      Aug 08 13:46:38 rocky9linux kernel: xen: --> irq=6, pirq=20
      Aug 08 13:46:38 rocky9linux kernel: xen: --> irq=4, pirq=21
      Aug 08 13:46:38 rocky9linux kernel: xen: --> irq=7, pirq=22
      Aug 08 13:46:38 rocky9linux kernel: xen: --> irq=23, pirq=23
      Aug 08 13:46:38 rocky9linux kernel: xen: --> irq=28, pirq=24
      Aug 08 13:46:38 rocky9linux kernel: usb usb1: root hub lost power or was reset
      Aug 08 13:46:38 rocky9linux kernel: ata2: found unknown device (class 0)
      Aug 08 13:46:38 rocky9linux kernel: usb 1-2: reset full-speed USB device number 2 using uhci_hcd
      Aug 08 13:46:38 rocky9linux kernel: OOM killer enabled.
      Aug 08 13:46:38 rocky9linux kernel: Restarting tasks ... done.
      Aug 08 13:46:38 rocky9linux NetworkManager[687]: <info>  [1723117598.8391] device (enX0): carrier: link connected
      Aug 08 13:46:38 rocky9linux kernel: Setting capacity to 41943040
      Aug 08 13:46:39 rocky9linux xe-daemon[669]: Trigger refresh after system resume
      
      • nikade Top contributor

        That's weird, I have many VMs running SQL Server, MySQL and PostgreSQL and they migrate just fine.
        Some questions:

        1. Are VM tools installed?
        2. What network speed do you have?
        3. How much RAM does that XCP-ng host have assigned to dom0?
        4. Are you using dynamic memory?
        • arc1 @nikade

          Hi @nikade, thank you for the fast answer. Which OS are you using?

          Here is the information about my setup:
          1. Are VM tools installed?
          Yes, they are: version 7.30.0-7.el9.

          2. What network speed do you have?
          2x 25 Gbit/s in LACP mode on the hosts, with 4 paths to the iSCSI storage.

          3. How much RAM does that XCP-ng host have assigned to dom0?
          8 GB; the host has 768 GB.

          4. Are you using dynamic memory?
          No, dynamic memory is not enabled (in the VM's advanced settings, dynamic memory is 16/16 GiB).

          Thank you again for the help!

          • nikade Top contributor @arc1

            @arc1 We're using Windows Server 2012 R2 (I know it's EOL), Windows Server 2016 and Windows Server 2022 with SQL Server.
            We're using Debian 11/12 for MySQL and Ubuntu for PostgreSQL, so quite the mix 🙂

            Do you lose ping when pinging the VM while migrating?

            • arc1 @nikade

              @nikade Yes, we usually lose 4 (±1) pings.
              The freeze also occurs on the Xen Orchestra VM, which runs Debian 11 (the only Debian-based VM in our environment).
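
To put a number on the outage instead of counting lost pings by eye, the reply timestamps can be scanned for gaps. The following is only a sketch: `find_gaps` and its parameters are hypothetical names, and it assumes one timestamp (in seconds) per received reply, sent at a fixed interval, for example collected from `ping -D` output on Linux.

```python
def find_gaps(timestamps, expected_interval=1.0, tolerance=0.5):
    """Return (last_reply_before_gap, extra_delay) for each noticeable gap.

    timestamps: arrival times in seconds of the replies that were received.
    expected_interval: assumed time between two probes (1 s is ping's default).
    tolerance: slack allowed before a delay counts as a gap.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > expected_interval + tolerance:
            # Time beyond the normal interval approximates the outage length.
            gaps.append((prev, delta - expected_interval))
    return gaps


if __name__ == "__main__":
    # Replies at t=0,1,2, then nothing until t=7: about four probes were lost.
    print(find_gaps([0.0, 1.0, 2.0, 7.0, 8.0]))
```

Note that this measures reachability from the outside, so it also includes any time the network needs to learn the VM's new location, not only the pause of the guest itself.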

              • planedrop Top contributor @arc1

                @arc1 Could you try it with a Windows VM, just to see if the problem goes away (not SQL, just pinging)? That would help diagnose whether it is somehow your infrastructure or an XCP-ng-specific thing with certain Linux VMs.

                I so far haven't seen behavior like this though.

                • nikade Top contributor @arc1

                  @arc1 said in Short VM freeze when migrating to another host:

                  @nikade Yes, we usually lose 4 (±1) pings.
                  The freeze also occurs on the Xen Orchestra VM, which runs Debian 11 (the only Debian-based VM in our environment).

                  That's very strange. I've played around a bit at work and our VMs do not freeze when migrating between hosts. I don't even lose a single ping.
                  In our setup each host has 2x 10G in a LACP bond, with the management network and all the other VLANs on top of that bond. dom0 has 16 GB of RAM, but I don't think that matters a whole lot.

                  What about MAC aging or similar? Since the VM's MAC address moves from one switch port to another, the learning TTL has to expire before the switch knows where to send the new traffic.

                  • zmk

                    There is always a short-term freeze when migrating to another host.
                    How short is this short-term freeze?
                    If it is so short that no one notices it, then no one notices it...

                    • nikade Top contributor

                      I can't even notice it. I tried moving the mouse around with Task Manager open in a Windows VM and running top in a Linux VM, and I can't really see a freeze when I migrate my VMs around.

                      Maybe it depends on how much RAM the VM has? My test VMs only have 2-8 GB of RAM.

                      • zmk @nikade

                        It depends on how much RAM has not yet been copied to the new VM-server at the time of the freeze.

                        If a test virtual machine does virtually nothing, then there are not many changes in its memory.

                        • arc1

                          @nikade @planedrop @zmk Thank you all for answering.
                          We ran the test with Rocky Linux, CentOS 7, Ubuntu 22.04 and Windows Server 2022.
                          On Windows Server we only lose a few pings (10 pings in the test environment); on Linux we also see log entries about the VM freeze.
                          The Windows VM isn't busy at all, it is only a test VM, but we still lose about 10 pings.

                          Vates support said that "depending on the load and the RAM size you can have some freeze of the VM during migration, unfortunately at the moment there is not a lot that can be done about that".

                          I'm just curious why @nikade and @planedrop don't get any freeze.

                          • rfx77 @arc1

                            @arc1 Same situation here. We also had dmesg entries when doing live migration, but the VM did not have any issues besides that.

                            • zmk

                              What could the algorithm for copying the RAM of a running virtual machine to another host look like?

                              1. Copy the RAM of the running VM to the other host.
                              2. While that copy was in progress, the RAM of the running VM changed.
                              3. Copy the changes.
                              4. While that copy was in progress, the RAM of the running VM changed again.
                              5. Copy the changes.

                              At this point it is clear that this would loop forever, so the migration has to end differently:
                              Freeze the running virtual machine.
                              The RAM of the frozen virtual machine no longer changes.
                              Copy the remaining changed RAM of the frozen virtual machine.
                              After that copy, the RAM of the frozen VM on the old host matches the RAM of the VM on the new host.
                              Unfreeze the VM on the new host.

                              The more uncopied changes there are at the moment of freezing, the longer the freeze lasts.

                              Copying the remaining changes after the freeze cannot happen instantly.
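
The loop described above can be sketched as a small simulation. This is a toy model and not Xen's actual migration code: the function name, the 64 MiB stop threshold and the round limit are assumptions chosen for illustration, and the guest is assumed to dirty memory at a constant rate.

```python
def simulate_precopy(ram_mib, dirty_rate_mib_s, bandwidth_mib_s,
                     max_rounds=30, stop_threshold_mib=64):
    """Simulate iterative pre-copy; return (rounds, final_pause_seconds).

    Each round copies the memory that is still pending while the VM keeps
    running and dirties more pages. Once the pending set is small enough
    (or the round limit is hit), the VM is frozen and the rest is copied.
    """
    pending = float(ram_mib)  # the first round copies all of RAM
    rounds = 0
    while pending > stop_threshold_mib and rounds < max_rounds:
        copy_time = pending / bandwidth_mib_s
        # Memory dirtied while this round was in flight, capped at RAM size.
        pending = min(float(ram_mib), dirty_rate_mib_s * copy_time)
        rounds += 1
    # Stop-and-copy phase: the VM is frozen while the remainder is sent.
    final_pause = pending / bandwidth_mib_s
    return rounds, final_pause


if __name__ == "__main__":
    # Hypothetical 16 GiB VM on a link that moves 1000 MiB/s.
    print(simulate_precopy(16384, dirty_rate_mib_s=100, bandwidth_mib_s=1000))
    print(simulate_precopy(16384, dirty_rate_mib_s=2000, bandwidth_mib_s=1000))
```

With these made-up numbers, the first call converges after three rounds and freezes the VM for roughly 16 ms. In the second call the guest dirties memory faster than the link can carry it, so the loop never converges, the round limit is hit, and the final freeze has to move the full 16 GiB (about 16 s). That matches the point made above: the freeze grows with the amount of uncopied changes.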

                              • rfx77 @zmk

                                @zmk We only had the dmesg entries on Xen, not on VMware and not on Hyper-V.

                                • nikade Top contributor @arc1

                                  @arc1 How much RAM, CPU and disk do your VMs have?
                                  It seems like something is taking too long in the last phase of the migration, when the source and destination VMs are synchronized.

                                  • zmk

                                    The problem may be in the transfer speed between hosts.

                                    • nikade Top contributor @zmk

                                      @zmk Yeah, maybe. Each of our hosts is connected to the network with 2x 10G, and during a migration (without storage migration) between two hosts in the pool I see traffic spike to 6-7 Gbit/s.
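
For a rough sense of scale, the time needed to move a given amount of memory at an observed throughput is simple arithmetic. The helper below is a back-of-the-envelope sketch with a hypothetical name; it ignores protocol overhead and assumes the link sustains the stated rate for the whole transfer.

```python
def transfer_seconds(size_gib, link_gbit_s):
    """Seconds needed to send size_gib GiB at link_gbit_s Gbit/s."""
    size_bits = size_gib * 1024**3 * 8   # GiB -> bytes -> bits
    return size_bits / (link_gbit_s * 1e9)


if __name__ == "__main__":
    # Example figures only: a 16 GiB VM and a 6.5 Gbit/s migration stream.
    print(f"full copy: {transfer_seconds(16, 6.5):.1f} s")
    print(f"1 GiB of leftover changes: {transfer_seconds(1, 6.5):.2f} s")
```

At 6.5 Gbit/s, one full pass over 16 GiB takes roughly 21 seconds, and every GiB of memory still uncopied when the VM is frozen adds a bit over a second of downtime.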

                                      • arc1 @nikade

                                        @nikade 4 vCPUs, 16 GB of RAM and roughly 200 GB of disk.
                                        The 10-ping downtime was in the test environment, which has slower links between hosts, so that explains the longer freeze.
                                        But in production, with 2x 25 Gbit/s LACP, there is still a noticeable freeze on VMs with more sensitive software (keepalived/etcd). Nothing too terrible, we were just curious whether this is normal behaviour.

                                        • nikade Top contributor @arc1

                                          @arc1 So if you go to XOA and open the console of the VM, what happens then?
                                          Is the VM frozen for the duration of those 10 pings? Open Task Manager to see if there is any CPU activity.

                                          • arc1 @nikade

                                            @nikade Yes, the VM is frozen, with no CPU activity.
