    XOSTOR Global network disruption test


      Summary

      This test covers the following scenarios:

      • Storage network is down
      • All networks are down

      Impact:

      • Hosts can no longer see each other.
      • LINSTOR DRBD replication can no longer work (every volume becomes read-only).

      Expected results:

      • All write operations on VMs fail (a quick probe is sketched below).
      • A reboot solves the issue.
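
      To verify the write-failure expectation without relying on cached pages, a direct-I/O write is a handy probe. Just a sketch; the file path and size are arbitrary:

      # Bypass the page cache so a storage-level I/O error surfaces immediately
      dd if=/dev/zero of=~/xostor-write-test bs=4k count=1 oflag=direct conv=fsync
      # While the storage network is down, this should fail with "Input/output error"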

      Environment

      • 3 hypervisors
        • Node 1
        • Node 2
        • Node 3 (master)
      • 13 VMs (a mix of Windows Server and Rocky Linux VMs on XFS; no Kubernetes in this test).
      • The VM observation point is VM1.

      We didn't test filesystems other than XFS for the Linux-based operating systems, because XFS is the only one we use.

      Procedure

      • Unplug the network cables from all non-master nodes:
        • Node 1
        • Node 2
      • Keep networking only on the master XCP-ng node, to retain management access and observe the behavior.
      • Access a VM located on the master node (which is still reachable).
      • Try to write on VM1 and confirm that you get an I/O error.
      • Wait 5 minutes.
      • Plug node 1 and node 2 back in.
      • Check the state of all VMs (see the sketch after this list).
        • Reboot them if needed.
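
      A quick way to check VM states from the still-reachable master is the xe CLI. A sketch, nothing XOSTOR-specific here:

      # On the master's console: power state of every VM in the pool
      xe vm-list params=name-label,power-state
      # Inside a Linux VM: list any filesystems that ended up read-only
      awk '$4 ~ /^ro/ {print $1, $2}' /proc/mounts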

      Execution

      • Cables disconnected from node 1 and node 2.
      • From VM1, we get:
      [hdevigne@VM1 ~]$ htop^C
      [hdevigne@VM1 ~]$ echo "coucou" > test
      -bash: test: Input/output error
      [hdevigne@VM1 ~]$ dmesg
      -bash: /usr/bin/dmesg: Input/output error
      [hdevigne@VM1 ~]$ d^C
      [hdevigne@VM1 ~]$ sudo -i
      -bash: sudo: command not found
      [hdevigne@VM1 ~]$ dm^C
      [hdevigne@VM1 ~]$ sudo -i
      -bash: sudo: command not found
      [hdevigne@VM1 ~]$ dmesg
      -bash: /usr/bin/dmesg: Input/output error
      [hdevigne@VM1 ~]$ mount
      -bash: mount: command not found
      [hdevigne@VM1 ~]$ sud o-i
      -bash: sud: command not found
      [hdevigne@VM1 ~]$ sudo -i
      

      ✅ As we predicted, the VM is completely fucked up 😄

      • Windows VMs crash and reboot in a loop.

      • The LINSTOR controller was on node 1, so we were not able to see the LINSTOR node states, but we assume they were "disconnected" and "pending eviction". It doesn't matter much: disks were read-only and VMs broke after writing, which was the behavior we expected. (The commands we would have run are sketched below.)
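
      Had the controller node been reachable, these are the standard commands to inspect the cluster. A sketch; note that drbdadm reads the local DRBD kernel state and works even without the controller:

      # Connection state of each satellite ("Online", "OFFLINE", "EVICTED", ...)
      linstor node list
      # Per-node state of each replicated resource
      linstor resource list
      # DRBD view straight from any node, no LINSTOR controller needed
      drbdadm status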

      • Re-plug node 1 and node 2.

      • Windows VMs boot normally.

      • Linux VMs stay in a "broken state":

      ➜  ~ ssh VM1 
      suConnection closed by UNKNOWN port 65535
      
      • Force rebooting all VMs from Xen Orchestra brings them all back to a correct state (the xe equivalent is sketched below).
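
      The same force reboot can be done from the CLI if Xen Orchestra itself is among the broken VMs. A sketch; the UUID is a placeholder:

      # Find the stuck VM
      xe vm-list params=uuid,name-label,power-state
      # Hard reboot, without waiting for the (unresponsive) guest
      xe vm-reboot uuid=<vm-uuid> force=true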

      Limitations of the test

      We didn't let the outage last long enough to reach the eviction state of the LINSTOR nodes, but the documentation shows that restoring a LINSTOR node would work (see https://docs.xcp-ng.org/xostor/#what-to-do-when-a-node-is-in-an-evicted-state), as sketched below.
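
      Per the linked documentation, recovery from eviction boils down to restoring the node once it is back online. A sketch; "node1" is a placeholder name:

      # Undo the eviction once the node is reachable again
      linstor node restore node1
      # Then watch the resources resync
      linstor resource list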
      We didn't use HA in the cluster this time; it could have helped a bit with the recovery process. But in a previous experiment that I didn't write up like this one, HA was completely down because it was not able to mount a file. I will probably write another topic on the forum to make those results public.

      Important notes

      Having HA changes the criticality of the following notes.

      • This test shows us that as long as we don't have HA, management components should NOT be placed on XOSTOR, to avoid losing access to them when the VMs need a reboot.
        • If we keep the idea of putting management components (XO, firewall, etc.) on XOSTOR without HA, we have to accept a longer recovery time, because recovery will be "manual" via IPMI.
      • Maybe we should simply force-reboot the nodes after network recovery? A bit brutal, but that's how HA works. (Enabling HA for the follow-up test is sketched below.)
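
      For the "retry with HA" follow-up, enabling HA is done once on the pool master. A sketch; the heartbeat SR UUID is a placeholder, and choosing which SR to use for the heartbeat is precisely part of what that test should answer:

      # Enable XAPI HA on the pool, with a dedicated heartbeat SR
      xe pool-ha-enable heartbeat-sr-uuids=<sr-uuid>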

      Credit

      Thanks to @olivierlambert, @ronan, and the other people on the Discord channel for answering daily questions, which allowed this kind of test to be made. As promised, I'm putting my results online 🙂

      Thanks for XOSTOR.

      Further tests to do: retry with HA.
