@olivierlambert our OPNsense resets the TCP states, so the firewall blocks the packets because it forgot about the TCP session.
And then a timeout occurred in the middle of the export.
Hello @olivierlambert
I confirm my issue came from my firewall, so it is not related to XO.
However, it would be great to make the logs clearer. I mean:
Error: read ETIMEDOUT
would become
Error: read ETIMEDOUT while connecting to X.X.X.X:ABC
That would make it much quicker to understand my "real and weird" issue.
Best regards,
Hey @Danp,
I can keep my cluster without this node until Monday. Do you have an idea / do you want to investigate this case?
Hello @Danp,
I just reproduced it.
All my nodes are up to date
(my XO is 7 commits behind master; I read the log of the missing commits, and they are not related to my issue).
In my scenario, I had a pool with 3 nodes.
I reinstalled node 3 after a disaster (I force-forgot node 3).
Now, I can't add the host back to the pool.
I also tried to update the host after installation (usually, I do this afterwards),
but it doesn't work anymore.
I will not add my node through XCP-ng Center, to allow you to investigate further.
Here is the detailed operation JSON:
```
{
"id": "0mcwehjzy",
"properties": {
"method": "pool.mergeInto",
"params": {
"sources": [
"64365465-fd4e-25b6-3db2-2cdcd9edba5e"
],
"target": "a92ca4ca-caac-83b9-fa00-4bb75cb48f6c",
"force": true
},
"name": "API call: pool.mergeInto",
"userId": "63a0dbaf-ba2d-4028-b80f-fe49f56686b2",
"type": "api.call"
},
"start": 1752092249470,
"status": "failure",
"updatedAt": 1752092249473,
"end": 1752092249472,
"result": {
"message": "app.getLicenses is not a function",
"name": "TypeError",
"stack": "TypeError: app.getLicenses is not a function\n at enforceHostsHaveLicense (file:///etc/xen-orchestra/packages/xo-server/src/xo-mixins/pool.mjs:15:30)\n at Pools.apply (file:///etc/xen-orchestra/packages/xo-server/src/xo-mixins/pool.mjs:80:13)\n at Pools.mergeInto (/etc/xen-orchestra/node_modules/golike-defer/src/index.js:85:19)\n at Xo.mergeInto (file:///etc/xen-orchestra/packages/xo-server/src/api/pool.mjs:311:15)\n at Task.runInside (/etc/xen-orchestra/@vates/task/index.js:175:22)\n at Task.run (/etc/xen-orchestra/@vates/task/index.js:159:20)\n at Api.#callApiMethod (file:///etc/xen-orchestra/packages/xo-server/src/xo-mixins/api.mjs:469:18)"
}
}
```
@olivierlambert No,
for once, I followed the installation steps carefully ^^'
Hello,
I have an XCP-ng 8.3 pool running 3 hosts, with XOSTOR in 3 replicas and HA enabled.
This setup should allow losing up to 2 nodes without data loss.
Initial information:
I was able to migrate VDIs to XOSTOR successfully (even if, when I start a transfer to XOSTOR, I need to wait ~1 minute before the transfer really starts; I see that in XO).
For my first test, I shut down node 3 (which is neither the master nor the LINSTOR controller).
I didn't want to kill the LINSTOR controller host / pool master immediately; that should be my second / third test.
I stopped node 3 (poweroff from IPMI).
However, the entire pool was dead.
In the xensource.log of all remaining nodes (node 1 and node 2), I can see:
```
Jul 5 15:32:20 node2 xapi: [debug||0 |Checking HA configuration D:9b97e277d80e|helpers] /usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon exited with code 8 [stdout = ''; stderr = 'Sat Jul 5 15:32:20 CEST 2025 ha_start_daemon: the HA daemon stopped without forming a liveset (8)\x0A']
Jul 5 15:32:20 node2 xapi: [ warn||0 |Checking HA configuration D:9b97e277d80e|xapi_ha] /usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon returned MTC_EXIT_CAN_NOT_ACCESS_STATEFILE (State-File is inaccessible)
Jul 5 15:32:20 gco-002-rbx-002 xapi: [ warn||0 |Checking HA configuration D:9b97e277d80e|xapi_ha] ha_start_daemon failed with MTC_EXIT_CAN_NOT_ACCESS_STATEFILE: will contact existing master and check if HA is still enabled
```
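For the record, a check I could run next time to see which statefile VDI HA points at (an assumption on my side, not something I actually ran during the incident):
```
# Hypothetical check: show the VDI(s) used as the HA statefile for the pool
xe pool-param-get uuid=<pool-uuid> param-name=ha-statefiles
```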
However, the storage layer was OK:
```
[15:33 node1 linstor-controller]# linstor node list
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ NodeType ┊ Addresses ┊ State ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ h1 ┊ COMBINED ┊ 192.168.1.1:3366 (PLAIN) ┊ Online ┊
┊ h2 ┊ COMBINED ┊ 192.168.1.2:3366 (PLAIN) ┊ Online ┊
┊ h3 ┊ COMBINED ┊ 192.168.1.3:3366 (PLAIN) ┊ OFFLINE (Auto-eviction: 2025-07-05 16:33:42) ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
The volumes were also OK, according to linstor volume list:
```
[15:33 r1 linstor-controller]# linstor volume list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ r1 ┊ xcp-persistent-database ┊ xcp-sr-linstor_group_thin_device ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 52.74 MiB ┊ InUse ┊ UpToDate ┊
┊ r2 ┊ xcp-persistent-database ┊ xcp-sr-linstor_group_thin_device ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 6.99 MiB ┊ Unused ┊ UpToDate ┊
┊ r3 ┊ xcp-persistent-database ┊ xcp-sr-linstor_group_thin_device ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 6.99 MiB ┊ ┊ Unknown ┊
```
(I didn't paste the entire list of volumes; while writing this post, I feel a bit stupid for not having saved the entire output.)
I finally solved my issue by bringing node 3 back up, which promoted itself as master, but I need to perform this test again because the result is not the expected one.
Did I do something wrong?
Hello @nikade,
Yeah, I already thought of it like this, but I'm surprised by the result of the calculation.
I should have either 3 nodes that can die, or "1", but I should not have the same number, right?
In my case, I ran the command on a 3-host cluster and it returned 3 as the value, which disturbs me.
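For context, I assume the command in question is the XAPI HA planner helper (hypothetical on my side; correct me if it was another one):
```
# Assumed command: ask XAPI how many host failures HA can tolerate
xe pool-ha-compute-max-host-failures-to-tolerate
```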
Best regards,
Hello @fred974
Did you find your answer?
I don't clearly understand how a 4-node pool can tolerate 4 dead nodes.
Hello,
I plan to install my XOSTOR cluster on a pool of 7 nodes with 3 replicas, but not on all nodes at once, because the disks are in use.
Consider: 2 disks on each node.
I emptied nodes 6 & 7.
So, here is what I plan to do:
Run the install script on nodes 6 & 7 to add their disks, so:
```
node6# install.sh --disks /dev/sdb
node7# install.sh --disks /dev/sdb
```
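To check that the script created the expected volume group (a hedged check on my side, assuming the group name matches the one used in the SR creation below):
```
# Hypothetical verification: the LVM group backing XOSTOR should now exist
vgs linstor_group
```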
Then, configure the SR and the LINSTOR plugin manager as follows:
```
xe sr-create \
  type=linstor name-label=pool-01 \
  host-uuid=XXXX \
  device-config:group-name=linstor_group/thin_device \
  device-config:redundancy=3 shared=true device-config:provisioning=thin
```
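At that point, a quick check I would do (my assumption, not from the official guide) is to confirm the SR is visible and shared:
```
# Hypothetical check: list LINSTOR SRs known to the pool
xe sr-list type=linstor params=uuid,name-label,shared
```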
Normally, I should have a LINSTOR cluster running on 2 nodes (2 satellites and one controller randomly placed) with only 2 disks, and thus only 2/3 working replicas.
The cluster SHOULD be usable (am I right on this point?).
The next step would be to move the VMs off node 5 to evacuate it, and then add it to the cluster as follows:
```
node5# install.sh --disks /dev/sdb
node5# xe host-call-plugin \
  host-uuid=node5-uuid \
  plugin=linstor-manager \
  fn=addHost args:groupName=linstor_group/thin_device
```
That should deploy a satellite on node 5 and add the disk.
I should then have 3/3 working replicas and can start to deploy the other nodes progressively; see the check below.
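To verify the replica state at that point, I would run the usual LINSTOR views (a sketch, assuming the controller is reachable from that shell):
```
# All three satellites should be Online
linstor node list
# Each resource should show three UpToDate replicas
linstor resource list
```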
Am I right about the process?
As mentioned on the Discord, I will post my feedback and results from my setup once I have finalized it (maybe through a blog post somewhere).
Thanks for providing XOSTOR as open source; it's clearly the missing piece for this open-source virtualization stack (vs Proxmox).
Hello @TheNorthernLight,
XOSTOR is now available on 8.3, since it's the LTS release.