XCP-ng

    XOSTOR hyperconvergence preview

    • olivierlambert Vates 🪐 Co-Founder CEO

      Also, keep in mind that LINSTOR puts things in read-only as soon as you are under your replication target.

      In a 3-host scenario, it means:

      • if you have a replication of 3, any host that becomes unreachable will trigger read-only mode on the 2 others
      • if you have a replication of 2, you can lose one host without any consequence

      So for 3 machines, replication 2 is the sweet spot in terms of availability.
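
      For reference, the replication count is the redundancy value passed at SR creation. A rough sketch along the lines of the sr-create example from the first post (group name, provisioning mode and UUID are placeholders, adjust to your setup):

      # create the XOSTOR SR with a replication of 2 (placeholders to adapt)
      xe sr-create type=linstor name-label=XOSTOR host-uuid=<MASTER_HOST_UUID> shared=true \
          device-config:group-name=linstor_group/thin_device \
          device-config:provisioning=thin \
          device-config:redundancy=2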

      • Wilken

        Hi,

        I've run the install script on an XCP-ng 8.2.1 host. In the output of the following command:

        rpm -qa | grep -E "^(sm|xha)-.*linstor.*"

        sm-2.30.8-2.1.0.linstor.5.xcpng8.2.x86_64

        the package

        xha-10.1.0-2.2.0.linstor.1.xcpng8.2.x86_64

        is missing, because xha is already installed in version:

        xha-10.1.0-2.1.xcpng8.2.x86_64

        from XCP-ng itself.

        Is this package still needed from the linstor repo?
        Should I uninstall it and re-run the install script?

        BR,
        Wilken

        • olivierlambert Vates 🪐 Co-Founder CEO

          question for @ronan-a

          • ronan-a Vates 🪐 XCP-ng Team @Wilken

            @Wilken The modified version of the xha package is no longer needed. You can use the latest version without the linstor tag.

            It's not necessary to reinstall your XOSTOR SR.
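
            If you want to double-check which xha build is installed, a quick check (expected output based on the versions quoted above):

            rpm -qa | grep -E "^xha-"
            # on a stock host this should show: xha-10.1.0-2.1.xcpng8.2.x86_64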

            • Wilken

              Thank you @olivierlambert and @ronan-a for the quick answer and clarification!

              BR,
              Wilken

              • gb.123

                @ronan-a

                Hi!
                Before I test this, I have a small question:
                If the VM is encrypted and the XOSTOR SR is enabled, is the VM + memory replicated, or just the VDI?
                Once the 1st node is down, will the 2nd node take over as-is, or will the 2nd node go to the 'boot' stage where it asks for the decryption password?

                Thanks

                • ronan-a Vates 🪐 XCP-ng Team @gb.123

                  @gb-123 How is the VM encrypted? Only the VDIs are replicated.

                  • gb.123 @ronan-a

                    @ronan-a

                    VMs would be using LUKS encryption.

                    So if only the VDI is replicated and, hypothetically, I lose the master node or any other node actually running the VM, will I have to create the VM again using the replicated disk? Or would it be something like DRBD, where there are actually 2 VMs running in Active/Passive mode and there is an automatic switchover? Or would it be that one VM is running and the second gets automatically started when the 1st is down?

                    Sorry for the noob questions. I just wanted to be sure of the implementation.

                    • Maelstrom96 @gb.123

                      @gb-123 said in XOSTOR hyperconvergence preview:

                      VMs would be using LUKS encryption.

                      So if only the VDI is replicated and, hypothetically, I lose the master node or any other node actually running the VM, will I have to create the VM again using the replicated disk? Or would it be something like DRBD, where there are actually 2 VMs running in Active/Passive mode and there is an automatic switchover? Or would it be that one VM is running and the second gets automatically started when the 1st is down?

                      The VM metadata is at the pool level, meaning that you wouldn't have to re-create the VM if the current VM host has a failure. However, memory can't be and isn't replicated in the cluster, unless you're doing a live migration, which temporarily copies the VM memory to the new host so it can be moved.

                      DRBD only replicates the VDI, in other terms the disk data, across the active LINSTOR members. If the VM is stopped or terminated because of a host failure, you should be able to start it back up on another host in your pool, but by default this requires manual intervention, and you will have to enter your encryption password since it will be a cold boot.
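
                      As an illustration of that manual path, something like this on a surviving host (a hedged sketch; names are placeholders):

                      # clear the stale "running" state left by the failed host, then start the VM elsewhere
                      xe vm-reset-powerstate vm=<VM_NAME> --force
                      xe vm-start vm=<VM_NAME> on=<SURVIVING_HOST_NAME>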

                      If you want the VM to automatically self-start in case of failure, you can use the HA feature of XCP-ng. This wouldn't solve your issue of having to enter your encryption password since, as explained earlier, the memory isn't replicated and the VM would cold boot from the replicated VDI. Also, keep in mind that enabling HA adds maintenance complexity and might not be worth it.
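
                      A minimal sketch of what enabling HA could look like, assuming the shared XOSTOR SR is used for the heartbeat (UUIDs are placeholders):

                      # enable HA on the pool, using the shared SR for the heartbeat/statefile
                      xe pool-ha-enable heartbeat-sr-uuids=<XOSTOR_SR_UUID>
                      # mark the VM for automatic restart after a host failure
                      xe vm-param-set uuid=<VM_UUID> ha-restart-priority=restart order=1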

                      • gb.123 @Maelstrom96

                        @Maelstrom96

                        Thanks for your clarification!
                        I was thinking of testing HA with XOSTOR (if that is possible at all). XOSTOR would also be treated as a 'Shared SR', I guess?

                        • ronan-a Vates 🪐 XCP-ng Team @gb.123

                          @gb-123 Yes, use the shared flag as in the sr-create example of the first post, and you can activate HA like on any shared SR.
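
                          For example, to confirm the SR is seen as shared (standard xe CLI fields):

                          xe sr-list type=linstor params=uuid,name-label,shared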

                          • Swen

                            Hi @ronan-a,
                            we did some performance testing with the latest version and we ran into a bottleneck we are unable to identify in detail.

                            Here is our setup:
                            Dell R730
                            CPU: 2x Intel E5-2680v4
                            RAM: 384GB
                            Storage: 2x NVMe Samsung PM9A3 3.84TB via U.2 PCIe 3 x16 Extender Card
                            NICs: 2x 10G Intel, 2x 40G Intel

                            We have 3 servers with the same configuration and installed them as a cluster with a replica count of 2.
                            xcp-ng 8.2 with the latest patches is installed. All servers are connected to the same switch (2x QFX5100-24Q, configured as a virtual chassis). We are using an LACP bond on the 40G interfaces.

                            When using the 10G interfaces (xcp-ng uses those as management interfaces) for linstor traffic, we run into a cap on the NIC bandwidth of around 4 Gbit/s (500 MB/s).
                            When using the bonded 40G interfaces, the cap is around 8 Gbit/s (1000 MB/s).

                            Only 1 VM is installed on the pool. We are using Ubuntu 22.04 LTS with the latest updates, installed from ISO using the template for Ubuntu 20.04.

                            Here is the fio command we are using:
                            fio --name=a --direct=1 --bs=1M --iodepth=32 --ioengine=libaio --rw=write --filename=/tmp/test.io --size=100G

                            I would expect far more, because we do not hit any known bottleneck of the interfaces, the NVMe drives or the PCIe slot. Am I missing something? Is this the expected performance? If not, any idea what the bottleneck is? Does anybody have some data we can compare with?

                            regards,
                            Swen

                            • olivierlambert Vates 🪐 Co-Founder CEO

                              1. use an iodepth of 128
                              2. use 4 processes at the same time (numjobs=4)
                              3. use io_uring if you can in the guest (and not libaio)
                              4. don't use a test file but bench directly on a non-formatted device (like /dev/xvdb); this removes the filesystem layer (see the combined command below)
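
                              Putting those together, something along these lines (the /dev/xvdb device name is an example and the run writes directly to that device, so it must not hold any data; fall back to --ioengine=libaio if the guest's fio or kernel lacks io_uring):

                              fio --name=bench --filename=/dev/xvdb --direct=1 --bs=1M \
                                  --iodepth=128 --numjobs=4 --ioengine=io_uring --rw=write \
                                  --size=100G --group_reporting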

                              With those settings in fio, I can reach nearly 2600 MiB/s in read and 900 MiB/s in write with 4 virtual disks in mdadm RAID0, in the guest (a test VM on Debian 12), on rather "old" Xeon CPUs and a PCIe 3 port with a consumer-grade NVMe SSD.

                              Also, last thing to know: if you use thin provisioning, you need to run the test twice; the first run (while the VHD is growing) is always slower. This is not a problem in real life: run the test 2 or 3 times and check the results, without counting the first run.

                              I'm about to get more recent hardware (except the NVMe) to re-run some tests this week. But as you can see, you can go over a 20G network (I'm using a 25G NIC).

                              • Swen @olivierlambert

                                @olivierlambert thanks for the feedback! I don't get how you can tell that you reach 20G on your NIC. Can you please explain it? I see that you reach 2600 MiB/s in read, but that is more likely from the local disks, isn't it? What I can see in our lab environment is that, for whatever reason, we do not get more than around 8 Gbit/s through a 40G interface and 4 Gbit/s through a 10G interface, and therefore we do not get any good performance out of the storage repository. I am unable to find the root cause of this. Do you have any idea where to look? I can see high waits on the OS of the VM, but no waits inside dom0 on any node.

                                • olivierlambert Vates 🪐 Co-Founder CEO

                                  First, always do your storage speed tests in a regular VM. The Dom0 doesn't matter: you won't run your workload in it, so test what's relevant to you, inside a dedicated VM.

                                  Also, it's not only a matter of network speed, but also of latency, DRBD, SSD speed and many other things. Only Optane drives or RAM are relevant to really push XOSTOR, because there aren't a lot of NVMe drives that can sustain heavy writes without slowing down (especially on a 100 GiB file).

                                  But first, start to benchmark with the right fio parameters, and let's see 🙂

                                  • Swen @olivierlambert

                                    @olivierlambert just to be sure: we also used your recommended fio parameters, with the exact same results. We used fio from inside a VM, not from inside dom0. My comment regarding waits inside the VM and no waits in dom0 was just additional information.

                                    I am aware of possible bottlenecks like latency, the SSDs and others, but in our case we can rule them out. The reason is that we double our speed when switching from the 10G to the 40G interface, while the rest of the configuration is exactly the same. As far as I can see, this looks like xcp-ng is the bottleneck and is limiting the bandwidth of the interface in some way. Even the numbers you provided are not really good performance numbers. Did you get more bandwidth than 8 Gbit/s over the linstor interface?

                                    We are going to install Ubuntu on the same servers and install linstor on it, to test our infrastructure bare-metal without any hypervisor and see whether it is xcp-ng related or not.

                                    • olivierlambert Vates 🪐 Co-Founder CEO

                                      Those are really good numbers for a replicated block system, on top of a virtualization solution.

                                      The fact that you are doubling the speed isn't just about bandwidth; it is also likely latency related. XOSTOR works in sync mode, so you have to wait for blocks to be written on the destination before getting the confirmation. You might try bigger blocks to see if you can reach a higher throughput.

                                      Also, remember that if you test in a VM on a single virtual disk, that's absolutely the bottleneck here (tapdisk). There's one process per disk, which is why I advise testing either with multiple VMs at the same time to really push XOSTOR to its limits, or with a big RAID0 made of as many virtual drives as you can (the first option is better, because you can have VMs on multiple hosts at the same time).

                                      In short, the system scales with the number of VMs, not when benchmarking with only one VM and one disk.
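
                                      For the RAID0 variant, a sketch inside the guest, assuming 4 extra virtual disks are attached (device names are examples):

                                      # stripe the 4 virtual disks together, then benchmark the raw md device
                                      mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
                                      fio --name=bench --filename=/dev/md0 --direct=1 --bs=1M --iodepth=128 --numjobs=4 --ioengine=io_uring --rw=write --group_reporting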

                                      Finally, don't forget that thin mode requires running the test at least twice to really see the performance. On your side, you are very likely CPU bound, due to an 8-year-old Intel CPU/architecture that is not that efficient. But on that, I'll be able to provide a real result comparing my test bench on Xeon vs Zen in 2 days.

                                      • gb.123

                                        Is it possible to change the replication factor later, on the fly, e.g. after adding a new host (without losing data)?

                                        • olivierlambert Vates 🪐 Co-Founder CEO

                                          That's a question for @ronan-a when he's back 🙂

                                          But in any case, IIRC, the preferred replication count for now is 2.

                                          • ronan-a Vates 🪐 XCP-ng Team @gb.123

                                            @gb-123 You can use this command:

                                            linstor resource-group modify xcp-sr-linstor_group_thin_device --place-count <NEW_COUNT>
                                            

                                            You can confirm the resource group to use with:

                                            linstor resource-group list
                                            

                                             Ignore the default group named DfltRscGrp and take the second one.

                                            Note: Don't use a replication count greater than 3.
