XOSTOR hyperconvergence preview

abufrejoval

Ah, so diskless nodes aren't supported at Xcp-ng storage API level yet?

Because that was the next thing on my list of things to try and I'm confident enough to do it at the DRBD level (even if the documentation is skimping on examples there). But if it still needs SR integration on the Xen hosts, then I can push that back onto the todo stack.

For background: For Xcp-ng and oVirt I have HCI clusters running permanently on low-power machines. And then I have powerful (noisy and hungry) workstations which I turn off when I'm not running experiments (they also run all kinds of different operating systems).

So these only occasionally connect to the clusters but need access to the HCI storage. That's very natural in GlusterFS and I need something similar in LINSTOR.

ronan-a

@dumarjo FYI this logic has been available for a few days in this commit: https://github.com/xcp-ng/sm/commit/ec3ffffced1bf63fc3a88e0681ecbf7e288828de But it is not merged in the current beta yet.

Regarding the split-brain, it's probably because you use two hosts; the ideal minimum is 3. With a replication count of two, the third node can be used as a diskless quorum “tie breaker” node.
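As a rough illustration of why two hosts are not enough, here is a minimal sketch of plain majority voting. It is not DRBD's actual quorum implementation, and `has_quorum` is a hypothetical helper name; it only shows that a two-node split leaves neither side with a majority, while a third voter (even a diskless one) lets one side keep it.

```python
def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """Strict majority: a partition keeps quorum only if it can see
    more than half of all nodes (itself included)."""
    return 2 * reachable_nodes > total_nodes


# Two hosts, network split: each side only sees itself, so neither side
# has a majority and nothing can safely decide which copy is the good one.
two_node_split = has_quorum(reachable_nodes=1, total_nodes=2)

# Three hosts (two diskful + one diskless tie breaker), one host isolated:
# the pair that still sees each other keeps quorum, the isolated host does not.
majority_side = has_quorum(reachable_nodes=2, total_nodes=3)
isolated_side = has_quorum(reachable_nodes=1, total_nodes=3)

print(two_node_split, majority_side, isolated_side)
```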

Imagine if I have 30 VMs... a lot of resources to be created....

I'm not sure I understand the link with VMs? ^^"

Wescoeur committed to xcp-ng/sm

feat(linstor-manager): add methods to add remove/host from LINSTOR SR

Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.fr>

dumarjo

@ronan-a

Imagine if I have 30 VMs... a lot of resources to be created....

I'm not sure I understand the link with VMs? ^^"

From what I understand, I have to recreate all the resources manually, since each VM disk creates a resource... Maybe I'm wrong again.

I'm very interested in testing this new add_host functionality. The fun part is that I now have a better understanding of what is going on under the hood!

regards,

ronan-a

@abufrejoval Diskless volumes are supported, you can start a VM using this type of volume on a host. A diskless link is created on the fly if necessary.

Also good news: it's now possible to use a host without a physical disk. I just removed a limitation during the SR creation: https://github.com/xcp-ng/sm/commit/536277d4026ac7baeb9890fe65a87feb7d5a4721
We must test this change, but in theory it should not create a problem.

Wescoeur committed to xcp-ng/sm

feat(LinstorVolumeManager): support SR creation with diskless nodes

Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.fr>

ronan-a

@dumarjo I think we can add a script to reconfigure the replication. For example to create a copy of each VDI on a new node, or in a balanced way on a set of new nodes.

Swen

@ronan-a Is it already possible to specify a network adapter for the storage traffic?
I edited /etc/hosts and added each xcp-ng server with the IP of the storage interface. Is this a good workaround?

abufrejoval

@ronan-a

That brings me to the topic of observability:

I can't say I have been entirely happy observing what was going on in Gluster on oVirt, but depending on if you used the chunking mode (or the oVirt storage overlay) vs. the pure file mode, you had a rather granular overview on what was going on, what was good, what needed healing and just how far behind synchronizations might be.

With DRBD I feel like flying blind again, mostly because it's a block not a file layer. From what I've seen in the DRBD and LINSTOR manuals, I'll be able to query replication state and whether or not replicas are in sync. When they are not and offlined, because the (limited?) update queue has overflowed, it seems you may have to re-create the replica. Yet there is also a checksumming mode, which might be able to "resilver" a replica even if the update queue isn't complete. I guess that's where LINBIT wants to sell consulting or support...

So when you suggest control over replication at the VDI level, I wonder how this happens, since without another layer in between, I can only imagine replication control at the SR level using distinct DRBD resources. Perhaps some explanations on how Xcp SRs and DRBD resources and volumes are supposed to correlate would be helpful.

In my edge oriented HCI setups, I'd just be using a triple replica setup, because it's a nice compromise between the write amplification and redundancy. Yeah, having a (pop-up?) arbiter that helps maintain a quorum while you're doing maintenance on one node wouldn't be too bad to have, but I've not been too happy with 2 replica + 1 arbiter Glusters on oVirt: You're really only standing on one leg when doing maintenance or handling faults. I used it on the 2.5Gbit nodes, because write amplification was too expensive there; on the 10Gbit nodes with NVMe I prefer 3 replicas, if only to reduce the risk of making mistakes.

For the additional compute nodes I prefer to go diskless, also because I shut them down to save power when load is low.

But that's the home-lab. For the corporate lab (which is what I am testing it for), there it's more like a dozen machines, some storage heavy (recycled), some compute heavy (GPGPU compute), with both populations changing, sometimes by choice, sometimes because they fail.

Now since erasure coding isn't LINSTOR native, having to use staggered replicas in distinct SRs to manage fault-tolerance/write-amplification/storage-efficiency will quickly become a real burden: I'd love to know how much intelligence you're willing to put into XOA to help manage redistributions (which require observability). At least in theory, Gluster was vastly superior there, not that I've actually tried transforming terabytes of dispersed volumes say from a 6+2 to a 12+3 configuration.

And to be quite honest: I'm still struggling to understand the abstraction topology of DRBD/LINSTOR/Pacemaker and then their new LINBIT VSAN. Everybody is so focused on producing videos or 'getting started' tutorials, they completely forget writing a proper concepts & architecture guide.

ronan-a

@ronan-a Is it already possible to specify a network adapter for the storage traffic?

@Swen It should work, you can test using this doc: https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-managing_network_interface_cards

I edited /etc/hosts and added each xcp-ng server with the IP of the storage interface. Is this a good workaround?

It's not a good idea to modify this file directly. The approach described in the previous link is normally the right way to do it.

So when you suggest control over replication at the VDI level, I wonder how this happens, since without another layer in between, I can only imagine replication control at the SR level using distinct DRBD resources. Perhaps some explanations on how Xcp SRs and DRBD resources and volumes are supposed to correlate would be helpful.

For each VDI, you have a DRBD resource. A VDI is just a link to a DRBD device (/dev/drbd/by-res/XXX), and this volume is replicated on N hosts of your pool. A volume corresponds to exactly one block device. If a host doesn't have a copy of the volume, it will use a diskless DRBD resource and the network to access the data.

In my edge oriented HCI setups, I'd just be using a triple replica setup, because it's a nice compromise between the write amplification and redundancy. Yeah, having a (pop-up?) arbiter that helps maintain a quorum while you're doing maintenance on one node wouldn't be too bad to have, but I've not been too happy with 2 replica + 1 arbiter Glusters on oVirt: You're really only standing on one leg when doing maintenance or handling faults. I used it on the 2.5Gbit nodes, because write amplification was too expensive there; on the 10Gbit nodes with NVMe I prefer 3 replicas, if only to reduce the risk of making mistakes.

With LINSTOR, 3 hosts is the minimal config to limit split-brain (with a diskless volume or not). Also, regarding performance, a replication count of 3 is actually a good compromise between reading and writing.

For the additional compute nodes I prefer to go diskless, also because I shut them down to save power when load is low.

FYI, a third node used with a diskless volume is important to avoid a split-brain; it's a quorum component.

And to be quite honest: I'm still struggling to understand the abstraction topology of DRBD/LINSTOR/Pacemaker and then their new LINBIT VSAN. Everybody is so focused on producing videos or 'getting started' tutorials, they completely forget writing a proper concepts & architecture guide.

Pacemaker is not used in our driver implementation; LINSTOR is a manager on top of DRBD. Yes, the architecture is complex, which is why our goal here is to abstract these layers so that this new LINSTOR SMAPI driver can be used like the existing ones.

Swen

@Swen It should work, you can test using this doc: https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-managing_network_interface_cards

Thank you very much for that link! I edited the setting according to this and now I can see traffic is going in and out via 10G NIC.

It looks like I only get decent bandwidth through it, around 130 MBps (shown in XCP-ng Center). Do you know if there is any limitation on the XCP-ng side which prevents LINSTOR from using more of the interface? I use 3 1TB SSDs on each node, so more should be possible.

ronan-a

@Swen What's your replication count? 3 diskful or 2 + 1 diskless?

Swen

@ronan-a I was running a replication count of 2 with a 3-node cluster, all with disks. You see I wrote "was". I am reinstalling the cluster as I write this, because I got into a state where I was unable to even stop a VM on it.

Do I understand it correctly that I can use a replication count of 2 within a 3-node cluster and my data will be replicated 2 times, so on 2 nodes? Or do I need to use a replication count of 3 on a 3-node cluster to be able to let my VMs run on all nodes and be able to do a live migration to all nodes?

ronan-a

@Swen Do I understand it correctly that I can use a replication count of 2 within a 3-node cluster and my data will be replicated 2 times so on 2 nodes?

Yes. Each VDI will be replicated two times on different nodes.

Or do I need to use a replication count of 3 on a 3-node cluster to be able to let my VMs run on all nodes and be able to do a live migration to all nodes?

You can start a VM on any node with a replication count of 2. In this case, either a diskful or a diskless volume is used. Of course, in the diskless case, performance can be affected by the network.
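A minimal sketch of that behaviour, assuming a 3-host pool with a replication count of 2. The host names and the `access_mode` helper are made up for illustration and are not part of the driver; the actual placement of replicas is decided by LINSTOR.

```python
def access_mode(host: str, replica_hosts: set) -> str:
    """Return how a host would reach a VDI: from its own copy ("diskful")
    or over the network through a diskless DRBD attachment ("diskless")."""
    return "diskful" if host in replica_hosts else "diskless"


pool = ["host1", "host2", "host3"]      # hypothetical 3-node pool
replica_hosts = {"host1", "host2"}      # replication count of 2

# The VM can start anywhere; only the access path differs per host.
modes = {host: access_mode(host, replica_hosts) for host in pool}
print(modes)
```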

Swen

@ronan-a do you know anything about NIC bandwidth limitations of xcp-ng? It looks like I am unable to use the full bandwidth of the 10Gbit connection between the nodes. I get 125 MBps, which is the maximum of a 1Gbit NIC if I calculate correctly.
If I do the test on 2 VMs on the storage, the max bandwidth stays the same.
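For reference, the arithmetic behind that figure, as a small sketch. It assumes decimal units (1 Gbit = 1000 Mbit, 8 bits per byte) and ignores protocol overhead; `line_rate_mb_per_s` is just an illustrative name.

```python
def line_rate_mb_per_s(gbit_per_s: float) -> float:
    """Convert a nominal link speed in Gbit/s to MB/s (decimal units)."""
    return gbit_per_s * 1000 / 8


one_gbit = line_rate_mb_per_s(1)     # 125.0 MB/s
ten_gbit = line_rate_mb_per_s(10)    # 1250.0 MB/s
print(one_gbit, ten_gbit)
```

So a ceiling of about 125 MB/s matches a 1 Gbit/s path rather than a 10 Gbit/s one.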

abufrejoval

@Swen

I've observed a similar issue, when I was testing the driver for the 2.5GBit/s USB3 NIC, while the system was running on a 1Gbit connection normally: somehow iperf3 gave me Gbit results even when I was clearly talking to the IP of the 2.5GBit port, which ethtool confirmed to be running at 2.5Gbit/s speed.

Well except when I took the Gbit interface down to make sure nothing fishy was going on, the "2.5Gbit" connection went down with it.

My explanation is that in fact it was talking to the Gbit port, which is configured as promiscuous by Xcp-ng and 'hijacked' traffic to both IPs, so I didn't really reach the 2.5Gbit port.

I can easily imagine something similar going on in your case.

I haven't had time to test further, but I'm pretty sure you'll have to make the 10Gbit port fully known to Xcp to avoid issues with the promiscuity of the management interface, or perhaps you can try with separate switches (or a cross connect cable) for the 10Gbit part, just to confirm the diagnosis.

Swen

@abufrejoval thx for your feedback. I need to investigate this further. We are already using a different switch for the 10Gbit interfaces with another IP subnet.

Swen

@ronan-a I was unable to find any limitations regarding the bandwidth of an interface. Do you know anything about it?

abufrejoval

@Swen

How do you measure? Do you measure disk I/O, e.g. via Jens Axboe's wonderful fio tool, or do you measure at the network level, e.g. via iperf3 first?

I've gotten around 300MB/s write speeds inside a Windows VM using Crystal Disk Mark with 4-way LINSTOR replication using Xcp-ng running nested under VMware Workstation on Windows (Ryzen 9 5950X 16-core with plenty of RAM all NVMe storage).

Iperf3 between these virtual Xcp-ng hosts will only yield around 5Gbit/s, so 300MB/s is rather better than I'd expect, given that each block is replicated 4 times. Reads on Crystal Disk Mark are better than 1.3GB/s as they don't suffer from write amplification and could actually be done round-robin (and it seems they are, too).
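A back-of-the-envelope version of that expectation, as a sketch. It assumes the writing host keeps one replica locally and pushes every other copy through the same link, with no protocol overhead; the function name and the `local_replicas` parameter are hypothetical, and real DRBD behaviour may differ.

```python
def replicated_write_ceiling_mb_per_s(link_gbit_per_s: float,
                                      replicas: int,
                                      local_replicas: int = 1) -> float:
    """Rough upper bound on write throughput when every remote replica
    has to be sent over one link of the given speed."""
    link_mb_per_s = link_gbit_per_s * 1000 / 8
    remote_copies = replicas - local_replicas
    if remote_copies <= 0:
        return float("inf")  # nothing crosses the network
    return link_mb_per_s / remote_copies


# 5 Gbit/s measured with iperf3, 4-way replication, one copy local:
ceiling = replicated_write_ceiling_mb_per_s(5, replicas=4)
print(round(ceiling, 1))  # about 208 MB/s
```

Under these assumptions the ceiling would be around 208 MB/s, so the measured 300MB/s is indeed higher than this simple model predicts.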

But that's a nested virtualization setup, which is really just meant for functional failure testing, not for meaningful benchmarking.

I haven't gotten around to using LINSTOR yet on my physical NUC8/10/11 cluster using 10Gbit NICs, but they give me close to 10Gbit/s with iperf3, while a Xeon-D 1542 based host only reaches about 5-6Gbit/s with budget Aquantia AQC107 NICs all around, that don't support much in terms of offload capabilities.

On oVirt I used an MTU of 9000 to reach full 10Gbit bandwidth on all machines, but I haven't found any documentation on how to increase the MTU on the physical NICs in Xcp-ng yet.

Swen

@abufrejoval I am using dd on Ubuntu20 VMs on 3 ProLiant D360 servers with SSDs. I mounted 1 SSD directly to XCP-ng and 3 to linstor on each server. When I do a

dd if=/dev/zero of=benchfile bs=4k count=2000000 && sync; rm benchfile

on a VM using local storage I get around 185MB/s

when I do the same on 1 VM on linstor storage I get around 125MB/s

but when I do the test on 2 VM on linstor storage on the same XCP-ng host I get around 60MB/s each.

To me it looks like the NIC is the bottleneck, but please correct me if I am wrong.
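One way to read these numbers, as a hedged sketch: if the combined throughput of concurrent VMs stays close to what a single VM gets, something shared is saturated. The helper name and the 10% tolerance are arbitrary assumptions, and this alone does not prove that the NIC is the culprit.

```python
def looks_like_shared_bottleneck(single_mb_s: float,
                                 concurrent_mb_s: list,
                                 tolerance: float = 0.10) -> bool:
    """True if the summed throughput of concurrent runs stays within
    `tolerance` of the single-run throughput, i.e. it did not scale."""
    total = sum(concurrent_mb_s)
    return abs(total - single_mb_s) <= tolerance * single_mb_s


# Figures from the dd runs above: 125 MB/s alone, about 60 MB/s each with 2 VMs.
linstor_case = looks_like_shared_bottleneck(125, [60, 60])

# Hypothetical counter-example: both VMs keep their full speed.
scaling_case = looks_like_shared_bottleneck(125, [125, 125])

print(linstor_case, scaling_case)
```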

Swen

@ronan-a another thing I found is that LINSTOR occupies more storage than expected. I created the SR with the option 'thin'. I created 2 VMs, each with a 50GB disk. XCP-ng Center is showing me

238.7 GB used of 2.6 TB total (150 GB allocated)

I would not have expected that! I would have expected less than 100 GB used and allocated.

ronan-a

@Swen Could you list the VDIs of your linstor SR please?