abufrejoval

abufrejoval

That brings me to the topic of observability:

I can't say I have been entirely happy observing what was going on in Gluster on oVirt, but depending on if you used the chunking mode (or the oVirt storage overlay) vs. the pure file mode, you had a rather granular overview on what was going on, what was good, what needed healing and just how far behind synchronizations might be.

With DRBD I feel like flying blind again, mostly because it's a block not a file layer. From what I've seen in the DRBD and LINSTOR manuals, I'll be able to query replication state and whether or not replicas are in sync. When they are not and offlined, because the (limited?) update queue has overflowed, it seems you may have to re-create the replica. Yet there is also a checksumming mode, which might be able to "resilver" a replica even if the update queue isn't complete. I guess that's where LINBIT wants to sell consulting or support...

So when you suggest control over replication at the VDI level, I wonder how this happens, since without another layer in between, I can only imagine replication control at the SR level using distinct DRBD resources. Perhaps some explanations on how Xcp SRs and DRBD resources and volumes are supposed to correlate would be helpful.

In my edge oriented HCI setups, I'd just be using a triple replica setup, because it's a nice compromise between the write amplification and redundancy. Yeah, having a (pop-up?) arbiter that helps maintain a quorum while you're doing maintenance on one node, wouldn't be too bad to have, but I've not been too happy with 2 replica + 1 arbiter Glusters on oVirt: You're really only standing on one leg when doing maintenance or handling faults. I used it on the 2.5Gbit nodes, because write amplification was too expensive on the 10Gbit nodes with NVMe I prefer 3 replicas, if only to reduce making mistakes.

For the additional compute nodes I prefer to go diskless, also because I shut them down to save power when load is low.

But that's the home-lab. For the corporate lab (which is what I am testing it for), there it's more like a dozen machines, some storage heavy (recycled), some compute heavy (GPGPU compute), with both populations changing, sometimes by choice, sometimes because they fail.

Now since erasure coding isn't LINSTOR native, having to use staggered replicas in distinct SRs to manage fault-tolerance/write-amplification/storage-efficiency will quickly become a real burden: I'd love to know how much intelligence you're willing to put into XOA to help manage redistributions (which require observability). At least in theory, Gluster was vastly superior there, not that I've actually tried transforming terabytes of dispersed volumes say from a 6+2 to a 12+3 configuration.

And to be quite honest: I'm still struggling to understand the abstraction topology of DBRD/LINSTOR/Pacemaker and then their new LINBIT VSAN. Everbody is so focused on producing videos or 'getting started' tutorials, they completely forget writing a proper concept's & architecture guide.

abufrejoval

@olivierlambert
Merci Oliver (& Stormi), I found the issue was in front of the computer (again): I had missed to clone the repos before starting the run.py.

Seems to be working just now, at least it's compiling a ton of stuff....

abufrejoval

@fred974

On the virtual machine advanced attributes tab you can chose between a RealTek RTL8139 and an Intel e1000 (both virtual devices) NIC. It defaults to RealTek and the suggestion is to change it to the Intel one and retry booting the VM.

The NIC on the host (X520) does not matter at all for this operation.

It's been awhile, but I can confirm that moving images between (in my case) oVirt and Xcp-ng VMs via Clonezilla works just fine.

Also did it between VMware and Xcp-ng btw.

abufrejoval

@Swen

There is obviously tons of variations....

I've used this fio file a lot to quickly gain an understanding of how a bit of storage performs.

Basically it only uses a small 100MB file, but tells the OS to avoid buffering and then goes over that with a mix of reads and writes, mostly transitioning between block size, essentially going from super random to almost sequential in a single run.

It's helped me find issues with Gluster, identify network bandwidth issues or even find deteriorated RAIDs with a bad BBU. Creates the test file in the working directiory unless changed.

[global]
filename=fio.file
ioengine=libaio
rw=randrw
size=100m
norandommap
direct=1
iodepth=1
time_based
runtime=10
[B512]
bs=512
stonewall
[B1k]
bs=1k
stonewall
[B2k]
bs=2k
stonewall
[b4k]
bs=4k
stonewall
[b8k]
bs=8k
stonewall
[b16k]
bs=16k
stonewall
[b32k]
bs=32k
[b64k]
bs=64k
stonewall
[b512k]
bs=512k
stonewall
[b1m]
bs=1m
stonewall

Numbers: It should approach the network bandwidth towards the end (potentially divided by write amplification).

abufrejoval

@Swen
Writing zeros should result in nothing written with thin allocation (or dedup and compression): that's why I am hesitant to use /dev/zero as a source.

Of course /dev/random could require to much of an overhead, depending on the quality and implementation which is why I like to use fio: a bit of initial effort to know and understand the tool, but much better control, especially when it comes to dealing with an OS that tries to be smart.

abufrejoval

@fred974
Nice to hear!

Yes, not having a template or using a wrong one seems to be an issue for Xcp-ng, when really it should be nothing but an easier way to set presets.

I guess one of these days I'll simply have to investigate what's in them and how difficult it would be to create your own.

Clonezilla has failed me on machines with the more complicated LVM and thin-allocation setups e.g. when trying to virtualize physical machines. But when it comes to cloning VMs between hypervisors it pays off that I tend to keep these simply and have the hypervisor deal with sparsity and overcommit.

Clonzilla certainly has helped me out a lot of times and I'm really glad they maintain it.

abufrejoval

@fred974

On the virtual machine advanced attributes tab you can chose between a RealTek RTL8139 and an Intel e1000 (both virtual devices) NIC. It defaults to RealTek and the suggestion is to change it to the Intel one and retry booting the VM.

The NIC on the host (X520) does not matter at all for this operation.

It's been awhile, but I can confirm that moving images between (in my case) oVirt and Xcp-ng VMs via Clonezilla works just fine.

Also did it between VMware and Xcp-ng btw.

abufrejoval

@Swen

How do you measure? Do you measure disk I/O e.g. via Jens Axboe's wounderful fio tool or do you measure at the network level e.g. via iperf3first?

I've gotten around 300MB/s write speeds inside a Windows VM using Crystal Disk Mark with 4-way LINSTOR replication using Xcp-ng running nested under VMware Workstation on Windows (Ryzen 9 5950X 16-core with plenty of RAM all NVMe storage).

Iperf3 between these virtual Xcp-ng hosts will only yield around 5Gbit/s, so 300MB/s is rather better than I'd expect, given that each block is replicated 4 times. Reads on Crystal Disk Mark are better than 1.3GB/s as they don't suffer from write amplification and could actually be done round-robin (and it seems they are, too).

But that's a nested virtualization setup, which is really just meant for functional failure testing, not for meaningful benchmarking.

I haven't gotten around to using LINSTOR yet on my physical NUC8/10/11 cluster using 10Gbit NICs, but they give me close to 10Gbit/s with iperf3, while a Xeon-D 1542 based host only reaches about 5-6Gbit/s with budget Aquantia ACC107 NICs all around, that don't support much in terms of offload capabilities.

On oVirt I used an MTU of 9000 to reach full 10Gbit bandwidth on all machines, but I haven't found any documentation on how to increase the MTU on the physical NICs in Xcp-ng yet.

abufrejoval

@Swen

I've observed a similar issue, when I was testing the driver for the 2.5GBit/s USB3 NIC, while the system was running on a 1Gbit connection normally: somehow iperf3 gave me Gbit results even when I was clearly talking to the IP of the 2.5GBit port, which ethtool confirmed to be running at 2.5Gbit/s speed.

Well except when I took the Gbit interface down to make sure nothing fishy was going on, the "2.5Gbit" connection went down with it.

My explanation is that in fact it was talking to the Gbit port, which is configured as promiscous by Xcp-ng and 'hijacked' traffic to both IPs, so I didn't really reach the 2.5Gbit port.

I can easily imagine something similar going on in your case.

I haven't had time to test further, but I'm pretty sure you'll have to make the 10Gbit port fully known to Xcp to avoid issues with the promiscuity of the management interface or perhaps you can try with separated switches (or a cross connect cable) for the 10Gbit part, just to confirm the diagnose.

abufrejoval

@ronan-a