XCP-ng
    abufrejoval

    Posts
    • RE: XOSTOR hyperconvergence preview

      @Swen

      There is obviously tons of variations....

      I've used this fio file a lot to quickly gain an understanding of how a bit of storage performs.

      Basically it only uses a small 100MB file, but tells the OS to avoid buffering, and then runs a mix of reads and writes over it while stepping through block sizes, essentially going from very random to almost sequential in a single run.

      It's helped me find issues with Gluster, identify network bandwidth problems, and even spot degraded RAIDs with a bad BBU. It creates the test file in the working directory unless changed.

      [global]
      filename=fio.file
      ioengine=libaio
      rw=randrw
      size=100m
      norandommap
      direct=1
      iodepth=1
      time_based
      runtime=10
      [B512]
      bs=512
      stonewall
      [B1k]
      bs=1k
      stonewall
      [B2k]
      bs=2k
      stonewall
      [b4k]
      bs=4k
      stonewall
      [b8k]
      bs=8k
      stonewall
      [b16k]
      bs=16k
      stonewall
      [b32k]
      bs=32k
      stonewall
      [b64k]
      bs=64k
      stonewall
      [b512k]
      bs=512k
      stonewall
      [b1m]
      bs=1m
      stonewall
      

      Numbers: It should approach the network bandwidth towards the end (potentially divided by write amplification).
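      To make the runs comparable, I usually capture fio's JSON output; a sketch, assuming the job file above is saved as bs-sweep.fio (my name for it) and jq is installed:

```shell
# Run the block-size sweep with machine-readable output.
fio bs-sweep.fio --output-format=json --output=bs-sweep.json

# Print aggregate read/write bandwidth (KiB/s) per block-size job.
jq -r '.jobs[] | "\(.jobname): read \(.read.bw) KiB/s, write \(.write.bw) KiB/s"' bs-sweep.json
```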

      posted in XOSTOR
      abufrejoval
    • RE: XOSTOR hyperconvergence preview

      @Swen
      Writing zeros should result in nothing written with thin allocation (or dedup and compression): that's why I am hesitant to use /dev/zero as a source.

      Of course /dev/random could require too much overhead, depending on its quality and implementation, which is why I like to use fio: a bit of initial effort to learn and understand the tool, but much better control, especially when it comes to dealing with an OS that tries to be smart.

      posted in XOSTOR
    • RE: Clonezilla not recognising network adaptor

      @fred974
      Nice to hear!

      Yes, not having a template or using a wrong one seems to be an issue for Xcp-ng, when really a template should be nothing but an easier way to set presets.

      I guess one of these days I'll simply have to investigate what's in them and how difficult it would be to create your own.

      Clonezilla has failed me on machines with more complicated LVM and thin-allocation setups, e.g. when trying to virtualize physical machines. But when it comes to cloning VMs between hypervisors, it pays off that I tend to keep these simple and have the hypervisor deal with sparsity and overcommit.

      Clonezilla certainly has helped me out a lot of times and I'm really glad they maintain it.

      posted in Compute
    • RE: Clonezilla not recognising network adaptor

      @fred974

      On the virtual machine's advanced attributes tab you can choose between a RealTek RTL8139 and an Intel e1000 NIC (both virtual devices). It defaults to the RealTek, and the suggestion is to change it to the Intel one and retry booting the VM.

      The NIC on the host (X520) does not matter at all for this operation.

      It's been a while, but I can confirm that moving images between (in my case) oVirt and Xcp-ng VMs via Clonezilla works just fine.

      Also did it between VMware and Xcp-ng btw.

      posted in Compute
    • RE: XOSTOR hyperconvergence preview

      @Swen

      How do you measure? Do you measure disk I/O, e.g. via Jens Axboe's wonderful fio tool, or at the network level first, e.g. via iperf3?

      I've gotten around 300MB/s write speeds inside a Windows VM using Crystal Disk Mark with 4-way LINSTOR replication, with Xcp-ng running nested under VMware Workstation on Windows (Ryzen 9 5950X 16-core with plenty of RAM and all-NVMe storage).

      Iperf3 between these virtual Xcp-ng hosts only yields around 5Gbit/s, so 300MB/s is rather better than I'd expect, given that each block is replicated 4 times. Reads in Crystal Disk Mark are better than 1.3GB/s, as they don't suffer from write amplification and could actually be done round-robin (and it seems they are, too).
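      As a quick sanity check on those numbers (my assumption: with 4-way replication every write has to reach 3 remote peers over the wire, while reads can be served locally or round-robin):

```shell
# Naive write ceiling on a 5 Gbit/s link with 3 remote copies per write:
# 5000 Mbit/s / 8 = 625 MB/s raw, divided by the 3x network write amplification.
awk 'BEGIN {
    link_mbps = 5000      # iperf3 result between the virtual hosts
    remote_copies = 3     # 4-way replication = 1 local + 3 remote
    printf "write ceiling ~ %.0f MB/s\n", link_mbps / 8 / remote_copies
}'
# prints: write ceiling ~ 208 MB/s
```

      So the measured 300MB/s sits above that naive ceiling, which suggests the copies to the peers go out in parallel rather than strictly in series.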

      But that's a nested virtualization setup, which is really just meant for functional failure testing, not for meaningful benchmarking.

      I haven't gotten around to using LINSTOR yet on my physical NUC8/10/11 cluster with 10Gbit NICs, but those machines give me close to 10Gbit/s with iperf3, while a Xeon-D 1542 based host only reaches about 5-6Gbit/s, with budget Aquantia AQC107 NICs all around that don't support much in terms of offload capabilities.

      On oVirt I used an MTU of 9000 to reach full 10Gbit bandwidth on all machines, but I haven't found any documentation on how to increase the MTU on the physical NICs in Xcp-ng yet.
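      For what it's worth, here is how I'd expect it to work on Xcp-ng based on the XAPI model (untested by me, and the UUIDs are placeholders): the MTU is a property of the network object, and the PIFs pick it up on replug:

```shell
# Set jumbo frames on the network object carrying the storage traffic...
xe network-param-set uuid=<network-uuid> MTU=9000

# ...then replug the physical interfaces on each host so they pick it up.
xe pif-unplug uuid=<pif-uuid>
xe pif-plug uuid=<pif-uuid>
```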

      posted in XOSTOR
    • RE: XOSTOR hyperconvergence preview

      @Swen

      I've observed a similar issue when I was testing the driver for the 2.5Gbit/s USB3 NIC while the system was normally running on a 1Gbit connection: somehow iperf3 gave me Gbit results even when I was clearly talking to the IP of the 2.5Gbit port, which ethtool confirmed to be running at 2.5Gbit/s.

      Except that when I took the Gbit interface down, to make sure nothing fishy was going on, the "2.5Gbit" connection went down with it.

      My explanation is that it was in fact talking to the Gbit port, which Xcp-ng configures as promiscuous and which 'hijacked' traffic to both IPs, so I never really reached the 2.5Gbit port.

      I can easily imagine something similar going on in your case.

      I haven't had time to test further, but I'm pretty sure you'll have to make the 10Gbit port fully known to Xcp to avoid issues with the promiscuity of the management interface. Or perhaps you can try with separate switches (or a cross-connect cable) for the 10Gbit part, just to confirm the diagnosis.

      posted in XOSTOR
    • RE: XOSTOR hyperconvergence preview

      @ronan-a

      That brings me to the topic of observability:

      I can't say I have been entirely happy observing what was going on in Gluster on oVirt, but depending on whether you used the chunking mode (or the oVirt storage overlay) vs. the pure file mode, you had a rather granular overview of what was going on, what was good, what needed healing, and just how far behind synchronizations might be.

      With DRBD I feel like I'm flying blind again, mostly because it's a block layer, not a file layer. From what I've seen in the DRBD and LINSTOR manuals, I'll be able to query replication state and whether or not replicas are in sync. When they are not, and have been offlined because the (limited?) update queue has overflowed, it seems you may have to re-create the replica. Yet there is also a checksumming mode, which might be able to "resilver" a replica even if the update queue isn't complete. I guess that's where LINBIT wants to sell consulting or support...
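      For the record, these are the observability entry points I've found so far (command names from the DRBD 9 / LINSTOR manuals; exact output varies by version, so treat this as a sketch):

```shell
# Per-resource role, disk and connection state, including all peers.
drbdadm status

# The same, with transfer and out-of-sync statistics.
drbdsetup status --verbose --statistics

# LINSTOR's own view of resources and where they are placed.
linstor resource list
```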

      So when you suggest control over replication at the VDI level, I wonder how this happens, since without another layer in between, I can only imagine replication control at the SR level using distinct DRBD resources. Perhaps some explanations on how Xcp SRs and DRBD resources and volumes are supposed to correlate would be helpful.

      In my edge-oriented HCI setups, I'd just be using a triple-replica setup, because it's a nice compromise between write amplification and redundancy. Yes, having a (pop-up?) arbiter that helps maintain a quorum while you're doing maintenance on one node wouldn't be too bad to have, but I've not been too happy with 2-replica + 1-arbiter Glusters on oVirt: you're really only standing on one leg when doing maintenance or handling faults. I used that on the 2.5Gbit nodes, where write amplification was too expensive; on the 10Gbit nodes with NVMe I prefer 3 replicas, if only to reduce the chance of making mistakes.

      For the additional compute nodes I prefer to go diskless, also because I shut them down to save power when load is low.

      But that's the home-lab. For the corporate lab (which is what I am testing it for), there it's more like a dozen machines, some storage heavy (recycled), some compute heavy (GPGPU compute), with both populations changing, sometimes by choice, sometimes because they fail.

      Now since erasure coding isn't LINSTOR native, having to use staggered replicas in distinct SRs to manage fault-tolerance/write-amplification/storage-efficiency will quickly become a real burden: I'd love to know how much intelligence you're willing to put into XOA to help manage redistributions (which require observability). At least in theory, Gluster was vastly superior there, not that I've actually tried transforming terabytes of dispersed volumes say from a 6+2 to a 12+3 configuration.

      And to be quite honest: I'm still struggling to understand the abstraction topology of DRBD/LINSTOR/Pacemaker and then their new LINBIT VSAN. Everybody is so focused on producing videos or 'getting started' tutorials, they completely forget to write a proper concepts & architecture guide.

      posted in XOSTOR
    • RE: XOSTOR hyperconvergence preview

      @ronan-a

      Ah, so diskless nodes aren't supported at Xcp-ng storage API level yet?

      Because that was the next thing on my list of things to try, and I'm confident enough to do it at the DRBD level (even if the documentation is skimping on examples there). But if it still needs SR integration on the Xen hosts, then I can push that back onto the todo stack.

      For background: for Xcp-ng and oVirt I have HCI clusters running permanently on low-power machines. And then I have powerful (noisy and hungry) workstations which I turn off when I'm not running experiments (they also run all kinds of different operating systems).

      So these only occasionally connect to the clusters but need access to the HCI storage. That's very natural in GlusterFS and I need something similar in LINSTOR.

      posted in XOSTOR
    • RE: Help building kernel & kernel-alt, please

      @Andrew

      Thanks for your response!

      In the meantime I've finally found the hints on how to work around the USB NIC renaming issues, both on the forum (even directed at me, but somehow not read) and by Eric Eikrem on his site, so I'll try that next to make the r8156 2.5Gbit USB3 NICs work (I've got lots of those) on the Atom boxes.

      I'm not touching the NUCs (for igb/e1000 testing) at the moment, because I need them very stable to play with LINSTOR without a VGA console.

      Just to illustrate: for weeks my NUC10 would disappear from the network after a couple of days without issues, and even though it was still visibly running (normal HDD LED activity), nothing but a hard reset would bring it back online. I just couldn't understand what was going on, or whether it was some type of hardware issue with the box (just out of warranty).

      In the end it was one of the myriad of BIOS settings, possibly 'modern standby' or ASPM, which had been reactivated by a firmware update and caused these problems days later.

      posted in Development
    • RE: Help building kernel & kernel-alt, please

      @Andrew
      I'm messing around with all sorts of things:

      • Loss of video output on Gen10/11 iGPUs and Ryzen 3 (Cezanne) iGPU during the Xen-Dom0 handover (may be a grub issue with the UEFI frame buffer driver). It's on hold, for lack of time and because Xen doesn't make hacking boot stuff any easier and it would take me weeks I do not have to get deep enough. Also, apart from the missing console, the machines work just fine, after I transplant the installed system from the NUC8 to the other targets.

      • Lack of IOMMU support on my Ryzen 9 5950X with an Nvidia RTX 2080ti. It's been judged a BIOS issue here, but it works just fine with KVM and VMware. I've been going through the code, but unless I get kernel debugging going during the boot phase with a serial console, there is little chance of tracking down what's going on. From the logs alone, the code simply can't find the IOMMU device, or rather the data structures that describe it, but it's a ton of barely readable, deeply cascaded spaghetti of #define 'function calls'... written by an AMD guy, so XenSource will most likely point fingers rather than invest in a workaround. The funny thing is that this very system had been the first to run Xcp-ng in my lab, using a nested setup with VMware Workstation on Windows 2019 as a base....

      • Support for RealTek r8156 USB3 2.5 Gbit Ethernet adapters: I use those on Pentium Silver J5005 based passive mini-ITX machines with oVirt and want to transition them to something still alive. I got a version of the driver that compiles and works just fine, but Xensource uses all kinds of tricks to rename NICs to be consistent across a pool that might have widely different (and in the case of USB NICs, dynamic) device names assigned to NICs. Currently the Citrix code at the base of interface-rename cannot deal with NICs that aren't connected (directly) to the PCI bus. It doesn't look for USB devices and thus the bridge creation and the overlay network stuff just fails to use the device. I guess the only sensible thing to do is to open a ticket at XenSource and see if Andrew Cooper, who seems to have written all the xcp Python bindings, will incorporate USB NICs ...which I doubt, given the giant amount of extra support trouble hotplugging NICs might bring about, when there is no revenue in this space. Would be an interesting test of XenSource collaboration dynamics to see if an xcp-ng based addition would be accepted upstream 😉

      If you're confident hacking XenSource Python2 library scripts, have a look at /lib/python2.7/site-packages/xcp/pci.py class PCIDevices (line 259), where it's using lspci -mn to find NICs.
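      To illustrate the gap: a NIC's sysfs device link already tells you whether it hangs off USB or PCI, which is exactly what lspci -mn cannot see. A rough sketch (the helper name and sample paths are mine, not from the Citrix code):

```shell
# Classify a NIC by its resolved sysfs device path: USB-attached
# adapters have "/usb" in the path, PCI-attached ones do not.
is_usb_nic() {
    case "$1" in
        */usb*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Walk the real NICs; entries without a device link (bridges, vifs,
# loopback) are skipped.
for dev in /sys/class/net/*; do
    [ -e "$dev/device" ] || continue
    path=$(readlink -f "$dev/device")
    if is_usb_nic "$path"; then kind=USB; else kind=PCI; fi
    printf '%s: %s (%s)\n' "${dev##*/}" "$kind" "$path"
done
```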

      I've also been looking a bit at support for the newer Intel NICs built into my NUC10 and NUC11 devices, which aren't supported by the 4.19 kernel's e1000e driver. Again, it's not a priority for me, because I am using TB3-connected Aquantia 10GBase-T NICs for these faster NUCs with NVMe storage; they are just the better match and literally zero trouble, if you disable the onboard NICs before installing.

      The technical evolution in the mobile/desktop based edge appliance space currently is at a pace that completely overwhelms the XenSource roadmap, even Linux itself in many ways, because only Windows support sells that hardware. It's a bit of a nasty turn on NUCs, which for many years were a nicely conservative platform with great Linux support and plenty of efficient power for the home lab.

      posted in Development
    • RE: PCI Passthrough with both GPU and USB

      I have used an Nvidia GTX 1080ti for pass-through testing, which also combines the GPU and a USB-C controller. It works just fine for CUDA stuff; (remote) gaming via Steam on Windows isn't great, but it proves pass-through is working.

      The only issue I had was that I had to make sure I was putting both devices in a single line for the boot flags, the Dom0 delete and the DomU add (xe vm-param-set), and not add them line after line, as evidently these statements are not cumulative.... I wound up with only the USB controller visible inside the VM until I re-read the documentation.
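      In command form, what worked for me looks roughly like this (the BDFs and the UUID are placeholders for my card's GPU and USB-C functions; adapt them to your lspci output):

```shell
# Hide both functions from Dom0 in one go (takes effect after a reboot).
/opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:03:00.0)(0000:03:00.1)"

# Hand both functions to the VM in a single call; running vm-param-set
# again with one device replaces the list instead of appending to it.
xe vm-param-set uuid=<vm-uuid> other-config:pci=0/0000:03:00.0,0/0000:03:00.1
```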

      I also had nothing connected to the USB controller at the time, which might easily be an issue otherwise.

      Did you check that the pass-through devices were truly gone from Dom0 before you tried adding them to the VM?

      Then it's evidently 2 GPUs with a USB-C controller each: were all 4 devices passed through, or was it perhaps a mix across the boards?

      posted in Compute
    • RE: PCI Passthrough

      @Kaiz

      I am very much guessing here, but: I don't think you're looking at a single device in this case. Rather, there must be a bridge chip of some kind involved, which translates between PCI and PCIe and also allows more than one PCI device to be connected to the PCIe bus.

      And you'd have to

      • remove all devices from the Dom0 (host)
      • add all devices to the DomU (guest) VM (as a list in a single command)

      lshw may help you figure out the bus topology and device IDs, although none of the Linux tools seem to come close to how HWinfo displays things on Windows. It should definitely help you see things more clearly on the Windows VM, once devices arrive there.

      If you have any other device, say a spare NIC, to test on first, you may be able to save yourself from pulling your hair out.

      posted in Compute
    • RE: Help building kernel & kernel-alt, please

      @stormi

      Thanks Stormi, that's true and it does seem to include the patched source...

      So I wonder (just a bit): What's the difference between the two?

      But I suspect it's simply some rpmbuild magic...

      posted in Development
    • RE: Help building kernel & kernel-alt, please

      @olivierlambert

      No matter if I am building xen, xapi or the kernels, there is one issue for which I'd like some help:

      Just before the build finishes, all the sources used for the build get deleted, even when using the --no-exit option to stay in the container.

      E.g. the final lines of the kernel build are like this:

      Processing files: python2-perf-alt-4.19.227-1.xcpng8.2.x86_64
      Executing(%license): /bin/sh -e /var/tmp/rpm-tmp.kWRZph
      + umask 022
      + cd /home/builder/rpmbuild/BUILD
      + cd kernel-4.19.19
      + LICENSEDIR=/home/builder/rpmbuild/BUILDROOT/kernel-alt-4.19.227-1.xcpng8.2.x86_64/usr/share/licenses/python2-perf-alt-4.19.227
      + export LICENSEDIR
      + /usr/bin/mkdir -p /home/builder/rpmbuild/BUILDROOT/kernel-alt-4.19.227-1.xcpng8.2.x86_64/usr/share/licenses/python2-perf-alt-4.19.227
      + cp -pr COPYING /home/builder/rpmbuild/BUILDROOT/kernel-alt-4.19.227-1.xcpng8.2.x86_64/usr/share/licenses/python2-perf-alt-4.19.227
      + exit 0
      Provides: gitsha(ssh://git@code.citrite.net/XS/linux.pg.git) = cb3c28f7e8213ef44e5c06369b577a18b86af291 gitsha(ssh://git@code.citrite.net/XSU/linux-stable.git) = dffbba4348e9686d6bf42d54eb0f2cd1c4fb3520 python2-perf-alt python2-perf-alt = 4.19.227-1.xcpng8.2 python2-perf-alt(x86-64) = 4.19.227-1.xcpng8.2
      Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1
      Requires: libc.so.6()(64bit) libc.so.6(GLIBC_2.14)(64bit) libc.so.6(GLIBC_2.2.5)(64bit) libc.so.6(GLIBC_2.3)(64bit) libc.so.6(GLIBC_2.3.4)(64bit) libc.so.6(GLIBC_2.4)(64bit) libc.so.6(GLIBC_2.7)(64bit) libc.so.6(GLIBC_2.8)(64bit) libpthread.so.0()(64bit) libpthread.so.0(GLIBC_2.2.5)(64bit) libpython2.7.so.1.0()(64bit) python(abi) = 2.7 rtld(GNU_HASH)
      Conflicts: python2-perf
      Processing files: kernel-alt-debuginfo-4.19.227-1.xcpng8.2.x86_64
      Provides: kernel-alt-debuginfo = 4.19.227-1.xcpng8.2 kernel-alt-debuginfo(x86-64) = 4.19.227-1.xcpng8.2
      Requires(rpmlib): rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1 rpmlib(CompressedFileNames) <= 3.0.4-1
      Checking for unpackaged file(s): /usr/lib/rpm/check-files /home/builder/rpmbuild/BUILDROOT/kernel-alt-4.19.227-1.xcpng8.2.x86_64
      Wrote: /home/builder/rpmbuild/SRPMS/kernel-alt-4.19.227-1.xcpng8.2.src.rpm
      Wrote: /home/builder/rpmbuild/RPMS/x86_64/kernel-alt-4.19.227-1.xcpng8.2.x86_64.rpm
      Wrote: /home/builder/rpmbuild/RPMS/x86_64/kernel-alt-headers-4.19.227-1.xcpng8.2.x86_64.rpm
      Wrote: /home/builder/rpmbuild/RPMS/x86_64/kernel-alt-devel-4.19.227-1.xcpng8.2.x86_64.rpm
      Wrote: /home/builder/rpmbuild/RPMS/x86_64/perf-alt-4.19.227-1.xcpng8.2.x86_64.rpm
      Wrote: /home/builder/rpmbuild/RPMS/x86_64/python2-perf-alt-4.19.227-1.xcpng8.2.x86_64.rpm
      Wrote: /home/builder/rpmbuild/RPMS/x86_64/kernel-alt-debuginfo-4.19.227-1.xcpng8.2.x86_64.rpm
      Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.iAlDiX
      + umask 022
      + cd /home/builder/rpmbuild/BUILD
      + cd kernel-4.19.19
      + /usr/bin/rm -rf /home/builder/rpmbuild/BUILDROOT/kernel-alt-4.19.227-1.xcpng8.2.x86_64
      + exit 0
      
      

      and my problem is with the %clean section, which removes the patched source code that I'd really love to read, because it's not available as plain source in any repository, only as a mix of upstream source repos and Vates patch files.

      I've been trying to find out how to avoid the %clean section being executed as part of rpmbuild, but I've failed to find where this final /usr/bin/rm -rf [...] comes from or how to suppress it.

      posted in Development
    • RE: Hiding hypervisor from guest to prevent Nvidia Code 43

      @TheFrisianClause

      From my experiments using a GTX1080ti you'll have to follow the instructions for generic pcie-passthrough to the letter, which mostly means that the passthrough needs to be done on both sides, on the Dom0 for relinquishing device control and for the DomU to pick up the device (section 5). Perhaps now that the restrictions from Nvidia have gone, Vates will kindly include some GUI support for those operations in XOA.

      Note that if your dGPU exposes multiple devices (e.g. my GTX 1080ti also has a USB-C controller on board), both entries need to be added in a single 'xe vm-param-set' statement, otherwise only the latter device (USB in my case) will wind up in the VM.... (yeah, at least 30 minutes of puzzling on that one)

      Of course, if the dGPU is your console device, it means flying blind afterwards, but I'm getting used to that with all the recent iGPUs as well (then again, I have some DisplayLink hardware that's currently unused and EL7/8 drivers for those have popped up recently...)

      Thankfully the dreaded error 43 issues have gone away with the more recent Nvidia drivers; sadly Kepler support has been retired (I've got a lot of those still around), so you may want to preserve the CUDA 11.4 release as the last one that supports both. For Maxwell you should still be fine.

      Before trying to diagnose with the Nvidia drivers, you should be able to see the device transition via lspci on both sides, Dom0 and DomU.

      posted in Development
    • RE: Help building kernel & kernel-alt, please

      @olivierlambert
      Merci Olivier (& Stormi), I found the issue was sitting in front of the computer (again): I had forgotten to clone the repos before starting run.py.

      Seems to be working just now, at least it's compiling a ton of stuff....

      posted in Development
    • Help building kernel & kernel-alt, please

      I want to try my hand at building kernels with updated drivers, to resolve issues around hardware that's too modern for the 4.19 LTS kernel.

      I've been using the Docker scripts from the xcp-ng-build-env repo to build both the Xen kernel and the xapi parts, but simply using the analogous run.py command with "kernel" or "kernel-alt" won't work for those parts.

      I've been trying to find some instructions on how to rebuild the kernel, but somehow they have eluded me...

      Could someone please point me in the proper direction?

      posted in Development
    • RE: Sugestion for better GPU assignment to vms

      So I got a GTX 1080ti to be passed through to a Windows 10 VM with CUDA 11.6 working on a Haswell Xeon E5-2696 v3 workstation.

      I had to do it the very manual way, as described here; the nice GUI in XOA never worked for me (no GPU offered for pass-through). It's one of the few areas where oVirt is actually better so far.

      Within that document it states "WARNING Due to a proprietary piece of code in XenServer, XCP-ng doesn't have (yet) support for NVIDIA vGPUs" and I have some inkling as to what that refers to and I don't know if the situation is much better for AMD or Intel GPUs.

      For CUDA compute workloads, there is essentially no restriction any longer: Nvidia stopped boycotting the use of "consumer" GPUs inside VMs, and the famous "error 43" seems a thing of the past.

      For gaming or 3D support inside apps, you need a solution somewhat similar to how dGPUs operate in many notebooks: there the Nvidia dGPU output isn't really connected to any physical output port; instead the iGPU mirrors the dGPU output via a screen-copy operation. That seems to be both cheaper and more energy efficient than physically switching the display output via extra hardware, and it allows notebook dGPUs to be put completely to sleep on 2D-only workloads.

      On Linux there is a workaround using VirtualGL/TigerVNC which works quite OK with native Linux games or Steam-Proton; on Windows I've tried using Steam Remote Play. The problem is that it uses the XOA console to determine the screen resolution, which limits it to 1024x768.

      In short: the type of dGPU partitioning support that VDI (virtual desktop infrastructure) oriented Quadro GPUs offer requires not just the hardware (which would seem to be in all consumer and Tesla chips, and not even fused off), but also the proper drivers, and this functionality (as opposed to CUDA support) hasn't been made widely available by Nvidia, probably because it's an extremely high-priced secure CAD market.

      Others who might use this stuff are cloud gaming services, and who knows what Nvidia might make possible outside contracts, but I'd be rather sceptical you'd get much gaming use out of your K80. Which is quite a shame these days, because in terms of hardware it's quite an FP32 beast and at least not artificially rate-limited at FP64.

      posted in Development
    • RE: Minisforum HM80 (Ryzen 7 4800u)

      @Greg_E

      I've used fully passive Atom Mini-ITX systems from ASRock with oVirt for the last few years. Contrary to Intel's documentation they can be upgraded to 32GB of RAM, and I've also added 2.5Gbit RealTek USB3 NICs for a bit better Gluster throughput.

      I currently have one of them running Xcp-ng, unfortunately without support for the 2.5Gbit adapter so far, but the console is working.

      The RealTek driver source code is an #ifdef nightmare: it won't compile cleanly on CentOS 7, and I had to patch an older variant by hand to make it work there. Support on CentOS 8 was also a bit shoddy (endless diagnostic messages with the built-in driver), but functional. I abandoned that when the CentOS EOL got announced.

      Unfortunately Xcp-ng 8.2.1 loses the display on my Ryzen 5800U based notebook, just like on all other more recent hardware I've tried, so I'm a bit surprised the 4700U works when it's essentially the very same iGPU: does the Minisforum support legacy boot, and did you use that?

      Because I'm under the impression that the display loss is related to UEFI issues.

      posted in Hardware
    • RE: Sugestion for better GPU assignment to vms

      @brodiecyber

      So far I am using K80, P100 and V100 GPUs only on oVirt.

      Those cards actually have no error 43 issues with Nvidia drivers, because they are not "consumer" cards and have always been permitted to run inside VMs via pass-through. I believe Nvidia has also relaxed the rules in more recent drivers, so even GTX/RTX cards might work these days (I haven't yet tried this again). On KVM I used to need special clauses in the config file to trick the Nvidia driver into not throwing error 43.

      One thing to note on the K80 is that it's been deprecated recently and anything newer than CUDA 11.4 (and the matching driver) won't support it any more: I found out the hard way just doing a "yum update" on a CentOS7 base.

      Since the K80 and Tesla type GPUs don't have a video port, you'll have to do remote gaming, which can be quite a complex beast. But for a demo I used the Superposition benchmark from Unigine (on Linux) with VirtualGL to run it on a V100 in a DC half-way across the country from an iGPU-only ultrabook in game mode: most colleagues unfortunately didn't even understand what was going on there, and how it was demoing both the feasibility and the limitations of cloud gaming (the graphics performance was impressive, but the latencies still terrible).

      posted in Development