    CEPHFS - why not CEPH RBD as SR?

• petr.bena @olivierlambert

@olivierlambert

I noticed that RBDSR seems to be written for SMAPIv3. Is SMAPIv3 already available in XCP-ng 8.2.1? Would that mean that https://github.com/rposudnevskiy/RBDSR is compatible with XCP-ng 8.2.1? Its readme suggests it was made for 7.5, but it's hard to tell.

• petr.bena

I will experiment with it in my lab, but I was hoping I could stick with the shared LVM concept, because it's really easy to manage and work with.

• olivierlambert (Vates 🪐 Co-Founder & CEO)

SMAPIv3 isn't production ready yet. It lacks storage migration, snapshots, exports/backups and such.

• petr.bena @olivierlambert

@olivierlambert

OK. BTW, is there any reason we are using such an ancient kernel? I am fairly sure that a newer kernel addresses a huge number of bugs that have been identified in the libceph kernel implementation over time. Or is there any guide on how to build a custom XCP-ng dom0 kernel from a newer upstream? (I understand: no support, no warranty, etc.)

• olivierlambert (Vates 🪐 Co-Founder & CEO)

Yes, there are some legacy patches that don't work with a recent kernel. All of this will be fixed for our major release next year (9.0).

• petr.bena @olivierlambert

@olivierlambert Great! Is there any repository where I can find those patches? Maybe I can try to port them to a newer kernel and see if that resolves some of the problems I am seeing. Thanks.

• olivierlambert (Vates 🪐 Co-Founder & CEO)

It's more complicated than that, I'm afraid. But believe me, this is taken into account by both the XenServer and XCP-ng teams 🙂

• petr.bena

OK, on another note - here is a concept that others might find interesting in the future, and that would be very nice to have supported in XCP-ng 9 or later: the ability to run user-provided containers in some Docker-like environment. I know it's considered very bad practice to run anything extra in dom0, but I noticed that CEPH OSDs (the daemons responsible for managing the actual storage units, like HDDs or SSDs, in CEPH) benefit from being as close to the hardware as possible.

In the hyperconverged setup I am working with, this means that ideally the OSDs should have as direct access to the storage hardware as possible, which is unfortunately only achievable in a few ways:

• Fake "passthrough" via the udev SR (that's what I do now - not great, not terrible; there is still a lot of "emulation" and overhead from the TAP mechanism and the many hypercalls needed to access the HW, wasting CPU and other resources)
• Real passthrough via PCI-E (VT-d) - this works great in theory, but since you are giving direct access to the storage controller, you need a dedicated one (an HBA card) just for the VM that runs the OSDs; it is fast but requires expensive extra hardware and is complex to set up (see the sketch right after this list)
• Utilize the fact that dom0 has direct access to the HW and run the OSDs directly in dom0 - that's where a container engine like Docker would be handy
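
For illustration, this is roughly what the PCI passthrough route involves on XCP-ng 8.x, as far as I understand it; the PCI address and the VM UUID below are just placeholders for your own HBA and OSD VM:

    # Hide the HBA from dom0 so it can be passed through (placeholder PCI address)
    /opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:04:00.0)"
    reboot
    # After the reboot, hand the hidden device to the VM that runs the OSDs
    xe vm-param-set other-config:pci=0/0000:04:00.0 uuid=<osd-vm-uuid>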

The last option (OSDs directly in dom0) is the cheapest way to get real, direct HW access while sharing the storage controller with dom0, in an at least somewhat isolated environment that doesn't taint the dom0 runtime too much. Unfortunately, my experiments with this have failed so far for a couple of reasons:

• The kernel is too old and some modules are missing (Docker needs the network bridge kernel module to work properly, which is not available unless you switch to the bridge network backend - see the snippet after this list)
• The runtime is too old (it is based on CentOS 7)
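
For reference, switching the network backend is, if I am not mistaken, just this (it requires a host reboot and should be weighed carefully, since XCP-ng defaults to openvswitch):

    # Switch dom0 from openvswitch to the Linux bridge backend (reboot required)
    xe-switch-network-backend bridge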

If this kind of containerization were supported by default, it would make it possible to host these "pluggable" software-defined storage solutions directly in dom0, utilizing the HW much more efficiently.

• olivierlambert (Vates 🪐 Co-Founder & CEO)

This is why a storage domain will be useful (a dedicated VM for storage, with passthrough of devices like NVMe drives).

But there are also things to square away first, like making XAPI and SMAPI aware of such a "mission" for those dedicated VMs. That's also planned on our side 🙂

• petr.bena

Hello, just a follow-up: I figured out a probable fix for the performance issues. (The locking issue seems to have disappeared on its own; I suspect it happened only because of the upgrade process, as the pool contained a mix of 8.0 and 8.2 hosts.)

It was caused by very slow (several-second) executions of basic LVM commands - pvs, lvs, etc. took many seconds. When started with debug options, they appeared to spend an excessive amount of time scanning the iSCSI volumes in /dev/mapper, as well as the individual LVs that were also presented in /dev/mapper as if they were PVs. LVM subsequently ignored them anyway, but in my case there were hundreds of LVs, and each one had to be opened to check its metadata and size.
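
For anyone hitting something similar, this is roughly how I observed it - nothing XCP-ng specific, just plain LVM tooling:

    # See how long a basic LVM command takes end to end
    time pvs
    # Very verbose debug output; shows the devices LVM opens and filters while scanning
    pvs -vvvv 2>&1 | less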

After modifying /etc/lvm/master/lvm.conf by adding this:

    # /dev/sd.* is there to avoid scanning raw disks that are used via the udev SR storage backend
    filter = [ "r|/dev/sd.*|", "r|/dev/mapper/.*|" ]

The performance of LVM commands improved from ~5 seconds to less than 0.1 seconds, and the issue with slow startup / shutdown / snapshotting of VMs (sometimes these operations took almost 10 minutes) was resolved.

Of course, this filter needs to be adjusted to the specific needs of the given situation. In my case, /dev/sd* and /dev/mapper devices are NEVER used by LVM-backed SRs, so it was safe for me to ignore them (all my LVM SRs are on /dev/rbd/).
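
To double-check that the filter actually took effect (assuming the change is propagated from the master copy to the live /etc/lvm/lvm.conf as well), something like this should do:

    # Confirm the filter is present in both the master copy and the live config
    grep -n "filter = " /etc/lvm/master/lvm.conf /etc/lvm/lvm.conf
    # Re-time a basic command to confirm the speedup
    time pvs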
