CephFS - why not Ceph RBD as an SR?
-
Hello,
I was using XCP-ng 8.0 (my bad that I updated to 8.2.1) for years with an unofficial combination of Ceph RBD (shared block devices such as /dev/rbd0, /dev/rbd1, etc.) on top of which I created a shared LVM SR (regular LVM with the attribute shared=true).
This may sound crazy at first glance, but it worked extremely well. First of all, there seems to be a locking mechanism (probably at the xenopsd level) that made sure each LV was only accessed by the host running the VM it belonged to, and since the VG was modified only by the pool master, I never had any kind of metadata corruption.
On top of that, RBD, being a block device image intended for hosting VM virtual disks, was much faster than CephFS or NFS and very easy to manage (I could set a target size for the image in Ceph and then just have XCP-ng "partition" it via LVM).
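For context, the rough shape of the setup was something like this (the pool name, image name, size, and SR label below are just placeholders, not my exact commands):
# create an RBD image sized for the whole SR
rbd create rbd/xcp-sr --size 2T
# map it on every host in the pool; it shows up as /dev/rbd0 (or under /dev/rbd/<pool>/<image>)
rbd map rbd/xcp-sr
# then, from the pool master, create the (unsupported) shared LVM SR on top of the mapped device
xe sr-create host-uuid=<pool-master-uuid> name-label="RBD-LVM" type=lvm content-type=user shared=true device-config:device=/dev/rbd0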
Everything was great until the day I decided that, after many years, it was perhaps a good time to upgrade to a newer XCP-ng, where it all stopped working, probably due to the overhaul of the SM subsystem and new LVM locking mechanisms.
During the upgrade (possibly because one host was still running 8.0) it all failed catastrophically (there were probably some lock clashes). I got it restored, but it is now mounted on only one host to avoid these locking issues again.
The main problem, besides the instability, is that it's terribly slow: even starting a single VM with a single disk on this SR now takes about 6 minutes. Once the VM starts, disk access is very good and fast, but any low-level operations like adding a disk or creating a snapshot take an extremely long time.
I suppose the reason is that I am simply using the LVM SR for something it's probably not intended for, but then the question is: why is it not intended for this? Since it already supports a shared flag, why couldn't it be made to work on a shared disk image like RBD (or some HBA LUN)?
Perhaps it would even make sense to implement a cLVM SR that supports cLVM's networked locks? Or maybe just implement an RBD SR? There is already a Git project for it, but I don't know whether it works with the latest XCP-ng: https://github.com/rposudnevskiy/RBDSR
Ceph is an extremely robust and versatile solution, in many ways far superior to Gluster and similar stacks. I think the community would benefit from it tremendously.
I understand that implementing native RBD support is a lot of work, much more than just copy-pasting the NFS SR plugin and changing the way it mounts (as was done with CephFS), but perhaps we could consider getting this done via the LVM SR backend. It's not a bad design; the only thing that needs to be refined is reliable and fast locking, which is missing now, as LVM is probably intended to be used on a single host only.
I think that with a few modifications to the LVM SR, this shared LVM concept could be implemented easily.
-
Hi,
We are now going with XOSTOR (not using Gluster anymore); that might be the solution in your case
-
@olivierlambert hello, how is that implemented under the hood? What tech is it using?
I really like Ceph because it is extremely robust. I have run into so many weird disk failures and recovery scenarios over the years I've worked with it, and it always recovered from everything with ease, pretty much online and with no downtime, while supporting very advanced disk setups. It's extremely sophisticated in the way it replicates and manages data, and it has a huge community around it.
-
Please read: https://xcp-ng.org/forum/topic/5361/xostor-hyperconvergence-preview
It's very easy to install and set up
-
Obviously, as soon as SMAPIv3 is "complete" enough, it will be easy to write a "Ceph driver" for it, and contributions will be welcome
-
I noticed that RBDSR seems to be written for SMAPIv3. Is SMAPIv3 already available in XCP-ng 8.2.1? Would that mean https://github.com/rposudnevskiy/RBDSR is compatible with XCP-ng 8.2.1? Its readme suggests it was made for 7.5, but it's hard to tell.
-
I will experiment with it in my lab, but I was hoping I could stick with the shared LVM concept, because it's really easy to manage and work with.
-
SMAPIv3 isn't production-ready yet. It lacks storage migration, snapshots, exports/backups and such.
-
OK, BTW, is there any reason we are using such an ancient kernel? I am fairly sure that newer kernel versions address a huge number of bugs identified in the libceph kernel implementation over time. Or is there any guide on how to build a custom XCP-ng dom0 kernel from a newer upstream? (I understand: no support, no warranty, etc.)
-
Yes, there are some legacy patches that don't work with a recent kernel. All of this will be fixed for our major release next year (9.0)
-
@olivierlambert great! Is there any repository where I can find those patches? Maybe I can try to port them to a newer kernel and see if that resolves some of the problems I am seeing. Thanks
-
It's more complicated than that, I'm afraid. But believe me, this is taken into account by both the XenServer and XCP-ng teams
-
OK, on another note: a concept that others might find interesting in the future, and that would be very nice to have supported in XCP-ng 9 or newer versions, is the ability to run user-provided containers in some Docker-like environment. I know it's considered very bad practice to run anything extra in dom0, but I have noticed that Ceph OSDs (the daemons responsible for managing the actual storage units like HDDs or SSDs in Ceph) benefit from being as close to the hardware as possible.
In the hyperconverged setup I am working with, this means that ideally the OSDs should have as direct access to the storage hardware as possible, which is unfortunately only achievable in a few ways:
- Fake "passthrough" via udev SR (that's what I do now - not great, not terrible, still a lot of "emulation" and overhead from TAP mechanism and many hypercalls needed to access the HW, wasting the CPU and other resources)
- Real passthrough via PCI-E (VT-d) passthrough - this works great in theory, but since you are giving direct access to storage controller, you need a dedicated one (HBA card) just for the VM that runs OSDs - works fast but requires expensive extra hardware and is complex to setup.
- Utilize fact that dom0 has direct access to HW and run OSDs directly in dom0 - that's where container engine like docker would be handy.
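For reference, option two on XCP-ng 8.x looks roughly like this, if I remember the syntax correctly (the PCI address and the VM UUID are placeholders for your own HBA and OSD VM):
# hide the HBA from dom0 so it can be handed to the OSD VM
/opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:0b:00.0)"
# after a dom0 reboot, attach the hidden device to the VM that runs the OSDs
xe vm-param-set uuid=<osd-vm-uuid> other-config:pci=0/0000:0b:00.0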
The last option is the cheapest way to get real, direct HW access while sharing the storage controller with dom0, in an at least somewhat isolated environment that doesn't taint the dom0 runtime too much. Unfortunately, my experiments with this have failed so far for a couple of reasons:
- The kernel is too old and some modules are missing (Docker needs the bridge kernel module to work properly, which is not available unless you switch to the bridge network backend; see the checks below)
- The runtime is too old (based on CentOS 7)
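For anyone wanting to reproduce the first point, these are roughly the checks I did (switching the network backend requires a host reboot, as far as I know):
modinfo bridge                          # is the bridge module shipped with the dom0 kernel at all?
lsmod | grep -E 'bridge|br_netfilter'   # is it (and br_netfilter, which Docker also wants) loaded?
xe-switch-network-backend bridge        # switch dom0 from openvswitch to the Linux bridge backend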
If this kind of containerization were supported by default, it would make it possible to host these "pluggable" software-defined storage solutions directly in dom0, utilizing the HW much more efficiently.
-
This is why a storage domain will be useful (a dedicated VM for storage, with device passthrough such as NVMe).
But there are also things to square away first, like making XAPI and SMAPI aware of the "mission" of those dedicated VMs. That's also planned on our side
-
Hello, just a follow-up: I figured out a probable fix for the performance issues. (The locking issue seems to have disappeared on its own; I suspect it only happened because of the upgrade process, while the pool contained a mix of 8.0 and 8.2 hosts.)
It was caused by very slow (multi-second) execution of basic LVM commands: pvs, lvs, etc. took many seconds. When run with debug options, they appeared to spend an excessive amount of time scanning the iSCSI volumes in /dev/mapper, as well as the individual LVs that were also presented in /dev/mapper as if they were PVs. LVM subsequently ignored them anyway, but in my case those were hundreds of LVs, and each had to be opened to check its metadata and size.
After modifying /etc/lvm/master/lvm.conf by adding this:
# /dev/sd.* is there to avoid scanning raw disks that are used via udev for the storage backend
filter = [ "r|/dev/sd.*|", "r|/dev/mapper/.*|" ]
the performance of LVM commands improved from ~5 seconds to under 0.1 second, and the issue with slow VM startup / shutdown / snapshot creation (sometimes they took almost 10 minutes) was resolved.
Of course, this filter needs to be adjusted to the specific needs of a given setup. In my case, /dev/sd* and /dev/mapper devices are NEVER used by my LVM-backed SRs, so it was safe to ignore them (all my LVM SRs are on /dev/rbd/).
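For anyone applying a similar filter, a quick before/after check is enough to confirm it helps (the timings in the comments are just what I saw on my pool):
time pvs                                            # ~5 s before the filter for me, well under 0.1 s after
time lvs
grep -E '^[[:space:]]*filter' /etc/lvm/master/lvm.conf   # double-check which filter is actually in place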