    CEPHFS - why not CEPH RBD as SR?

• petr.bena @olivierlambert

@olivierlambert

I noticed that RBDSR seems to be written for SMAPIv3. Is SMAPIv3 already available in XCP-ng 8.2.1? Would that mean that https://github.com/rposudnevskiy/RBDSR is compatible with XCP-ng 8.2.1? Its readme suggests it was made for 7.5, but it's hard to tell.

• petr.bena

I will experiment with it in my lab, but I was hoping I could stick with the shared LVM concept, because it's really easy to manage and work with.

• olivierlambert (Vates 🪐 Co-Founder & CEO)

SMAPIv3 isn't production ready yet. It lacks storage migration, snapshots, exports/backups and such.

• petr.bena @olivierlambert

@olivierlambert

OK. BTW, is there any reason we are using such an ancient kernel? I am fairly sure that a newer kernel addresses a huge number of bugs that have been identified in the libceph kernel implementation over time. Or is there any guide on how to build a custom XCP-ng dom0 kernel from a newer upstream? (I understand: no support, no warranty, etc.)

• olivierlambert (Vates 🪐 Co-Founder & CEO)

Yes, there are some legacy patches that don't work with a recent kernel. All of this will be fixed for our major release next year (9.0).

• petr.bena @olivierlambert

@olivierlambert Great! Is there any repository where I can find those patches? Maybe I can try to port them to a newer kernel and see if that resolves some of the problems I am seeing. Thanks.

• olivierlambert (Vates 🪐 Co-Founder & CEO)

It's more complicated than that, I'm afraid. But believe me, this is taken into account by both the XenServer and XCP-ng teams 🙂

• petr.bena

OK, on another note - here is a concept that others might find interesting in the future, and that would be very nice to have supported in XCP-ng 9 or later: the ability to run user-provided containers in some Docker-like environment. I know it's considered very bad practice to run anything extra in dom0, but I noticed that CEPH OSDs (the daemons responsible for managing the actual storage units, like HDDs or SSDs, in CEPH) benefit from being as close to the hardware as possible.

In the hyperconverged setup I am working with, this means that ideally the OSDs should have as direct access to the storage hardware as possible, which is unfortunately only achievable in a few ways:

• Fake "passthrough" via the udev SR (that's what I do now - not great, not terrible; there is still a lot of "emulation" and overhead from the TAP mechanism and the many hypercalls needed to access the HW, wasting CPU and other resources)
• Real passthrough via PCI-E (VT-d) - this works great in theory, but since you are giving direct access to the storage controller, you need a dedicated one (an HBA card) just for the VM that runs the OSDs; it is fast but requires expensive extra hardware and is complex to set up (see the sketch right after this list)
• Utilize the fact that dom0 has direct access to the HW and run the OSDs directly in dom0 - that's where a container engine like Docker would be handy
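
For illustration, this is roughly what the PCI passthrough route involves on XCP-ng 8.x, as far as I understand it; the PCI address and the VM UUID below are just placeholders for your own HBA and OSD VM:

    # Hide the HBA from dom0 so it can be passed through (placeholder PCI address)
    /opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:04:00.0)"
    reboot
    # After the reboot, hand the hidden device to the VM that runs the OSDs
    xe vm-param-set other-config:pci=0/0000:04:00.0 uuid=<osd-vm-uuid>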

The last option (OSDs directly in dom0) is the cheapest way to get real, direct HW access while sharing the storage controller with dom0, in an at least somewhat isolated environment that doesn't taint the dom0 runtime too much. Unfortunately, my experiments with this have failed so far for a couple of reasons:

• The kernel is too old and some modules are missing (Docker needs the network bridge kernel module to work properly, which is not available unless you switch to the bridge network backend - see the snippet after this list)
• The runtime is too old (it is based on CentOS 7)
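
For reference, switching the network backend is, if I am not mistaken, just this (it requires a host reboot and should be weighed carefully, since XCP-ng defaults to openvswitch):

    # Switch dom0 from openvswitch to the Linux bridge backend (reboot required)
    xe-switch-network-backend bridge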

If this kind of containerization were supported by default, it would make it possible to host these "pluggable" software-defined storage solutions directly in dom0, utilizing the HW much more efficiently.

• olivierlambert (Vates 🪐 Co-Founder & CEO)

This is why a storage domain will be useful (a dedicated VM for storage, with passthrough of devices like NVMe drives).

But there are also things to square away first, like making XAPI and SMAPI aware of such a "mission" for those dedicated VMs. That's also planned on our side 🙂

• petr.bena

Hello, just a follow-up: I figured out a probable fix for the performance issues. (The locking issue seems to have disappeared on its own; I suspect it happened only because of the upgrade process, as the pool contained a mix of 8.0 and 8.2 hosts.)

It was caused by very slow (several-second) executions of basic LVM commands - pvs, lvs, etc. took many seconds. When started with debug options, they appeared to spend an excessive amount of time scanning the iSCSI volumes in /dev/mapper, as well as the individual LVs that were also presented in /dev/mapper as if they were PVs. LVM subsequently ignored them anyway, but in my case there were hundreds of LVs, and each one had to be opened to check its metadata and size.
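
For anyone hitting something similar, this is roughly how I observed it - nothing XCP-ng specific, just plain LVM tooling:

    # See how long a basic LVM command takes end to end
    time pvs
    # Very verbose debug output; shows the devices LVM opens and filters while scanning
    pvs -vvvv 2>&1 | less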

After modifying /etc/lvm/master/lvm.conf by adding this:

    # /dev/sd.* is there to avoid scanning raw disks that are used via the udev SR storage backend
    filter = [ "r|/dev/sd.*|", "r|/dev/mapper/.*|" ]

The performance of LVM commands improved from ~5 seconds to less than 0.1 seconds, and the issue with slow startup / shutdown / snapshotting of VMs (sometimes these operations took almost 10 minutes) was resolved.

Of course, this filter needs to be adjusted to the specific needs of the given situation. In my case, /dev/sd* and /dev/mapper devices are NEVER used by LVM-backed SRs, so it was safe for me to ignore them (all my LVM SRs are on /dev/rbd/).
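
To double-check that the filter actually took effect (assuming the change is propagated from the master copy to the live /etc/lvm/lvm.conf as well), something like this should do:

    # Confirm the filter is present in both the master copy and the live config
    grep -n "filter = " /etc/lvm/master/lvm.conf /etc/lvm/lvm.conf
    # Re-time a basic command to confirm the speedup
    time pvs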
