XCP-ng

    benapetr

    @benapetr


    Best posts made by benapetr

    • Native Ceph RBD SM driver for XCP-ng

      Hello,

      I have been using CEPH for a very long time; I started back in the days of the old Citrix XenServer (version 6, I think).

      For a long time I used RBD in a hacky way via LVM with a shared attribute, which had major issues in the latest XCP-ng. Then I migrated to CephFS, which is relatively stable but has its own issues related to the nature of CephFS (reduced performance, dependency on the MDS, etc.).

      I finally decided to step outside my comfort zone, write my own SM driver for actual RBD and, after some testing, move it to my production cluster. I know there is already another GitHub project written by someone else, but it has been unmaintained for many years and exists in various versions. I didn't want to bother trying to understand how that one is implemented - I already know how to use RBD with CEPH, I just needed to put it into an SM driver. So I made my own.

      I will see how it goes. "Native" RBD already has drawbacks compared to CephFS: while IO performance is superior, the "meta" performance (creating disks, snapshots, deleting disks, rescanning the SR) is actually slower, because it relies on native CEPH APIs instead of the very fast low-level "file access" of CephFS. But I still think it could be a valuable addition for people who need raw IO performance.

      My driver is fully open source and available here. I currently target XCP-ng 8.2 and SMAPIv2, because that's what I use on my production cluster, which is what I am primarily making this for. Eventually I will try to test it with XCP-ng 8.3, and when SMAPIv3 is finally ready, I might port it there as well.

      Here is the code: https://github.com/benapetr/CephRBDSR/blob/master/CephRBDSR.py

      There is also an installation script that makes installing the SM driver pretty simple. It may need a tweak, as I am not sure whether manually adding SM modules to /etc/xapi.conf is really a good idea 🙂

      Please note it's a work in progress, it's unfinished, and some features probably don't work yet.

      What is already tested and works:

      • SR creation, unplug, plug, rescan, stat
      • Basic RBD / VDI manipulation - creating VDI, deleting VDI, opening VDI / mapping VDI, copying VDI

      It really just manages RBD block devices for you and uses aio to map VDIs to them.
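To illustrate what "managing RBD block devices" boils down to, here is a minimal sketch of the rbd CLI calls behind one VDI's lifecycle. It is not taken from the driver: the function name, pool name and image name are made up for illustration, and the real driver may well talk to Ceph differently. The sketch only builds the argument lists, so it can be read and run without a cluster.

```python
import shlex


def rbd_commands(pool, image, size_mb):
    """Build the rbd CLI invocations for one VDI's lifecycle as argv lists.

    Hypothetical helper: nothing is executed here, the lists could be passed
    to subprocess on a host that has the rbd tool and a reachable cluster.
    """
    spec = "{}/{}".format(pool, image)
    return {
        # --size is interpreted in megabytes when no unit suffix is given
        "create": ["rbd", "create", spec, "--size", str(size_mb)],
        # map exposes the image as a local block device (/dev/rbdN)
        "map": ["rbd", "map", spec],
        "unmap": ["rbd", "unmap", spec],
        "delete": ["rbd", "rm", spec],
    }


if __name__ == "__main__":
    # Pool and image names below are placeholders.
    commands = rbd_commands("xcp-pool", "vdi-1234", 10240)
    for step in ("create", "map", "unmap", "delete"):
        print("{}: {}".format(step, " ".join(shlex.quote(a) for a in commands[step])))
```

Keeping the commands as plain lists makes it easy to log them before running anything against a production pool.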

      What is not tested yet:

      • Snapshots
      • Cluster behaviour

      I only recommend it for dev XCP-ng environments at the moment. I think I might have a fully tested version 1.0 within a month.

      Feedback and suggestions welcome!

      posted in Development
      benapetr
    • RE: Native Ceph RBD SM driver for XCP-ng

      Well, I found it. The culprit is not even the SM framework itself but the XAPI implementation; the problem is here:

      https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi/xapi_vm_snapshot.ml#L250

      This logic is hard-coded into XAPI - when you revert to snapshot it:

      • First deletes the VDI image(s) of the VM (the "root" of the entire snapshot hierarchy - deleting such an image is actually illegal in CEPH)
      • Then it creates new VDIs from the snapshot
      • Then it modifies the VM and rewrites all VDI references to newly created clones from the snapshot

      This is fundamentally incompatible with the native CEPH snapshot logic because in CEPH:

      • You can create any number of snapshots you want for an RBD image - but that makes it illegal to delete the RBD image as long as any snapshot exists. CEPH uses layered CoW for this; however, snapshots are always read-only (which actually seems to be fine in the Xen world).
        • Creation of snapshot is very fast
        • Deletion of snapshot is also very fast
        • Rollback of snapshot is somehow also very fast (IDK how CEPH does this though)
      • You can create a clone (a new RBD image) from a snapshot - that creates a parent reference to the snapshot, i.e. the snapshot can't be deleted until you make the new RBD image independent of it via a flatten operation (IO heavy and very slow).

      In simple words when:

      • You create image A
      • You create snapshot S (A -> S)

      You can very easily (cheap IO) drop S or revert A to S. However if you do what Xen does:

      • You create image A
      • You create snapshot S (A -> S)
      • You want to revert to S - Xen clones S to B (A -> S -> B) and replaces the VDI ref in the VM from A to B

      Now it's really hard for CEPH to clean up both A and S as long as B depends on both of them in the CoW hierarchy. Making B independent is IO heavy.
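The dependency rules in the A -> S -> B example can be modelled with a small in-memory toy. This is only a sketch of the constraints described in this post, not Ceph code: the class and method names are invented, and it ignores details such as snapshot protection.

```python
class RbdError(Exception):
    """Raised when the toy model refuses an operation."""


class ToyPool(object):
    """In-memory toy that models only the deletion rules described above."""

    def __init__(self):
        self.images = {}    # image name -> parent (image, snap) tuple or None
        self.snaps = set()  # (image, snap) pairs

    def create(self, image):
        self.images[image] = None

    def snap_create(self, image, snap):
        self.snaps.add((image, snap))

    def clone(self, image, snap, child):
        if (image, snap) not in self.snaps:
            raise RbdError("no such snapshot")
        self.images[child] = (image, snap)

    def flatten(self, child):
        # In real Ceph this copies the parent data into the child (IO heavy).
        self.images[child] = None

    def snap_remove(self, image, snap):
        if (image, snap) in self.images.values():
            raise RbdError("snapshot has dependent clones")
        self.snaps.discard((image, snap))

    def remove(self, image):
        if any(owner == image for owner, _ in self.snaps):
            raise RbdError("image has snapshots")
        del self.images[image]


if __name__ == "__main__":
    pool = ToyPool()
    pool.create("A")
    pool.snap_create("A", "S")
    pool.clone("A", "S", "B")  # what a Xen-style revert produces
    for attempt in (lambda: pool.remove("A"), lambda: pool.snap_remove("A", "S")):
        try:
            attempt()
        except RbdError as error:
            print("refused: {}".format(error))
    pool.flatten("B")          # only now can S and then A go away
    pool.snap_remove("A", "S")
    pool.remove("A")
    print(sorted(pool.images))
```

In the toy, both removals are refused until B has been flattened, which is exactly the ordering problem the revert flow runs into.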

      As a nasty workaround, I can hide the VDI for A, and when the user decides to delete S, I would just hide S as well and schedule a flatten of B as a background GC cleanup job (I need to investigate what my options are here). Once it finishes, the job would wipe S and subsequently A (if S was the last remaining snapshot of A).
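A rough sketch of what such a background cleanup job would have to run, in order, expressed as rbd CLI argument lists. The function and its parameters are hypothetical, and the unprotect step assumes the snapshot was protected before cloning (as the original clone format requires); a real GC job would also need locking and error handling.

```python
def revert_cleanup_plan(pool, image, snap, clone, last_snapshot=True):
    """Return the ordered rbd calls (argv lists) for the deferred cleanup.

    Hypothetical helper for illustration; nothing is executed here.
    """
    snap_spec = "{}/{}@{}".format(pool, image, snap)
    plan = [
        # 1. Copy parent data into the clone so it no longer depends on snap.
        ["rbd", "flatten", "{}/{}".format(pool, clone)],
        # 2. The snapshot can only be removed once it is unprotected.
        ["rbd", "snap", "unprotect", snap_spec],
        ["rbd", "snap", "rm", snap_spec],
    ]
    if last_snapshot:
        # 3. With no snapshots left, the hidden original image can go too.
        plan.append(["rbd", "rm", "{}/{}".format(pool, image)])
    return plan


if __name__ == "__main__":
    # Names are placeholders matching the A / S / B example.
    for argv in revert_cleanup_plan("xcp-pool", "A", "S", "B"):
        print(" ".join(argv))
```

The flatten step is the expensive one; everything after it is cheap metadata work.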

      That would work, but it would still be an awfully inefficient software emulation of CoW, completely ignoring that we can get real CoW from CEPH that is actually efficient and IO cheap (because it happens at the storage-array level).

      Now I perfectly understand why nobody ever managed to deliver native RBD support to Xen - the XAPI design simply makes it near-impossible. No wonder we ended up with weird (and also inefficient) hacks like an LVM pool on top of a single RBD, or CephFS.

      posted in Development
      benapetr

    Latest posts made by benapetr

    • Building XCP-ng from source code

      Hello,

      Do we have any resources on how to build a fully working XCP-ng (dom0, installer, .iso etc.) from source code?

      There are a few "helper" repos with scripts, like the dev container for SRPM building, but I couldn't find any set of scripts to actually build the installer for the dom0 OS (which is IMHO some fork of CentOS 7).

      I wanted to experiment with getting a working dom0 based on something much newer with the latest kernel, like Rocky Linux 10, but I don't want to reinvent everything. I suppose there are some build scripts, right? Or how do you produce the .iso file?

      posted in Development
      benapetr
    • RE: Native Ceph RBD SM driver for XCP-ng

      @psafont thanks for the reply, but isn't that 16-year-old logic part of XAPI? I mean, this same hacky logic is present in SMAPIv3, isn't it?

      I was going through the SMAPIv3 docs, and from the SM driver perspective (feature-wise) it doesn't seem much different. It looks to me more like a lot of cosmetic changes that make packaging and modularization easier (definitely a good thing) but don't really change any fundamental SM logic - the RPCs are all the same as in SMAPIv1. Even porting my own driver is probably going to be pretty trivial: it's just about splitting it into multiple files and adding some wrappers around it. But it still won't solve my problem - the rollback RPC is just not there, so I would instead need to support this "rollback by making another snapshot of a snapshot" logic enforced by XAPI.

      posted in Development
      benapetr
    • RE: Native Ceph RBD SM driver for XCP-ng

      Meanwhile, until I finish the RBD-native driver (which is probably going to take much longer than I anticipated and is probably never going to have an "ideal snapshot rollback mechanism"), I also decided to create another driver called LVMoRBD, which is essentially the same as LVMoHBA - it builds an LVM SR on top of an RBD block device and is meant to work without the need for complex hacks - https://github.com/benapetr/CephRBDSR/commit/fbde8b49b180d4de60ffea477dffe712f07a4d07

      It's really a rather trivial wrapper. Its main benefits, as described in the commit message, are the ability to auto-mount the RBD image on reboot, the use of its own LVM config that enables RBD devices (so there is no need to modify the default one shipped with XCP-ng), and some other things that make creating an LVM SR on top of RBD far easier and more natural.
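As a sketch of the "own LVM config" idea: LVM ignores rbd block devices unless the device type is declared in the `devices` section, and the `LVM_SYSTEM_DIR` environment variable points the LVM tools at an alternative config directory. The snippet below is an assumption about how such a private config could look, not a copy of what the driver ships; the filter pattern and helper names are illustrative.

```python
import os
import tempfile

# Assumed minimal config; the driver's actual file may differ.
LVM_CONF = """\
devices {
    # Accept rbd block devices as physical volumes.
    types = [ "rbd", 1024 ]
    # Scan only RBD devices and reject everything else.
    filter = [ "a|^/dev/rbd[0-9]+$|", "r|.*|" ]
}
"""


def write_private_lvm_conf(directory):
    """Write lvm.conf into a private directory and return its path."""
    path = os.path.join(directory, "lvm.conf")
    with open(path, "w") as handle:
        handle.write(LVM_CONF)
    return path


def lvm_environment(directory):
    """Environment for LVM commands that should use the private config."""
    env = dict(os.environ)
    env["LVM_SYSTEM_DIR"] = directory
    return env


if __name__ == "__main__":
    private_dir = tempfile.mkdtemp(prefix="lvmorbd-")
    print(write_private_lvm_conf(private_dir))
    print(lvm_environment(private_dir)["LVM_SYSTEM_DIR"])
```

LVM commands started with that environment would see only the RBD-backed volume groups, leaving the stock config untouched.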

      Again - it's not production ready yet. I will be doing many tests on it and probably still need to fix some SCSI-related false-positive errors that sometimes appear in the SM log. I estimate it will be ready much sooner than the RBD-native driver.

      LVM on top of RBD is not ideal, as there is some tiny overhead, but it is still probably better than CephFS when it comes to raw performance (RBD really is something like a LUN from an HBA).

      And yes, I do plan to create SMAPIv3 equivalents later, when I learn how 😉 For now I am still targeting XCP-ng 8.2, as that's what I use in production, and I haven't seen many SMAPIv3 drivers there.

      benapetr committed to benapetr/CephRBDSR
      new driver LVMoRBD
      
      This is a simple wrapper for LVM on RBD which is something I have
      experience with from the past - it's actually possible to do this using
      native LVMSR driver with few hacks - such as modifying the
      master/lvm.conf to work with RBD devices
      
      This SR driver is meant to streamline the access without any need for
      custom hacks.
      
      Its main benefits over using LVMSR:
      
      * It automatically maps and unmaps rbd devices for you on reboot
      * It uses some wrappers to suppress SCSI meta-calls that cause
        unnecessary false-positive warnings in SM log (rbd devices don't have
        SCSI ID)
      * It automatically creates RBD device for you with compatibility
        settings known to work fine with XCP-ng dom0 kernel
      * It uses its custom LVM config file, so you don't need to expose RBD
        LVM structure to hypervisor, or override the default config file
      posted in Development
      benapetr
    • RE: Native Ceph RBD SM driver for XCP-ng

      So, I am making some progress with my driver, but as I start to understand the SM API better, I am also discovering several flaws in its implementation that limit its ability to adapt to advanced storage systems such as CEPH. They are probably the main reason why CEPH was never integrated properly and why the existing community-made integrations are so overly and unnecessarily complex.

      I somewhat hope that SMAPIv3 solves this problem, but I am afraid it doesn't. For now I added this comment into header of that SMAPIv1 driver, which explains the problem:

      # Important notes:
      # Snapshot logic in this SM driver is imperfect at this moment, because the way snapshots are implemented in Xen
      # are fundamentally different from how snapshots work in CEPH, and sadly Xen API doesn't let the SM driver
      # implement this logic itself and instead forces its own (somewhat flawed and inefficient logic)
      
      # That means when the "revert to snapshot" is executed via admin UI - this driver is not notified about it in any way.
      # If it was, it would be able to execute a trivial "rbd rollback" CEPH action which would result in instant rollback
      
      # Instead Xen decides to create a clone from the snapshot by calling clone(), which creates another RBD
      # that is depending on parent snapshot, which is depending on original RBD image we wanted to rollback.
      # Then it calls delete on the original image which is parent of this entire new hierarchy.
      
      # This image is now impossible to delete, because it has a snapshot. Which means we need to perform a background
      # flatten operation, that performs physical 1:1 copy of entire image to the new clone and then destroys the snapshot
      # and original image.
      
      # This is brutally inefficient in comparison to native rollback (as in hours instead of seconds), but it seems with
      # current SM driver implementation it's not possible to do this efficiently, this requires a fix in SM API
      

      Basically, XAPI has its own logic for how snapshots are created and managed, and it forces this exact implementation on everyone - even when the underlying storage has its own snapshot mechanisms that would be FAR more efficient. Because this logic is impossible to override, hook into, or change, there isn't really any efficient way to implement snapshot logic at the CEPH level.

      My suggestion - instead of forcing some internal snapshot logic on SM drivers, abstract it away, just send high level requests to SM drivers such as:

      • Create a snapshot of this VDI
      • Revert this VDI to this snapshot

      I understand this could be a problem for many SM drivers, as the same logic would need to be repeated in each of them. Maybe you could make it so that if an SM driver doesn't implement its own snapshot logic, you fall back to the default one that is implemented now?
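The fallback I have in mind could look roughly like this. Everything here is hypothetical: the class names, the `revert` entry point and the dispatcher are invented to show the idea, and none of it is existing XAPI or SM code. The fallback branch loosely mirrors the current clone-then-delete behaviour.

```python
class LegacyDriver(object):
    """Hypothetical driver without native revert; records calls for the demo."""

    def __init__(self):
        self.calls = []

    def clone(self, snapshot):
        self.calls.append(("clone", snapshot))
        return snapshot + "-clone"

    def delete(self, vdi):
        self.calls.append(("delete", vdi))


class RbdLikeDriver(LegacyDriver):
    """Hypothetical driver whose storage can roll back in place."""

    def revert(self, vdi, snapshot):
        self.calls.append(("revert", vdi, snapshot))


def revert_vdi(driver, vdi, snapshot):
    """Use the driver's own revert if it has one, else the generic path."""
    native = getattr(driver, "revert", None)
    if callable(native):
        native(vdi, snapshot)
        return vdi                       # same VDI, rolled back in place
    replacement = driver.clone(snapshot)  # generic path: new VDI from snapshot
    driver.delete(vdi)
    return replacement


if __name__ == "__main__":
    for driver in (LegacyDriver(), RbdLikeDriver()):
        result = revert_vdi(driver, "A", "S")
        print("{}: {} {}".format(type(driver).__name__, result, driver.calls))
```

Drivers that don't define `revert` keep today's behaviour unchanged, so nothing would have to be reimplemented in them.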

      Anyway, the way the SM subsystem works (at least V1, but I suspect V3 isn't any better here), you can't utilize efficient storage-level features - instead you are reinventing the wheel and implementing the same logic in software in an extremely inefficient way.

      But maybe I am just overlooking something. That's just how it appears to me, as there is absolutely no overridable "revert to snapshot" entry point in SM right now.

      posted in Development
      benapetr
    • RE: Native Ceph RBD SM driver for XCP-ng

      f1f6fafc-6f75-4596-9077-4c7f4a8c6b38-image.png

      posted in Development
      benapetr
    • RE: Native Ceph RBD SM driver for XCP-ng

      @olivierlambert that documentation is really interesting, those diagrams are full of examples of accessing Ceph and RBD via librados, which is exactly what I am doing here LOL

      Did you design those diagrams based on some existing driver? It seems someone on your team must already have studied Ceph concepts if they are in your very own example documentation. Does the RBD driver already exist somewhere in your lab?

      posted in Development
      benapetr
    • RE: Native Ceph RBD SM driver for XCP-ng

      @olivierlambert by SMAPIv2 I mean the predecessor of SMAPIv3 (i.e. the old-school SM driver, same as the CephFSSR driver), or maybe that's called SMAPIv1? IDK, I am genuinely lost in these conventions and have a hard time finding any relevant docs 🙂

      posted in Development
      benapetr