    • Native Ceph RBD SM driver for XCP-ng

      Hello,

      I have been using Ceph for a very long time. I started back in the days of the old Citrix XenServer (version 6, I think).

      For a long time I used RBD in a hacky way, via LVM with a shared attribute, which had major issues in the latest XCP-ng. I then migrated to CephFS, which is relatively stable but has its own issues related to the nature of CephFS: reduced performance, dependency on the MDS, etc.

      I finally decided to step outside my comfort zone and write my own SM driver for actual RBD, and after some testing move it to my production cluster. I know there is already another GitHub project, written by someone else, that has been unmaintained for many years and exists in various versions. I didn't want to bother trying to understand how that one is implemented. I already know how to use RBD with Ceph; I just needed to put it into an SM driver. So I made my own.

      I will see how it goes. There are already drawbacks to "native" RBD compared to CephFS: while IO performance is superior, the "meta" performance (creating disks, snapshots, deleting disks, rescanning the SR) is actually slower, because it relies on native Ceph APIs instead of the very fast low-level "file access" of CephFS. But I still think it could be a valuable addition for people who need raw IO performance.

      My driver is fully open source and available on GitHub (link below). I currently target XCP-ng 8.2 and SMAPIv2, because that's what I use on my production cluster, which is what I am primarily making this for. Eventually I will try to test it with XCP-ng 8.3, and when SMAPIv3 is finally ready, I might port it there as well.

      Here is the code: https://github.com/benapetr/CephRBDSR/blob/master/CephRBDSR.py

      There is also an installation script that makes installing the SM driver pretty simple. It may need a tweak, as I am not sure whether manually adding SM modules to /etc/xapi.conf is really a good idea 🙂

      Please note it's a work in progress, it's unfinished, and some features probably don't work yet.

      What is already tested and works:

      • SR creation, unplug, plug, rescan, stat
      • Basic RBD / VDI manipulation - creating VDI, deleting VDI, opening VDI / mapping VDI, copying VDI

      It's really just managing RBD block devices for you, and it uses aio to map VDIs to them.
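
      To give an idea of what "managing RBD block devices" means in practice, here is a minimal sketch of the `rbd` CLI calls involved in a VDI lifecycle. This is not the code from CephRBDSR.py: the helper names, the pool name and the image name are made up for illustration, and the helpers only build the command lines, they don't run anything.

```python
# Hypothetical helpers, not taken from CephRBDSR.py.
# Each one only builds an argv list for the standard `rbd` CLI.

def image_spec(pool, image):
    """Return the 'pool/image' spec used by the rbd CLI."""
    return "{}/{}".format(pool, image)

def rbd_create(pool, image, size_mib):
    # `rbd create --size` takes the size in MiB when no unit is given
    return ["rbd", "create", "--size", str(size_mib), image_spec(pool, image)]

def rbd_map(pool, image):
    # When really executed, `rbd map` prints the block device path (e.g. /dev/rbd0)
    return ["rbd", "map", image_spec(pool, image)]

def rbd_unmap(pool, image):
    return ["rbd", "unmap", image_spec(pool, image)]

def rbd_remove(pool, image):
    return ["rbd", "rm", image_spec(pool, image)]

# Placeholder pool and image names, purely for illustration
for argv in (rbd_create("xcp", "vdi-example", 10240),
             rbd_map("xcp", "vdi-example"),
             rbd_unmap("xcp", "vdi-example"),
             rbd_remove("xcp", "vdi-example")):
    print(" ".join(argv))
```

      To actually run one of these you would pass the list to something like subprocess.check_output on a host that has the Ceph client configured.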

      What is not tested yet:

      • Snapshots
      • Cluster behaviour

      I only recommend it for use on dev XCP-ng environments at the moment. I think I might have a fully tested version 1.0 within a month.

      Feedback and suggestions welcome!

      posted in Development
      benapetr
    • RE: Native Ceph RBD SM driver for XCP-ng

      Well, I found it. The culprit is not even the SM framework itself but the XAPI implementation. The problem is here:

      https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi/xapi_vm_snapshot.ml#L250

      This logic is hard-coded into XAPI. When you revert to a snapshot, it:

      • First deletes the VDI image(s) of the VM (the "root" of the entire snapshot hierarchy - deleting such an image is actually illegal in Ceph)
      • Then creates new VDIs from the snapshot
      • Then modifies the VM and rewrites all VDI references to point to the newly created clones of the snapshot

      This is fundamentally incompatible with the native Ceph snapshot logic, because in Ceph:

      • You can create any number of snapshots you want for an RBD image, but it is then illegal to delete the RBD image as long as any snapshot exists. Ceph uses layered CoW for this. Snapshots are always read-only (which seems to be fine in the Xen world).
        • Creation of a snapshot is very fast
        • Deletion of a snapshot is also very fast
        • Rollback to a snapshot is somehow also very fast (IDK how Ceph does this though)
      • You can create a clone (a new RBD image) from a snapshot. That creates a parent reference to the snapshot, i.e. the snapshot can't be deleted until you make the new RBD image independent of it via a flatten operation (IO heavy and very slow).
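
      For reference, the operations listed above map to these `rbd` CLI commands. A minimal sketch: the helper names and the pool, image and snapshot names are placeholders of mine, and the helpers only build the command lines without executing them. Depending on the Ceph version and clone format, a snapshot may also need `rbd snap protect` before it can be cloned.

```python
# Hypothetical helpers that only build argv lists for the standard `rbd` CLI.
# Nothing is executed; pool, image and snapshot names are placeholders.

def snap_spec(pool, image, snap):
    """Return the 'pool/image@snap' spec used by the rbd CLI."""
    return "{}/{}@{}".format(pool, image, snap)

def snap_create(pool, image, snap):
    return ["rbd", "snap", "create", snap_spec(pool, image, snap)]

def snap_remove(pool, image, snap):
    return ["rbd", "snap", "rm", snap_spec(pool, image, snap)]

def snap_rollback(pool, image, snap):
    # Reverts the image itself to the snapshot, no new image is created
    return ["rbd", "snap", "rollback", snap_spec(pool, image, snap)]

def clone(pool, image, snap, child):
    # Creates a new image whose parent is the snapshot
    return ["rbd", "clone", snap_spec(pool, image, snap),
            "{}/{}".format(pool, child)]

def flatten(pool, child):
    # Copies the parent data into the clone so it no longer depends on the snapshot
    return ["rbd", "flatten", "{}/{}".format(pool, child)]

for argv in (snap_create("xcp", "A", "S"),
             snap_rollback("xcp", "A", "S"),
             clone("xcp", "A", "S", "B"),
             flatten("xcp", "B"),
             snap_remove("xcp", "A", "S")):
    print(" ".join(argv))
```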

      In simple words, when:

      • You create image A
      • You create snapshot S (A -> S)

      You can very easily (cheap IO) drop S or revert A to S. However, if you do what Xen does:

      • You create image A
      • You create snapshot S (A -> S)
      • You want to revert to S - Xen clones S to B (A -> S -> B) and replaces the VDI ref in the VM from A to B

      Now it's really hard for Ceph to clean up both A and S as long as B depends on both of them in the CoW hierarchy. Making B independent is IO heavy.
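
      The A -> S -> B dependency can be illustrated with a tiny in-memory model. This is a toy of my own that only encodes the two rules described above (no image removal while snapshots exist, no snapshot removal while clones exist). It does not talk to Ceph, and all class and method names are made up.

```python
class RBDError(Exception):
    """Raised when the toy model refuses an operation."""


class ToyPool:
    """Toy in-memory model of the rules described above. Not real Ceph."""

    def __init__(self):
        self.parent = {}  # image name -> "image@snap" it was cloned from, or None
        self.clones = {}  # "image@snap" -> set of images cloned from it

    def create(self, image):
        self.parent[image] = None

    def snap_create(self, image, snap):
        self.clones["{}@{}".format(image, snap)] = set()

    def clone(self, image, snap, child):
        key = "{}@{}".format(image, snap)
        self.clones[key].add(child)
        self.parent[child] = key

    def snap_rm(self, image, snap):
        key = "{}@{}".format(image, snap)
        if self.clones[key]:
            raise RBDError(key + " still has clones")
        del self.clones[key]

    def rm(self, image):
        if any(key.startswith(image + "@") for key in self.clones):
            raise RBDError(image + " still has snapshots")
        key = self.parent.pop(image)
        if key is not None:
            self.clones[key].discard(image)

    def flatten(self, image):
        # In real Ceph this is the IO-heavy step; here it just drops the link
        key = self.parent[image]
        if key is not None:
            self.clones[key].discard(image)
            self.parent[image] = None


def allowed(operation, *args):
    """Return True if the operation succeeds, False if the model refuses it."""
    try:
        operation(*args)
        return True
    except RBDError:
        return False


pool = ToyPool()
pool.create("A")
pool.snap_create("A", "S")
pool.clone("A", "S", "B")               # what the Xen-style revert effectively does
print(allowed(pool.rm, "A"))            # False: A still has snapshot S
print(allowed(pool.snap_rm, "A", "S"))  # False: S still has clone B
pool.flatten("B")                       # make B independent
print(allowed(pool.snap_rm, "A", "S"))  # True
print(allowed(pool.rm, "A"))            # True
```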

      What I can do as a nasty workaround is hide the VDI for A, and when the user decides to delete S, just hide S as well and schedule a flatten of B as a background GC cleanup job (I need to investigate what my options are here). Once the flatten finishes, the job would wipe S and subsequently A (if S was the last remaining snapshot of A).
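
      The ordering such a GC job would have to follow could look roughly like this. It is only a sketch of the idea, nothing like it exists in the driver yet, and the function name, its parameters and the step names are hypothetical.

```python
def plan_cleanup(image, snap, clones, other_snaps, image_hidden):
    """Return the ordered steps for deferred cleanup of a hidden snapshot.

    image        -- name of the parent image (A)
    snap         -- name of the hidden snapshot (S)
    clones       -- images cloned from the snapshot (B, ...)
    other_snaps  -- other snapshots of the parent image that still exist
    image_hidden -- True if the parent image itself is hidden (no longer used)
    """
    # 1. Every clone must be flattened first (IO heavy, hence a background job)
    steps = [("flatten", child) for child in sorted(clones)]
    # 2. Only then can the snapshot be removed
    steps.append(("snap_rm", "{}@{}".format(image, snap)))
    # 3. The parent image can go only if it is hidden and has no snapshots left
    if image_hidden and not other_snaps:
        steps.append(("rm", image))
    return steps


for step in plan_cleanup("A", "S", {"B"}, [], True):
    print(step)
```

      With another snapshot still present on A, the last step is left out and A stays hidden until that snapshot is cleaned up too.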

      That would work, but it would still be an awfully inefficient software emulation of CoW, completely ignoring that we can get real CoW from Ceph that is actually efficient and IO cheap (because it happens at the storage-array level).

      Now I perfectly understand why nobody ever managed to deliver native RBD support to Xen: the XAPI design makes it near-impossible. No wonder we ended up with weird (and also inefficient) hacks like an LVM pool on top of a single RBD image, or CephFS.

      posted in Development
      benapetr