    Native Ceph RBD SM driver for XCP-ng

    • benapetr

      Well, I found it. The culprit is not the SM framework itself but the XAPI implementation. The problem is here:

      https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi/xapi_vm_snapshot.ml#L250

      This logic is hard-coded into XAPI. When you revert to a snapshot, it:

      • First deletes the VM's VDI image(s). Each of these is the "root" of the entire snapshot hierarchy, and Ceph does not allow deleting such an image.
      • Then creates new VDIs from the snapshot
      • Then modifies the VM and rewrites all VDI references to point at the newly created clones of the snapshot

      This is fundamentally incompatible with the native Ceph snapshot logic, because in Ceph:

      • You can create any number of snapshots for an RBD image, but the RBD image then cannot be deleted as long as any snapshot exists. Ceph uses layered CoW for this. Snapshots are always read-only (which seems to be fine in the Xen world).
        • Creating a snapshot is very fast
        • Deleting a snapshot is also very fast
        • Rolling back to a snapshot also appears to be very fast (I don't know how Ceph does this, though)
      • You can create a clone (a new RBD image) from a snapshot. That creates a parent reference to the snapshot, i.e. the snapshot can't be deleted until you make the new RBD image independent of it via a flatten operation (IO-heavy and very slow).

      In simple words when:

      • You create image A
      • You create snapshot S (A -> S)

      You can very easily (cheap IO) drop S or revert A to S. However if you do what Xen does:

      • You create image A
      • You create snapshot S (A -> S)
      • You want to revert to S - Xen clones S to B (A -> S -> B) and replaces the VDI ref in the VM from A to B

      Now it's really hard for Ceph to clean up both A and S, because B depends on both of them in the CoW hierarchy. Making B independent is IO-heavy.
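The dependency rules above can be sketched with a toy in-memory model. This is only an illustration of the constraints as described in this post: `ToyPool`, `RbdError`, and `try_op` are hypothetical names, not the librbd API, and real Ceph has more states than this.

```python
class RbdError(Exception):
    """Raised when an operation would violate the dependency rules."""


class ToyPool:
    """Toy model of the RBD rules described above (hypothetical, not librbd)."""

    def __init__(self):
        self.images = {}    # image name -> parent (image, snap) or None
        self.snaps = set()  # set of (image, snap) pairs

    def create(self, image):
        self.images[image] = None

    def snapshot(self, image, snap):
        self.snaps.add((image, snap))

    def clone(self, image, snap, new_image):
        # The clone keeps a parent reference to the snapshot.
        if (image, snap) not in self.snaps:
            raise RbdError("no such snapshot")
        self.images[new_image] = (image, snap)

    def flatten(self, image):
        # On a real cluster this copies the parent data: IO-heavy and slow.
        self.images[image] = None

    def remove_snapshot(self, image, snap):
        if (image, snap) in self.images.values():
            raise RbdError("snapshot still has clones")
        self.snaps.remove((image, snap))

    def remove(self, image):
        if any(img == image for img, _ in self.snaps):
            raise RbdError("image still has snapshots")
        del self.images[image]


def try_op(op, *args):
    """Run an operation and report whether the model allowed it."""
    try:
        op(*args)
        return "ok"
    except RbdError as exc:
        return "refused: %s" % exc


pool = ToyPool()
pool.create("A")
pool.snapshot("A", "S")
pool.clone("A", "S", "B")  # what the XAPI-style revert effectively forces

results = [
    try_op(pool.remove, "A"),                # A still has snapshot S
    try_op(pool.remove_snapshot, "A", "S"),  # S still has clone B
]
pool.flatten("B")                            # the expensive step
results += [
    try_op(pool.remove_snapshot, "A", "S"),
    try_op(pool.remove, "A"),
]
print(results)
```

In this model, neither A nor S can go away until B has been flattened, which is the cost the A -> S -> B chain imposes.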

      As a nasty workaround, I could hide the VDI for A. When the user later decides to delete S, I would hide S as well and schedule a flatten of B as a background GC cleanup job (I still need to investigate my options here). Once the flatten finishes, the job would wipe S and subsequently A (if S was A's last remaining snapshot).
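A rough sketch of how such a deferred cleanup pass might look, using a toy in-memory model. Everything here is an assumption made for illustration: `Node`, `gc_pass`, and the `flatten` callback are hypothetical and are not taken from the CephRBDSR driver or the SM framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    kind: str                     # "image" or "snapshot"
    parent: Optional[str] = None  # image: snapshot it was cloned from; snapshot: its image
    hidden: bool = False          # hidden from the user, waiting for cleanup


def gc_pass(nodes, flatten):
    """One cleanup pass: flatten clones of hidden snapshots, then delete what is free."""
    removed = []
    # 1. Detach every visible clone that still depends on a hidden snapshot.
    for node in nodes.values():
        if (node.kind == "image" and not node.hidden
                and node.parent and nodes[node.parent].hidden):
            flatten(node.name)    # the slow, IO-heavy step
            node.parent = None
    # 2. Delete hidden snapshots that no clone depends on any more.
    for name in [n.name for n in nodes.values() if n.kind == "snapshot" and n.hidden]:
        if not any(c.kind == "image" and c.parent == name for c in nodes.values()):
            removed.append(name)
            del nodes[name]
    # 3. Delete hidden images that have no snapshots left.
    for name in [n.name for n in nodes.values() if n.kind == "image" and n.hidden]:
        if not any(s.kind == "snapshot" and s.parent == name for s in nodes.values()):
            removed.append(name)
            del nodes[name]
    return removed


nodes = {
    "A": Node("A", "image", hidden=True),   # old VDI, hidden after the revert
    "S": Node("S", "snapshot", parent="A"),
    "B": Node("B", "image", parent="S"),    # clone the VM now uses
}
flattened = []
first = gc_pass(nodes, flattened.append)    # S is still visible: nothing to do
nodes["S"].hidden = True                    # the user "deletes" S
second = gc_pass(nodes, flattened.append)
print(first, flattened, second, sorted(nodes))
```

The first pass does nothing because S is still visible. After S is hidden, the second pass flattens B and then removes S and A in that order.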

      That would work, but it would still be an awfully inefficient software emulation of CoW, completely ignoring that we can get real CoW from Ceph that is actually efficient and IO-cheap (because it happens at the storage-array level).

      Now I perfectly understand why nobody ever managed to deliver native RBD support to Xen: the XAPI design makes it nearly impossible. No wonder we ended up with weird (and also inefficient) hacks like an LVM pool on top of a single RBD, or CephFS.

      • olivierlambert (Vates 🪐 Co-Founder CEO)

        Food for thought for @Team-Storage and @Team-XAPI-Network

        • benapetr

          Meanwhile, until I finish the RBD-native driver (which is probably going to take much longer than I anticipated and is probably never going to have an "ideal snapshot rollback mechanism"), I also decided to create another driver called LVMoRBD. It is essentially the same as LVMoHBA: it builds an LVM SR on top of an RBD block device and is meant to work without complex hacks - https://github.com/benapetr/CephRBDSR/commit/fbde8b49b180d4de60ffea477dffe712f07a4d07

          It's really a rather trivial wrapper. Its main benefits, as described in the commit message, are the ability to auto-mount the RBD image on reboot and to use its own LVM config that enables RBD devices (so there is no need to modify the default config shipped with XCP-ng), plus some other things that make creating an LVM SR on top of RBD far easier and more natural.

          Again, it's not production-ready yet. I will be doing many tests on it and probably still need to fix some SCSI-related false-positive errors that sometimes appear in the SM log. I estimate it will be ready much sooner than the RBD-native driver.

          LVM on top of RBD is not ideal, as there is some tiny overhead, but it is still probably better than CephFS when it comes to raw performance (an RBD device really is something like a LUN from an HBA).

          And yes, I do plan to create SMAPIv3 equivalents later, when I learn how 😉 For now I am still targeting XCP-ng 8.2, as that's what I use in production, and I haven't seen many SMAPIv3 drivers there.

          benapetr committed to benapetr/CephRBDSR
          new driver LVMoRBD
          
          This is a simple wrapper for LVM on RBD which is something I have
          experience with from the past - it's actually possible to do this using
          native LVMSR driver with few hacks - such as modifying the
          master/lvm.conf to work with RBD devices
          
          This SR driver is meant to streamline the access without any need for
          custom hacks.
          
          Its main benefits over using LVMSR:
          
          * It automatically maps and unmaps rbd devices for you on reboot
          * It uses some wrappers to suppress SCSI meta-calls that cause
            unnecessary false-positive warnings in SM log (rbd devices don't have
            SCSI ID)
          * It automatically creates RBD device for you with compatibility
            settings known to work fine with XCP-ng dom0 kernel
          * It uses its custom LVM config file, so you don't need to expose RBD
            LVM structure to hypervisor, or override the default config file

          • psafont (Vates 🪐 XAPI & Network Team) @benapetr

            @benapetr This is driven by hacky logic from 16 years ago:

            • on revert, unserialize the previous state, and update the VM record with its saved values. As we do not want to modify that each time we add a field in the datamodel, use some low-level database functions to iterate over the fields of a record. Not very nice as it makes some assumptions on the database layer, but seems to work allright and I don't think that database layer will change a lot in the future.

            I think it might be a good idea to add a revert RPC call to the storage interface that xapi can call, with a fallback to the current logic if necessary; xapi should be able to clean up the database afterwards. I'll ask other maintainers about this or possible alternatives, but since SMAPIv1 is considered deprecated, I doubt it will happen.
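The "revert call with a fallback" idea can be sketched as follows. This is only an illustration under stated assumptions: the class and function names (`LegacyBackend`, `NativeBackend`, `revert_disk`, `RevertNotSupported`) are hypothetical, xapi itself is written in OCaml, and no such call exists in the storage interface today.

```python
class RevertNotSupported(Exception):
    """Raised by a backend that has no native revert operation."""


class LegacyBackend:
    """Hypothetical backend without native revert; records what it is asked to do."""

    def __init__(self):
        self.log = []

    def revert(self, vdi, snapshot):
        raise RevertNotSupported()

    def destroy(self, vdi):
        self.log.append(("destroy", vdi))

    def clone(self, snapshot):
        self.log.append(("clone", snapshot))
        return snapshot + "-clone"


class NativeBackend(LegacyBackend):
    """Hypothetical backend that can roll a VDI back in place (as RBD could)."""

    def revert(self, vdi, snapshot):
        self.log.append(("revert", vdi, snapshot))


def revert_disk(backend, vm, device, snapshot):
    """Prefer the backend's own revert; fall back to destroy + clone."""
    vdi = vm[device]
    try:
        backend.revert(vdi, snapshot)
        return  # the VM keeps pointing at the same VDI
    except RevertNotSupported:
        pass
    # Fallback, mirroring the current behaviour described in this thread:
    # delete the VDI, clone the snapshot, repoint the VM at the clone.
    backend.destroy(vdi)
    vm[device] = backend.clone(snapshot)


legacy, native = LegacyBackend(), NativeBackend()
vm_legacy, vm_native = {"xvda": "A"}, {"xvda": "A"}
revert_disk(legacy, vm_legacy, "xvda", "S")
revert_disk(native, vm_native, "xvda", "S")
print(legacy.log, vm_legacy)
print(native.log, vm_native)
```

With the native path the VM record does not change at all, so no clone depends on the snapshot afterwards.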

            I have to say that SMAPIv3 was finally fixed upstream in June by XenServer (migrations were finally done!), and XCP-ng should get the update that fixes it in the coming weeks. Given this, I would encourage you to take everything you've learned while writing the driver and port it to SMAPIv3. SMAPIv1 simply has too many problems, some of them architectural, so in general the XenServer and XCP-ng maintainers would like to see it finally go away.

            @benapetr wrote: "for now I am still targeting XCP-ng 8.2 as that's what I use in production, and I haven't seen many SMAPIv3 drivers there."

            8.2 is out of support for XenServer, and yesterday was its last supported day for XCP-ng. You really should update 😛

            • benapetr @psafont

              @psafont thanks for the reply, but isn't that 16-year-old logic part of XAPI? I mean, this same hacky logic is present in SMAPIv3, isn't it?

              I was going through the SMAPIv3 docs, and from the SM driver perspective (feature-wise) it doesn't seem much different. It looks to me like many cosmetic changes that make packaging and modularization easier (definitely a good thing) but don't change any fundamental SM logic: the RPCs are all the same as in SMAPIv1. Even porting my own driver is probably going to be pretty trivial; it's just a matter of splitting it into multiple files and adding some wrappers around it. But that still won't solve my problem: the rollback RPC is just not there, so I would instead need to support the "rollback by making another snapshot of a snapshot" logic enforced by XAPI.

              • psafont (Vates 🪐 XAPI & Network Team) @benapetr

                @benapetr You're right. Unfortunately, there's no VDI revert operation that allows the revert to happen. This is shown in the documentation: https://xapi-project.github.io/new-docs/toolstack/features/snapshots/index.html (see the revert section)

                There's an old proposal to add this: https://xapi-project.github.io/new-docs/design/snapshot-revert/index.html

                But the effort fizzled out because imports currently do not set snapshot_of correctly, and the operation needs to work even when the field is not set correctly, as is the case now (falling back to the current code seems sensible): https://github.com/xapi-project/xen-api/pull/2058

                This needs some effort to get fixed. I'll set up some ticketing so it can be prioritized accordingly.

                djs55 opened this pull request in xapi-project/xen-api

                closed VDI.revert pull request + extra bits #2058

                • Maelstrom96 @psafont

                  This post is deleted!

                  • olivierlambert (Vates 🪐 Co-Founder CEO)

                    The author of the main recent effort is basically the person who posted just before you 😉

                    • Maelstrom96 @olivierlambert

                      @olivierlambert you replied just after I noticed that and deleted my message... 🙃

                      • JamesG @psafont

                        Nothing fruitful to add....

                        But...

                        Oooof....

                        This will be somewhat messy to clean up. I'm rooting for you guys though!!

                        • olivierlambert (Vates 🪐 Co-Founder CEO)

                          There's some nice progress on @psafont's work regarding improved revert. I'm confident we'll get there 🙂
