XOSTOR hyperconvergence preview

olivierlambert

That's a great test indeed. I have to say I'm impressed. Maybe it's because I'm so used to the corner cases I tested for months, which triggered various issues, but every time @ronan-a came up with a solution. Kudos to him!

Maelstrom96

@olivierlambert This looks very promising. We're currently running K8s on top of XCP-ng hosts and deploying everything through XOA with Terraform adapters. It's been working well for us, but we're not using a shared SR, which we're looking into deploying. The nice thing is that it looks like we could actually use LINSTOR directly from K8s, removing two storage layers completely (OpenEBS + soft RAID 5 local SR) and making the whole thing work even better for both XCP-ng and K8s.

I have a question before trying to deploy this - how would we go about changing the SR adapter in cases we need to add, remove or replace a XCP-ng host? Should we be able to change the SR configuration while it's active?

olivierlambert

Likely a question for @ronan-a

edit: however, I'd love to have a chat with you to discuss your existing k8s workflow with XCP-ng/XOA!

ronan-a

@maelstrom96

I have a question before trying to deploy this - how would we go about changing the SR adapter in cases we need to add, remove or replace a XCP-ng host? Should we be able to change the SR configuration while it's active?

Well, a LINSTOR SR can be updated when hosts are added or removed. For the moment we don't have a script to simplify this, but you can do it with a few linstor and SMAPI commands.

Maelstrom96

@olivierlambert Feel free to email me at alexandre@floatplane.com.

@ronan-a I'll be deploying a test cluster this week and see if I can figure out the proper commands to perform those actions. Regarding linstor GUI, it seems like it's only supported on a controller, would that mean that I should install it on the cluster master DOM0?

olivierlambert

Duly noted, thanks.

Maelstrom96

@ronan-a I'm really not sure what I'm doing wrong, but I can't seem to be able to make it work at all :

[20:27 xostor1 log]# rpm -qa | grep -E "^(sm|xha)-.*linstor.*"
sm-2.30.4-1.1.0.linstor.8.xcpng8.2.x86_64
xha-10.1.0-2.2.0.linstor.1.xcpng8.2.x86_64
[20:27 xostor1 log]# lsblk
NAME                              MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdb                                 8:16   0   200G  0 disk
├─linstor_group-thin_device_tdata 252:1    0 399.8G  0 lvm
│ └─linstor_group-thin_device     252:2    0 399.8G  0 lvm
└─linstor_group-thin_device_tmeta 252:0    0   100M  0 lvm
  └─linstor_group-thin_device     252:2    0 399.8G  0 lvm
sr0                                11:0    1  14.5M  0 rom
sdc                                 8:32   0   200G  0 disk
└─linstor_group-thin_device_tdata 252:1    0 399.8G  0 lvm
  └─linstor_group-thin_device     252:2    0 399.8G  0 lvm
sda                                 8:0    0   100G  0 disk
└─md127                             9:127  0   100G  0 raid1
  ├─md127p5                       259:3    0     4G  0 md    /var/log
  ├─md127p3                       259:2    0   512M  0 md
  ├─md127p1                       259:0    0    18G  0 md    /
  ├─md127p6                       259:4    0     1G  0 md    [SWAP]
  └─md127p2                       259:1    0    18G  0 md

[20:27 xostor1 log]# xe sr-create type=linstor name-label=main-r3 host-uuid=8832105e-d307-45de-bcc3-6d61bb299dd4 device-config:hosts=xostor1,xostor2,xostor3 device-config:group-name=linstor_group/thin_device device-config:redundancy=1 shared=true device-config:provisioning=thin
Error code: SR_BACKEND_FAILURE_5006
Error parameters: , LINSTOR SR creation error [opterr=Not enough online hosts],
[20:28 xostor1 log]# ping xostor1
PING xostor1.floatplane.com (10.5.0.11) 56(84) bytes of data.
64 bytes from xostor1.floatplane.com (10.5.0.11): icmp_seq=1 ttl=64 time=0.048 ms
64 bytes from xostor1.floatplane.com (10.5.0.11): icmp_seq=2 ttl=64 time=0.054 ms
--- xostor1.floatplane.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.048/0.051/0.054/0.003 ms
[20:28 xostor1 log]# ping xostor2
PING xostor2.floatplane.com (10.5.0.12) 56(84) bytes of data.
64 bytes from xostor2.floatplane.com (10.5.0.12): icmp_seq=1 ttl=64 time=2.48 ms
64 bytes from xostor2.floatplane.com (10.5.0.12): icmp_seq=2 ttl=64 time=1.86 ms
--- xostor2.floatplane.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 1.865/2.177/2.489/0.312 ms
[20:28 xostor1 log]# ping xostor3
PING xostor3.floatplane.com (10.5.0.13) 56(84) bytes of data.
64 bytes from xostor3.floatplane.com (10.5.0.13): icmp_seq=1 ttl=64 time=2.55 ms
64 bytes from xostor3.floatplane.com (10.5.0.13): icmp_seq=2 ttl=64 time=1.25 ms
--- xostor3.floatplane.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 1.256/1.904/2.553/0.649 ms
[20:28 xostor1 log]# linstor resource list
Error: Unable to connect to linstor://localhost:3370: [Errno 99] Cannot assign requested address

I ran ./install --disks /dev/sdb /dev/sdc --thin on every host, starting with the master, and when I then tried to run the SR create command, it failed with the error you can see in the output above. Your input would be greatly appreciated. I'll try other things to see if I can figure it out in the meantime.
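The last command in the transcript shows the linstor client failing to reach a controller on localhost:3370. One quick way to see whether any host in the pool accepts connections on that port is a probe like the sketch below. The function name is made up for illustration. It relies on bash's /dev/tcp redirection and the coreutils timeout command, and an open port only suggests, without proving, that a LINSTOR controller is the listener.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first host that accepts TCP connections on PORT.
# Usage: find_linstor_controller PORT HOST [HOST...]
find_linstor_controller() {
    local port=$1 host
    shift
    for host in "$@"; do
        # /dev/tcp/HOST/PORT is a bash pseudo-device that opens a TCP connection;
        # timeout bounds the wait so an unreachable host does not block the loop.
        if timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
            echo "$host"
            return 0
        fi
    done
    echo "no host is listening on port $port" >&2
    return 1
}
```

For example, `find_linstor_controller 3370 xostor1 xostor2 xostor3`. If a host is reported, the client can be pointed at it with its `--controllers` option, e.g. `linstor --controllers=xostor2 resource list`.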

Edit: Here are some logs from /var/log/xensource.log

Feb  1 20:37:23 xostor1 xapi: [ info||3640 /var/lib/xcp/xapi||cli] xe sr-create type=linstor name-label=MAIN3 host-uuid=8832105e-d307-45de-bcc3-6d61bb299dd4 device-config:hosts=xostor1,floatplane.com,xostor2.floatplane.com,xostor3.floatplane.com device-config:group-name=linstor_group/thin_device device-config:redundancy=3 shared=true device-config:provisioning=thin username=root password=(omitted)
Feb  1 20:37:23 xostor1 xapi: [ info||3640 /var/lib/xcp/xapi|session.login_with_password D:c5de2bbfe3f2|xapi_session] Session.create trackid=5d48b0e671ba2d39d6df368ab040b146 pool=false uname=root originator=cli is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Feb  1 20:37:23 xostor1 xapi: [debug||3641 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:pool.get_all D:d9cfba7ee497 created by task D:c5de2bbfe3f2
Feb  1 20:37:23 xostor1 xapi: [debug||3640 /var/lib/xcp/xapi|SR.create R:cdb8dee1e91a|audit] SR.create: name label = 'MAIN3'
Feb  1 20:37:23 xostor1 xapi: [debug||3640 /var/lib/xcp/xapi|SR.create R:cdb8dee1e91a|xapi_sr] SR.create name_label=MAIN3 sm_config=[  ]
Feb  1 20:37:23 xostor1 xapi: [debug||3640 /var/lib/xcp/xapi|SR.create R:cdb8dee1e91a|mux] register SR 92f4aa20-ff55-65d9-e343-84f3f7beb552 (currently-registered = [ 049abfb4-c910-75b7-02e8-902dd68799d2, 92f4aa20-ff55-65d9-e343-84f3f7beb552, 9d382464-3747-43cd-ddee-8ea1a8e5f71a, 4a4483a7-d868-5e82-fac6-47789633a691, 46ca2df2-2d83-ea0d-d7b1-7e2ee6aee261, f26a3b3e-a1e7-aad6-690f-6ab15b8713b7, 3513da18-66e7-9f77-bfde-b9ca51473a63, f62fe1f8-3fcf-8b9f-b1c9-c4ea4ad692c9, 94483019-536a-4e10-5429-b0939499637f, 466c8387-99e3-bf72-6f0b-69a3bd4eb4a9, 95672541-2c5e-30c9-769a-1f2ccdb9390d ])
Feb  1 20:37:23 xostor1 xapi: [debug||3648 ||dummytaskhelper] task SR.create D:0d99844a7015 created by task R:cdb8dee1e91a
Feb  1 20:37:23 xostor1 xapi: [debug||3648 ||sm] SM linstor sr_create sr=OpaqueRef:c02a4a76-327c-45f9-b7aa-322ddd367eeb size=0
Feb  1 20:37:23 xostor1 xapi: [ info||3648 |sm_exec D:87fb98701fa6|xapi_session] Session.create trackid=b1cc8b6169d2afa2c0406ef60833efb5 pool=false uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Feb  1 20:37:23 xostor1 xapi: [debug||3649 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:pool.get_all D:17ddb8970480 created by task D:87fb98701fa6
Feb  1 20:37:23 xostor1 xapi: [ warn||3639 HTTPS 10.2.0.5->:::80|event.from D:2d93a68de684|xapi_message] get_since_for_events: no in_memory_cache!
Feb  1 20:37:23 xostor1 xapi: [debug||3650 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.logout D:fe0e500419b8 created by task D:b770fe324145
Feb  1 20:37:23 xostor1 xapi: [ info||3650 /var/lib/xcp/xapi|session.logout D:379f9df26b48|xapi_session] Session.destroy trackid=e1378d042bc7e7f579b63e67ba555677
Feb  1 20:37:23 xostor1 xapi: [debug||227 ||xenops] Event on VM 5378d6b6-e759-44ca-8cde-9ae83151dc60; resident_here = true
Feb  1 20:37:23 xostor1 xapi: [debug||3651 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.slave_login D:5ce5e24c056b created by task D:b770fe324145
Feb  1 20:37:23 xostor1 xapi: [ info||3651 /var/lib/xcp/xapi|session.slave_login D:82827ae03cfd|xapi_session] Session.create trackid=99296cff5c22655aa4cf7b15a343cabc pool=true uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Feb  1 20:37:23 xostor1 xapi: [debug||227 ||dummytaskhelper] task timeboxed_rpc D:b0c8d8711947 created by task D:5fd006508046
Feb  1 20:37:23 xostor1 xapi: [debug||3652 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:pool.get_all D:268f2a0db4e5 created by task D:82827ae03cfd
Feb  1 20:37:23 xostor1 xapi: [debug||3653 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:event.from D:53729b99cb8e created by task D:5fd006508046
Feb  1 20:37:23 xostor1 xapi: [debug||3654 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:event.from D:92735847af98 created by task D:b770fe324145
Feb  1 20:37:23 xostor1 xapi: [ warn||3656 HTTPS 10.2.0.5->:::80|event.from D:8c1fe22ca8b2|xapi_message] get_since_for_events: no in_memory_cache!
Feb  1 20:37:23 xostor1 xapi: [debug||3657 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:host.get_other_config D:b9a7457e385e created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3658 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:SR.get_sm_config D:fc7430bb3d2f created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3659 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:SR.get_all_records_where D:5422781d1f22 created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3660 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:host.get_all_records D:7e1925a42dd6 created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3661 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:host_metrics.get_record D:43e422635fef created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3662 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:host_metrics.get_record D:cd6eb6efca81 created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3663 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:host_metrics.get_record D:72bd45336bdd created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [ info||3648 |sm_exec D:87fb98701fa6|xapi_session] Session.destroy trackid=b1cc8b6169d2afa2c0406ef60833efb5
Feb  1 20:37:23 xostor1 xapi: [error||3648 ||backtrace] sm_exec D:87fb98701fa6 failed with exception Storage_error ([S(Backend_error);[S(SR_BACKEND_FAILURE_5006);[S();S(LINSTOR SR creation error [opterr=Not enough online hosts]);S()]]])
Feb  1 20:37:23 xostor1 xapi: [error||3648 ||backtrace] Raised Storage_error ([S(Backend_error);[S(SR_BACKEND_FAILURE_5006);[S();S(LINSTOR SR creation error [opterr=Not enough online hosts]);S()]]])
Feb  1 20:37:23 xostor1 xapi: [error||3648 ||backtrace] 1/8 xapi Raised at file ocaml/xapi/sm_exec.ml, line 377

Maelstrom96

After reading the sm LinstorSR file, I figured out that the host names need to exactly match the host names in the XCP-ng pool. I thought I had tried that and that it failed the same way, but after retrying with all valid hosts, it set up the SR correctly.
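A small pre-flight check can catch this kind of mismatch before calling `xe sr-create`. The sketch below compares a comma-separated `device-config:hosts` value against a comma-separated list of pool host names. The function name is hypothetical, and which host field the driver actually compares against (`hostname` or `name-label`) is not stated in this thread, so the field used in the usage comment is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every name in REQUESTED that is absent from POOL.
# Both arguments are comma-separated lists. Returns 0 if all names are present.
check_hosts() {
    local requested=$1 pool=$2 name missing=0
    local -a req_names
    IFS=',' read -r -a req_names <<< "$requested"
    for name in "${req_names[@]}"; do
        # Wrap both sides in commas so only whole names match.
        case ",$pool," in
            *",$name,"*) ;;
            *) echo "not a pool host: $name"; missing=1 ;;
        esac
    done
    return "$missing"
}

# Usage on a pool member (the field choice is an assumption):
#   check_hosts "xostor1,xostor2,xostor3" "$(xe host-list params=hostname --minimal)"
```

Run against the hosts value that appears in the xensource.log excerpt above (`xostor1,floatplane.com,...`), a check like this would flag `floatplane.com` as an unknown host.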

Something I've also noticed in the code is that there doesn't seem to be a way to deploy a secondary SR connected to the same LINSTOR controller with a different replication factor. For some VMs that have built-in software replication/HA, like DBs, it might be preferable to have replication=1 set for the VDI.

ronan-a

@maelstrom96 Hello,

Something I've also noticed in the code is that there doesn't seem to be a way to deploy a secondary SR connected to the same LINSTOR controller with a different replication factor.

For the moment, yes: you can only use one LinstorSR in a pool. Ideally we would like to modify the driver to support several SRs, perhaps during a rewrite of the driver for the latest version of the SMAPI.

For some VMs that have built-in software replication/HA, like DBs, it might be preferable to have replication=1 set for the VDI.

We could allow this behavior without requiring other SRs. It would be enough to pass a replication parameter for this particular VDI when it is created. So thank you for this feedback. I think we must implement this use case in the future.

abufrejoval

Red Hat has EOL'ed the oVirt downstream projects RHGS (Gluster storage) and RHV (virtualization orchestration), so I am looking for a new home.

So this first post is also about the philosophical differences, which you may find interesting.

What I found attractive about Gluster is that it's a file layered abstraction, that doesn't even implement a file system, but just uses the one below (e.g. ext4 or xfs) with a very smart overlay. It supports replicated or erasure coded dispersed storage at file level, but adds a chunking layer in case your files are in fact machines or databases and too big to ensure fair load distribution without. Unfortunately Gluster and oVirt were never properly aligned and the flexibility of Gluster never quite carried over into oVirt's HCI templates (e.g. dispersed volumes with dynamic growth in the number of bricks).

It was also decoupled in the sense, that not every HCI node needs to be the same size or even contribute all parts: should you have nodes with lots of storage but little compute and vice versa, you can have them contribute only those attractive parts. In my lab made of left-overs and some hot-shot elements, that was a good match.

LINSTOR is block-based and feels much more like an HCI-SAN. With XCP-ng, storage needs to "look" local, so either it really is, or you contribute at least parts. Actually, since the current setup seems to be full replica-only, you do always have a full copy of all blocks. I hope the big advantage vs. Gluster will be bullet-proof simplicity, perhaps even performance: Gluster had me biting my nails far too often until healing had completed without issues.

Full replica mode obviously has a huge impact on performance. Since I am just testing currently, I am using a setup based fully on nested virtualization. So hosts are in fact VMs run on VMware workstation on a single machine with lots of RAM and NVMe storage.

Setup is unbelievably quick (compared to an oVirt 3-node HCI install) and I quite enjoyed playing around with both the provided and the self-compiled orchestrator: I quite like that XOA is so fully stateless that I can have one run as a VM inside and another, say, as a VM on a laptop's VirtualBox in case the former has gone down. An oVirt/RHV management engine that fails to start on an oVirt cluster is very scary, let me tell you! It's typically there and then that you notice that your Postgres management database backup is more than a little stale.

I started with 5 nodes, wanting to go with the smallest dispersed configuration, but only replica seems supported for now. When VMs failed to start after having been moved to XOSAN, I got a little scared until I understood that the fifth node happened to be the one the VM was to be started on, and that didn't participate in XOSAN and therefore couldn't run the VM...

After its removal things behaved as expected: Seamlessly!

Performance is as you'd expect from the 4x replica: Write-performance on XOSAN is 25% of read performance, while with local storage the two are very much the same.

I very much dislike VDI being used to mean disks (far too many other usages), but I like the ease of moving them between storage repositories. Moving disks between storage domains or between different farms in oVirt was a major nightmare with plenty of bugs.

The Windows host I am using to run the "virtual bare metal" was actually running out of space while I was migrating a disk from local storage to XOSAN (meaning 5x storage required temporarily), which meant all hosts got suspended on write failures and I couldn't even log into the machine to create some room.

That is not quite the complete power loss on a data centre but still something that should never happen.

I got the super-host machine to restart via the power button (shutdown, not power cycle or reset) and fully expected some serious damage even if only the xoa VM was actually running...

But no, no damage was done. I could restart all nodes and the xoa VM. While the XOSAN image copy was incomplete, the original local storage disk was still intact, and after I had created the proper space the operation could just be repeated.
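The "5x storage" figure mentioned above follows from simple arithmetic: during the copy, the original local disk still exists alongside one full copy per replica, and in a nested lab every one of those copies lands on the same physical drive. A minimal sketch of that estimate follows. The function name and the 50 GiB disk size are made up for illustration, and it assumes every copy is fully allocated.

```shell
#!/usr/bin/env bash
# Hypothetical helper: space (GiB) consumed on the shared backing drive while a
# disk of DISK_GIB is copied from local storage to an SR keeping REPLICAS copies.
migration_space_gib() {
    local disk_gib=$1 replicas=$2
    # One source copy plus one copy per replica.
    echo $(( disk_gib * (1 + replicas) ))
}

migration_space_gib 50 4   # prints 250, i.e. 5x the disk size
```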

Such failure testing is a lot more fun using nested virtualization with consistent snapshots to recover from and that's what I'll be playing with a little more to gain confidence in the whole thing: so far it's looking very good!

Please let me know ASAP, when a dispersed option is available so I can start testing the really interesting variants!

And perhaps you could add a VDO (deduplication/LZ4 compression) option to the nodes? I used the full set of options on oVirt, VDO and thin allocation to ensure I got every last bit of space out of those pricey NVMe SSDs...

olivierlambert

Hi @abufrejoval

I truly loved the Gluster approach we used with XOSAN (one big filesystem, simple, robust), but sadly it's pretty slow in 4k random read/write.

That's why we decided to keep the Gluster driver for people who wanted to keep using it, while focusing on block replication for XOSTOR, based on LINSTOR.

However, in the future, with the new storage stack (SMAPIv3), we might decide to re-bench all of this and provide another, Gluster-based alternative.

abufrejoval

@olivierlambert

Salut Olivier,

Yes, I saw your Gluster support, but it seems that Red Hat is ready to let it die. I believe it had a major brain drain some time ago and has suffered from a lack of adoption and evolution since.

And quite honestly, when something isn't quite right, it can be nerve-racking and very hard, if not impossible, to fix. In the most typical HCI scenario with 2 replicas and 1 arbiter, I found myself recreating the replica and arbiter (which can't be done in a single operation either) too often, when perhaps only a couple of blocks might have been really bad.

In short, the current state of Gluster isn't quite good enough, and with Red Hat discontinuing the commercial product, it's hard to believe it will ever get there.

But VDO (hint, hint!) is still there.

olivierlambert

Yeah, I did some tests on VDO, but I have to admit I only used it for compression, I think. What's the state of it from your perspective?

abufrejoval

@olivierlambert

Well, at least it doesn't seem to be one of Red Hat's acquisitions that's EOL yet.

I've used it, perhaps without thinking it through all too well, on all my oVirt setups, mostly "because it's an option you can tick". It was only afterwards that I read that it shouldn't be used in certain use cases, but Red Hat is far from consistent in its documentation.

I've searched for studies/benchmarks/recommendations and came up short, apart from a few vendor-sponsored ones.

I support a team of ML researchers and they have massive amounts of highly compressible data sets, which they then compressed and de-compressed manually, often enough with both of them lying around afterwards.

So there my main intent was to just let them store the stuff how it's easiest to use, as plain visible data, and not worry about storage efficiency. The LZ4 code and the bit manipulation support in today's CPUs seem to work faster than any NVMe storage, and they use it mostly in large swaths of sequential pipes. In ML even GPU memory is far too slow for random access, so I'm not concerned about storage IOPS.

Its actual use case seems to have come from VDI (that's virtual desktop infrastructure!), or Citrix's old battleground, where tons of virtual desktop images might fill and bottleneck the best of SANs in a morning's boot storm.

Again, I like its smart approach, which isn't about trying to guarantee the complete elimination of all duplicate blocks. Instead it will eliminate the duplicates that it can find in its immediate reach within a fixed amount of time and effort, by doing a compression/checksum run on any block that is being evicted from cache to see if it's a duplicate already: compression pays for the effort and the hash delivers the dedup potential on top! Just cool!

So if you have 100 CentOS instances in your farm, there is no guarantee it will avoid you having duplicates of all code or indeed even eliminating a single one, because they might never be in the same cache on the same node or the same offset as they are being written (no lazy duplicate elimination going on the background).

And then it just very much depends on your use case whether there will actually be any benefit, or if it's just needlessly spent CPU cycles and RAM.
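Whether a given data set has any dedup potential at all can be estimated offline before enabling anything. The sketch below cuts a file into fixed-size blocks, hashes each one, and reports how many blocks are distinct. This is a plain whole-file scan, not VDO's inline, cache-bounded approach described above, so it gives an upper bound rather than a prediction. The function name is hypothetical and the 4096-byte default block size is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<total blocks> <unique blocks>" for FILE,
# cut into blocks of BLOCK_SIZE bytes (default 4096).
count_blocks() {
    local file=$1 block_size=${2:-4096} tmp total unique
    tmp=$(mktemp -d) || return 1
    # split writes one file per block into the temporary directory.
    split -b "$block_size" -a 6 "$file" "$tmp/blk."
    total=$(find "$tmp" -type f -name 'blk.*' | wc -l)
    # Hash every block and count the distinct digests.
    unique=$(find "$tmp" -type f -name 'blk.*' -exec sha256sum {} + \
        | awk '{print $1}' | sort -u | wc -l)
    rm -rf "$tmp"
    echo "$((total)) $((unique))"
}
```

A large gap between the two numbers suggests deduplication could pay off; nearly equal numbers suggest it would mostly cost CPU cycles and RAM. Because it writes one temporary file per block, this is only practical on modest sample files.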

Operationally it's wonderfully transparent on CentOS/RHEL and even fun to set up with Cockpit (not so much manually). When VDO volumes are consistent, they are also recoverable even without any external metadata e.g. from another machine, which is a real treat. I don't think that they record their hashes as part of the blocks they write, which could be great to deliver some ZFS like integrity checks.

But it takes just a single corrupted block to make everything on top unusable, which I've seen happen when a defective onboard RAID controller got swapped along with the motherboard and data had only been committed to the BBU-backed cache.

That's where Gluster helped out, and why I think the two might compensate a bit for each other: VDO offsets the write amplification of Gluster, and Gluster offsets the higher corruption risk of compressed data and opaque data structures.

And with Gluster underneath, it never felt like the bottleneck was in VDO.

abufrejoval

@olivierlambert

So my first XOASAN tests using nested virtual hosts were rather promising so I'd like to move to physical machines.

Adding extra disks on virtual hosts is obviously easy, but the NUCs I'm using for that only have one single NVMe drive each without any option to hide the majority of the space e.g. via a RAID controller. The installer just grabs all of that for the default local storage and I have nothing left for XOASAN.

It does mention an "advanced installation option" somewhere in those dialogs, but that never appears.

Any recommendation on how to either keep the installer from grabbing everything or shrink the local storage after a fresh install?

olivierlambert

Hi @abufrejoval

Just having a doubt: it's now called XOSTOR, so you tested XOSTOR right?

Regarding your installer question: you can uncheck the disk for VM storage, and the installer will then leave free space on the disk after the installation.

abufrejoval

@olivierlambert

Sorry, XOSTOR of course.

And why did I think I had already tried that?

Must be late...

Merci!

olivierlambert

You are welcome, and thanks a lot for your tests! It's very important for us to have external users playing with it.

dumarjo

Hi,

I have been able to create a small lab with 2 XCP-ng hosts and would like to use XOSTOR. First I set up one XCP-ng box (xcp-ng-01) and created a new XOSTOR SR with the commands above. Everything is working: I have a new SR with only 1 host.

Now, if I want to add a new host to the SR, how can I do this? I would like to simulate adding a new host/disk to the SR. Is it possible?

abufrejoval

@olivierlambert

Hi, the 8.2.1 test five image wouldn't let me install XOSTOR yet (and it wasn't recommended): so is this release variant again compatible with XOSTOR?