XOSTOR hyperconvergence preview

JeffBerntsen

@olivierlambert
After some more playing, I'm beginning to see some issues.

One is that, I think, running HA on top of it causes some issues. After enabling HA with XOSTOR as the state/heartbeat SR, things appeared to work, but all three servers then crashed in quick succession after running successfully for several hours. They have run without problems for a day or two since then. While HA is enabled and using XOSTOR, the logs fill up with DRBD state change messages for xcp-persistent-ha-statefile.

If you're interested, I can gather up logs covering the day of the crashes or any other information you might want. Just let me know what you'd want for that and I'll be happy to collect it for you.

I've since disabled HA again and expect the servers will be as stable as they were before I enabled it (very stable from what I've seen so far except for the experiment with HA).

I suspect that HA will probably also work just fine as long as XOSTOR is not used for the heartbeat/metadata SR for it. This pool is also set up with a shared NFS v4 SR and I could experiment with using HA with that as the heartbeat SR but still using XOSTOR to house the VMs.

olivierlambert

Hmm, weird. I suppose it would be great to have more details for @ronan-a

In theory, we should have HA working with LINSTOR.

JeffBerntsen

@olivierlambert
No problem. Just let me know or have him let me know what he needs for information and I'll try to get it to you folks somehow.

The actual crashes seem to be related to fencing of machines and recovery after I intentionally forced an outage on one of the test servers. My test was making sure that the LINSTOR controller service was running on the server acting as the pool master and hosting a couple of the several test VMs with HA enabled. Then I forced a failure by pulling its power cord.

The servers recovered on their own from that with one of the others in the pool taking over as pool master and that one and the other trying to restart failed VMs, mostly successfully. The unsuccessful startups were either due to a lack of RAM or due to not being able to find the VDI for the VM. The latter problem went away on its own and I suspect that was due to HA not waiting long enough for LINSTOR to straighten out the storage situation before trying to restart the VM.

After that I restarted the "failed" server and it came up, rejoined the pool, and I was able to start VMs on it. As far as I can see, it looks like the servers crashed and mostly recovered on their own shortly after that.

That was three days ago and the pool has been up and running since then without problems with HA enabled. I've since disabled it after looking at the logs associated with the crash and seeing that having HA running with its heartbeat storage on top of LINSTOR was causing the logs to fill up at a rate of several lines per second.

olivierlambert

@ronan-a will come Monday and ask you questions probably

olivierlambert

Or Wednesday in fact, but he'll answer very soon

elialum

@olivierlambert said in XOSTOR hyperconvergence preview:

shared=true device-config:provisioning=thin

Interesting... Thank you for this implementation, looks promising.

Out of curiosity, can we use this "technology" instead of XOA's CR job? So, for a single host setup, with 1 Local disk, and for example a second network iscsi/nas drive, can we use xostor to clone the data in the background?

This is more of a backup solution than an HA setup, but it could ensure up-to-date data on the iSCSI device.

ronan-a

@jeffberntsen Hello, so I'm available. ^^

If you're interested, I can gather up logs covering the day of the crashes or any other information you might want. Just let me know what you'd want for that and I'll be happy to collect it for you.

Yes! Could you send me your logs? (Old XCP-ng logs: xha.log, daemon.log, SMlog, kern.log..., and the logs of LINSTOR: /var/log/linstor-{controller/satellite})
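
A minimal sketch of how the requested logs could be bundled on each host. The helper name collect_logs is hypothetical, and it assumes the XCP-ng logs (xha.log, daemon.log, SMlog, kern.log) live directly under the same log directory as the LINSTOR ones, normally /var/log. Rotated copies such as SMlog.1 are picked up by the trailing wildcard.

```shell
# Hypothetical helper: archive whichever of the requested logs exist.
#   $1 = log directory (assumed to be /var/log on an XCP-ng host)
#   $2 = archive file to create
collect_logs() {
    log_dir=$1
    out=$2
    set --   # reuse the positional parameters as the list of files to archive
    for name in xha.log daemon.log SMlog kern.log linstor-controller linstor-satellite; do
        # The wildcard also matches rotated files (for example SMlog.1.gz).
        for path in "$log_dir/$name"*; do
            if [ -e "$path" ]; then
                set -- "$@" "${path#"$log_dir"/}"
            fi
        done
    done
    if [ "$#" -eq 0 ]; then
        echo "no logs found in $log_dir" >&2
        return 1
    fi
    # Store paths relative to the log directory.
    tar -czf "$out" -C "$log_dir" "$@"
}
```

It could be run on each host as collect_logs /var/log "/tmp/logs-$(hostname).tar.gz", and the resulting archives shared by download link.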

The actual crashes seem to be related to fencing of machines and recovery after I intentionally forced an outage on one of the test servers. My test was making sure that the LINSTOR controller service was running on the server acting as the pool master and hosting a couple of the several test VMs with HA enabled. Then I forced a failure by pulling its power cord.

It could be a long delay during the restart of the linstor-controller or a bad sync in the DRBD layer.
I have already observed a long delay in a similar test, but we would still have to check the logs.

In this situation you can execute this command where the current linstor controller is running: linstor resource list. It's useful to check the current state.

That was three days ago and the pool has been up and running since then without problems with HA enabled. I've since disabled it after looking at the logs associated with the crash and seeing that having HA running with its heartbeat storage on top of LINSTOR was causing the logs to fill up at a rate of several lines per second.

Yeah, we are aware of this problem. We have discussed reducing the verbosity of the DRBD logs with the LINBIT team, and there is a new patch to test in the next CH release that compresses log files more often. It would help reduce the space usage of /var/log.

JeffBerntsen

@ronan-a said in XOSTOR hyperconvergence preview:

@jeffberntsen Hello, so I'm available. ^^

If you're interested, I can gather up logs covering the day of the crashes or any other information you might want. Just let me know what you'd want for that and I'll be happy to collect it for you.

Yes! Could you send me your logs? (Old XCP-ng logs: xha.log, daemon.log, SMlog, kern.log..., and the logs of LINSTOR: /var/log/linstor-{controller/satellite})

Absolutely. I've grabbed all logs from the system including the XCP-ng crash log folder from the day I ran the test and a few days before and after. I've got .tar.gz files of the contents of the logs folders from each of the three servers in my test pool covering that period, about 250MB of compressed files total. What would be the best way to get them to you?

It could be a long delay during the restart of the linstor-controller or a bad sync in the DRBD layer.
I have already observed a long delay in a similar test, but we would still have to check the logs.

In this situation you can execute this command where the current linstor controller is running: linstor resource list. It's useful to check the current state.

I did that after everything came back up on its own and that reported all resources as up and healthy.

Something I noticed is that the linstor command only works on the host running as linstor controller at the time as the cli is looking for the controller running on localhost.

I think the delay in my case was the controller coming back up on a different host. I didn't see any sign of a bad sync in DRBD. (I've used DRBD on and off quite a bit so have some experience with that but have very little with LINSTOR).

Yeah, we are aware of this problem. We have discussed reducing the verbosity of the DRBD logs with the LINBIT team, and there is a new patch to test in the next CH release that compresses log files more often. It would help reduce the space usage of /var/log.

I'm pretty sure that's just related to HA from what I could see. You've obviously worked with it more than I have so please correct me if I'm wrong but it looks like LINSTOR tries to switch the active copy of the data to whichever system tries to write to the resource at the time and in HA, all of the servers in the pool are constantly trying to write to the HA metadata and heartbeat VDIs, driving LINSTOR crazy trying to keep up. As far as I can see, that doesn't happen with normal VM use because they're normally opened, read, and written by just one system at a time.

ronan-a

@jeffberntsen

Absolutely. I've grabbed all logs from the system including the XCP-ng crash log folder from the day I ran the test and a few days before and after. I've got .tar.gz files of the contents of the logs folders from each of the three servers in my test pool covering that period, about 250MB of compressed files total. What would be the best way to get them to you?

You can upload it where you want. Then you can send me a private message with the download link. Thank you.

I did that after everything came back up on its own and that reported all resources as up and healthy.

Something I noticed is that the linstor command only works on the host running as linstor controller at the time as the cli is looking for the controller running on localhost.

You can use this command when you are on another host: linstor --controllers=<HOSTNAME_OR_IP> resource list. Note: the linstor-controller service is started automatically by a specific smapi daemon, minidrbdcluster, because we want to detect a host crash or reboot at any time and start a new controller if necessary. Also, the LINSTOR DB is shared using a VDI, so the controller service must always be started by XCP-ng and not by a user.

I think the delay in my case was the controller coming back up on a different host. I didn't see any sign of a bad sync in DRBD. (I've used DRBD on and off quite a bit so have some experience with that but have very little with LINSTOR).
I'm pretty sure that's just related to HA from what I could see. You've obviously worked with it more than I have so please correct me if I'm wrong but it looks like LINSTOR tries to switch the active copy of the data to whichever system tries to write to the resource at the time and in HA, all of the servers in the pool are constantly trying to write to the HA metadata and heartbeat VDIs, driving LINSTOR crazy trying to keep up. As far as I can see, that doesn't happen with normal VM use because they're normally opened, read, and written by just one system at a time.

Very good analysis on your part. Indeed this VDI is shared, and DRBD prevents us from opening it on several hosts at once. For the moment we haven't found a better solution than to open, write, and close it each time. That is why we must reduce the spam in the log files with a few patches.

JeffBerntsen

@ronan-a said in XOSTOR hyperconvergence preview:

You can upload it where you want. Then you can send me a private message with the download link. Thank you.

Done. Let me know if you have any problems getting to it.

You can use this command when you are on another host: linstor --controllers=<HOSTNAME_OR_IP> resource list. Note: the linstor-controller service is started automatically by a specific smapi daemon, minidrbdcluster, because we want to detect a host crash or reboot at any time and start a new controller if necessary. Also, the LINSTOR DB is shared using a VDI, so the controller service must always be started by XCP-ng and not by a user.

A little reading around in the LINSTOR documentation eventually helped me out with this. It's possible to set up an environment variable LS_CONTROLLERS with a list of possible controller machines and the linstor CLI command will try all of the servers on the list until it finds the controller. On the three servers in my test pool, I can do something when I first get into a shell like LS_CONTROLLERS=server1,server2,server3 and as long as the three server names can be resolved on all three hosts either via DNS or because they're in the /etc/hosts file, the linstor command works from any of them no matter which one is the controller.
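
As a short sketch of the above: the host names server1, server2 and server3 are placeholders, and the variable has to be exported so that the linstor child process actually sees it. The guard around the last command is only there so the snippet does nothing harmful on a machine without the LINSTOR client.

```shell
# Placeholder host names: replace them with names that resolve on every host
# (through DNS or /etc/hosts).
export LS_CONTROLLERS=server1,server2,server3

# The client tries each listed host until it finds the active controller.
if command -v linstor >/dev/null 2>&1; then
    linstor resource list
fi
```

Putting the export line in a shell profile on each host would avoid retyping it at every login; for a one-off query, the --controllers option mentioned earlier in the thread does the same job.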

JeffBerntsen

@olivierlambert said in XOSTOR hyperconvergence preview:

Play with it, snapshot, backup, whatever. What matters is resiliency.

More playing: I've just installed the latest set of updates as a rolling pool update and it handled things fine. No problems with XOSTOR shifting the VMs around during the update and no apparent problems afterward.

olivierlambert

That's a great test indeed. I have to say I'm impressed. Maybe it's because I'm so used to the corner cases I tested for months triggering various issues, but every time @ronan-a came up with a solution. Kudos to him!

Maelstrom96

@olivierlambert This looks very promising. We're currently running K8s on top of XCP-ng hosts and deploying everything through XOA with terraform adapters. It's been working well for us, but we're not using a shared SR, which we're looking into deploying. The nice thing is that it looks like we could actually use LINSTOR directly from K8s, removing two storage layers completely (OpenEBS + soft RAID 5 local SR), and making the whole thing work even better for both XCP-ng and K8s.

I have a question before trying to deploy this - how would we go about changing the SR adapter in case we need to add, remove or replace an XCP-ng host? Should we be able to change the SR configuration while it's active?

olivierlambert

Likely a question for @ronan-a

edit: however, I'd love to have a chat with you to discuss your existing k8s workflow with XCP-ng/XOA!

ronan-a

@maelstrom96

I have a question before trying to deploy this - how would we go about changing the SR adapter in case we need to add, remove or replace an XCP-ng host? Should we be able to change the SR configuration while it's active?

Well, a LINSTOR SR can be updated with new/deleted hosts. For the moment we don't have a script to simplify this, but with a few linstor and smapi commands, you can do that.

Maelstrom96

@olivierlambert Feel free to email me at alexandre@floatplane.com.

@ronan-a I'll be deploying a test cluster this week and will see if I can figure out the proper commands to perform those actions. Regarding the LINSTOR GUI, it seems like it's only supported on a controller. Would that mean that I should install it on the cluster master DOM0?

olivierlambert

Duly noted thanks

Maelstrom96

@ronan-a I'm really not sure what I'm doing wrong, but I can't seem to be able to make it work at all :

[20:27 xostor1 log]# rpm -qa | grep -E "^(sm|xha)-.*linstor.*"
sm-2.30.4-1.1.0.linstor.8.xcpng8.2.x86_64
xha-10.1.0-2.2.0.linstor.1.xcpng8.2.x86_64
[20:27 xostor1 log]# lsblk
NAME                              MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdb                                 8:16   0   200G  0 disk
├─linstor_group-thin_device_tdata 252:1    0 399.8G  0 lvm
│ └─linstor_group-thin_device     252:2    0 399.8G  0 lvm
└─linstor_group-thin_device_tmeta 252:0    0   100M  0 lvm
  └─linstor_group-thin_device     252:2    0 399.8G  0 lvm
sr0                                11:0    1  14.5M  0 rom
sdc                                 8:32   0   200G  0 disk
└─linstor_group-thin_device_tdata 252:1    0 399.8G  0 lvm
  └─linstor_group-thin_device     252:2    0 399.8G  0 lvm
sda                                 8:0    0   100G  0 disk
└─md127                             9:127  0   100G  0 raid1
  ├─md127p5                       259:3    0     4G  0 md    /var/log
  ├─md127p3                       259:2    0   512M  0 md
  ├─md127p1                       259:0    0    18G  0 md    /
  ├─md127p6                       259:4    0     1G  0 md    [SWAP]
  └─md127p2                       259:1    0    18G  0 md

[20:27 xostor1 log]# xe sr-create type=linstor name-label=main-r3 host-uuid=8832105e-d307-45de-bcc3-6d61bb299dd4 device-config:hosts=xostor1,xostor2,xostor3 device-config:group-name=linstor_group/thin_device device-config:redundancy=1 shared=true device-config:provisioning=thin
Error code: SR_BACKEND_FAILURE_5006
Error parameters: , LINSTOR SR creation error [opterr=Not enough online hosts],
[20:28 xostor1 log]# ping xostor1
PING xostor1.floatplane.com (10.5.0.11) 56(84) bytes of data.
64 bytes from xostor1.floatplane.com (10.5.0.11): icmp_seq=1 ttl=64 time=0.048 ms
64 bytes from xostor1.floatplane.com (10.5.0.11): icmp_seq=2 ttl=64 time=0.054 ms
--- xostor1.floatplane.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.048/0.051/0.054/0.003 ms
[20:28 xostor1 log]# ping xostor2
PING xostor2.floatplane.com (10.5.0.12) 56(84) bytes of data.
64 bytes from xostor2.floatplane.com (10.5.0.12): icmp_seq=1 ttl=64 time=2.48 ms
64 bytes from xostor2.floatplane.com (10.5.0.12): icmp_seq=2 ttl=64 time=1.86 ms
--- xostor2.floatplane.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 1.865/2.177/2.489/0.312 ms
[20:28 xostor1 log]# ping xostor3
PING xostor3.floatplane.com (10.5.0.13) 56(84) bytes of data.
64 bytes from xostor3.floatplane.com (10.5.0.13): icmp_seq=1 ttl=64 time=2.55 ms
64 bytes from xostor3.floatplane.com (10.5.0.13): icmp_seq=2 ttl=64 time=1.25 ms
--- xostor3.floatplane.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 1.256/1.904/2.553/0.649 ms
[20:28 xostor1 log]# linstor resource list
Error: Unable to connect to linstor://localhost:3370: [Errno 99] Cannot assign requested address

I've run ./install --disks /dev/sdb /dev/sdc --thin on every host, starting with the master, and when I tried to run the SR create command, it failed with the error that you can see in the output above. Your input would be greatly appreciated. I'll try other things to see if I can figure it out in the meantime.

Edit: Here are some logs from /var/log/xensource.log

Feb  1 20:37:23 xostor1 xapi: [ info||3640 /var/lib/xcp/xapi||cli] xe sr-create type=linstor name-label=MAIN3 host-uuid=8832105e-d307-45de-bcc3-6d61bb299dd4 device-config:hosts=xostor1,floatplane.com,xostor2.floatplane.com,xostor3.floatplane.com device-config:group-name=linstor_group/thin_device device-config:redundancy=3 shared=true device-config:provisioning=thin username=root password=(omitted)
Feb  1 20:37:23 xostor1 xapi: [ info||3640 /var/lib/xcp/xapi|session.login_with_password D:c5de2bbfe3f2|xapi_session] Session.create trackid=5d48b0e671ba2d39d6df368ab040b146 pool=false uname=root originator=cli is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Feb  1 20:37:23 xostor1 xapi: [debug||3641 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:pool.get_all D:d9cfba7ee497 created by task D:c5de2bbfe3f2
Feb  1 20:37:23 xostor1 xapi: [debug||3640 /var/lib/xcp/xapi|SR.create R:cdb8dee1e91a|audit] SR.create: name label = 'MAIN3'
Feb  1 20:37:23 xostor1 xapi: [debug||3640 /var/lib/xcp/xapi|SR.create R:cdb8dee1e91a|xapi_sr] SR.create name_label=MAIN3 sm_config=[  ]
Feb  1 20:37:23 xostor1 xapi: [debug||3640 /var/lib/xcp/xapi|SR.create R:cdb8dee1e91a|mux] register SR 92f4aa20-ff55-65d9-e343-84f3f7beb552 (currently-registered = [ 049abfb4-c910-75b7-02e8-902dd68799d2, 92f4aa20-ff55-65d9-e343-84f3f7beb552, 9d382464-3747-43cd-ddee-8ea1a8e5f71a, 4a4483a7-d868-5e82-fac6-47789633a691, 46ca2df2-2d83-ea0d-d7b1-7e2ee6aee261, f26a3b3e-a1e7-aad6-690f-6ab15b8713b7, 3513da18-66e7-9f77-bfde-b9ca51473a63, f62fe1f8-3fcf-8b9f-b1c9-c4ea4ad692c9, 94483019-536a-4e10-5429-b0939499637f, 466c8387-99e3-bf72-6f0b-69a3bd4eb4a9, 95672541-2c5e-30c9-769a-1f2ccdb9390d ])
Feb  1 20:37:23 xostor1 xapi: [debug||3648 ||dummytaskhelper] task SR.create D:0d99844a7015 created by task R:cdb8dee1e91a
Feb  1 20:37:23 xostor1 xapi: [debug||3648 ||sm] SM linstor sr_create sr=OpaqueRef:c02a4a76-327c-45f9-b7aa-322ddd367eeb size=0
Feb  1 20:37:23 xostor1 xapi: [ info||3648 |sm_exec D:87fb98701fa6|xapi_session] Session.create trackid=b1cc8b6169d2afa2c0406ef60833efb5 pool=false uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Feb  1 20:37:23 xostor1 xapi: [debug||3649 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:pool.get_all D:17ddb8970480 created by task D:87fb98701fa6
Feb  1 20:37:23 xostor1 xapi: [ warn||3639 HTTPS 10.2.0.5->:::80|event.from D:2d93a68de684|xapi_message] get_since_for_events: no in_memory_cache!
Feb  1 20:37:23 xostor1 xapi: [debug||3650 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.logout D:fe0e500419b8 created by task D:b770fe324145
Feb  1 20:37:23 xostor1 xapi: [ info||3650 /var/lib/xcp/xapi|session.logout D:379f9df26b48|xapi_session] Session.destroy trackid=e1378d042bc7e7f579b63e67ba555677
Feb  1 20:37:23 xostor1 xapi: [debug||227 ||xenops] Event on VM 5378d6b6-e759-44ca-8cde-9ae83151dc60; resident_here = true
Feb  1 20:37:23 xostor1 xapi: [debug||3651 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.slave_login D:5ce5e24c056b created by task D:b770fe324145
Feb  1 20:37:23 xostor1 xapi: [ info||3651 /var/lib/xcp/xapi|session.slave_login D:82827ae03cfd|xapi_session] Session.create trackid=99296cff5c22655aa4cf7b15a343cabc pool=true uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Feb  1 20:37:23 xostor1 xapi: [debug||227 ||dummytaskhelper] task timeboxed_rpc D:b0c8d8711947 created by task D:5fd006508046
Feb  1 20:37:23 xostor1 xapi: [debug||3652 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:pool.get_all D:268f2a0db4e5 created by task D:82827ae03cfd
Feb  1 20:37:23 xostor1 xapi: [debug||3653 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:event.from D:53729b99cb8e created by task D:5fd006508046
Feb  1 20:37:23 xostor1 xapi: [debug||3654 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:event.from D:92735847af98 created by task D:b770fe324145
Feb  1 20:37:23 xostor1 xapi: [ warn||3656 HTTPS 10.2.0.5->:::80|event.from D:8c1fe22ca8b2|xapi_message] get_since_for_events: no in_memory_cache!
Feb  1 20:37:23 xostor1 xapi: [debug||3657 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:host.get_other_config D:b9a7457e385e created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3658 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:SR.get_sm_config D:fc7430bb3d2f created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3659 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:SR.get_all_records_where D:5422781d1f22 created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3660 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:host.get_all_records D:7e1925a42dd6 created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3661 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:host_metrics.get_record D:43e422635fef created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3662 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:host_metrics.get_record D:cd6eb6efca81 created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [debug||3663 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:host_metrics.get_record D:72bd45336bdd created by task D:0d99844a7015
Feb  1 20:37:23 xostor1 xapi: [ info||3648 |sm_exec D:87fb98701fa6|xapi_session] Session.destroy trackid=b1cc8b6169d2afa2c0406ef60833efb5
Feb  1 20:37:23 xostor1 xapi: [error||3648 ||backtrace] sm_exec D:87fb98701fa6 failed with exception Storage_error ([S(Backend_error);[S(SR_BACKEND_FAILURE_5006);[S();S(LINSTOR SR creation error [opterr=Not enough online hosts]);S()]]])
Feb  1 20:37:23 xostor1 xapi: [error||3648 ||backtrace] Raised Storage_error ([S(Backend_error);[S(SR_BACKEND_FAILURE_5006);[S();S(LINSTOR SR creation error [opterr=Not enough online hosts]);S()]]])
Feb  1 20:37:23 xostor1 xapi: [error||3648 ||backtrace] 1/8 xapi Raised at file ocaml/xapi/sm_exec.ml, line 377

Maelstrom96

After reading the sm LinstorSR file, I figured out that the host names need to exactly match the host names in the XCP-ng pool. I thought I had tried that and that it failed the same way, but after retrying with all valid hosts, it set up the SR correctly.
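
A rough sketch of a pre-flight check for this: compare the names intended for device-config:hosts against the names the pool reports before running sr-create. The function missing_hosts is a hypothetical helper, and which host field the driver actually compares against is an assumption here (the commented usage reads the hostname field through xe host-list).

```shell
# Hypothetical helper: print the wanted host names that are absent from the
# pool's list. Both arguments are comma-separated; matching is exact and
# case-sensitive, so "xostor2.floatplane.com" does not match "xostor2".
missing_hosts() {
    wanted=$1
    known=$2
    missing=""
    old_ifs=$IFS
    IFS=,
    for h in $wanted; do
        case ",$known," in
            *",$h,"*) ;;                                 # exact match found
            *) missing="${missing:+$missing,}$h" ;;      # remember the mismatch
        esac
    done
    IFS=$old_ifs
    printf '%s\n' "$missing"
}

# On a pool member (assumption: the relevant field is "hostname"):
#   missing_hosts "xostor1,xostor2,xostor3" "$(xe host-list params=hostname --minimal)"
```

An empty result means every name in the list has an exact counterpart in the pool; anything printed is a name worth correcting before retrying.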

Something I've also noticed in the code is that there doesn't seem to be a way to deploy a secondary SR connected to the same LINSTOR controller that could have a different replication factor. For some VMs that have built-in software replication/HA, like DBs, it might be preferred to have replication=1 set for the VDI.

ronan-a

@maelstrom96 Hello,

Something I've also noticed in the code is that there doesn't seem to be a way to deploy a secondary SR connected to the same LINSTOR controller that could have a different replication factor.

For the moment yes, you can only use one LinstorSR in a pool. Ideally we would like to modify the driver to support several SRs, perhaps during a rewrite of the driver in the latest version of the smapi.

For some VMs that have built-in software replication/HA, like DBs, it might be preferred to have replication=1 set for the VDI.

We could allow this behavior without having other SRs. It would be enough to pass a replication parameter for this particular VDI when it is created. So thank you for this feedback. I think we should implement this use case in the future.