XCP-ng

    XOSTOR hyperconvergence preview

    • SwenS Offline
      Swen
      last edited by

      hi @ronan-a,
      we did some performance testing with the latest version and ran into a bottleneck that we are unable to identify in detail.

      Here is our setup:
      Dell R730
      CPU: 2x Intel E5-2680v4
      RAM: 384GB
      Storage: 2x NVMe Samsung PM9A3 3.84TB via U.2 PCIe 3 x16 Extender Card
      NICs: 2x 10G Intel, 2x 40G Intel

      We have 3 servers with the same configuration and installed them as a cluster with a replica count of 2.
      XCP-ng 8.2 with the latest patches is installed. All servers are connected to the same switch (2x QFX5100-24Q, configured as a virtual chassis). We are using an LACP bond on the 40G interfaces.

      When using the 10G interfaces (XCP-ng uses those as management interfaces) for linstor traffic, we run into a cap on the NIC bandwidth of around 4 Gbit/s (500 MB/s).
      When using the bonded 40G interfaces, the cap is around 8 Gbit/s (1000 MB/s).

      Only 1 VM is installed on the pool. We are using Ubuntu 22.04 LTS with the latest updates, installed from ISO using the Ubuntu 20.04 template.

      Here is the fio command we are using:
      fio --name=a --direct=1 --bs=1M --iodepth=32 --ioengine=libaio --rw=write --filename=/tmp/test.io --size=100G

      I would expect far more, because we do not hit any known bottleneck of the interfaces, NVMe drives or PCIe slots. Am I missing something? Is this the expected performance? If not, any idea what the bottleneck is? Does anybody have data we can compare with?

      regards,
      Swen

      • olivierlambertO Offline
        olivierlambert Vates 🪐 Co-Founder CEO
        last edited by olivierlambert

        1. use an iodepth of 128
        2. use 4 processes at the same time (numjobs=4)
        3. use io_uring in the guest if you can (and not libaio)
        4. don't use a test file, but bench directly on a non-formatted device (like /dev/xvdb); this removes the filesystem layer (see the combined example below)
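
        Putting those suggestions together, a run could look like this (a sketch only; the device name /dev/xvdb is an example, adjust it to a spare, non-formatted virtual disk in your VM):

          fio --name=bench --direct=1 --bs=1M --iodepth=128 --numjobs=4 \
              --ioengine=io_uring --rw=write --filename=/dev/xvdb --group_reporting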

        With those settings in fio, I can reach nearly 2600 MiB/s in read and 900 MiB/s in write with 4 virtual disks in an mdadm RAID0, in the guest (a test VM on Debian 12), on rather "old" Xeon CPUs and a PCIe 3 port with a consumer-grade NVMe SSD.

        Also, one last thing to know: if you use thin provisioning, you need to run the test twice; the first run (while the VHD is growing) is always slower. This is not a problem in real life: run the test two or three times and check the results without counting the first run.

        I'm about to get more recent hardware (except the NVMe) to re-run some tests this week. But as you can see, you can go over a 20G network (I'm using a 25G NIC).

        • SwenS Offline
          Swen @olivierlambert
          last edited by

          @olivierlambert thanks for the feedback! I do not get how you see that you reach 20G on your NIC. Can you please explain it? I see that you reach 2600 MiB/s in read, but that is more likely on a local disk, isn't it? What I can see in our lab environment is that, for whatever reason, we do not get more than around 8 Gbit/s on pass-through via a 40G interface and 4 Gbit/s via a 10G interface, and therefore we do not get any good performance out of the storage repository. I am unable to find the root cause of this. Do you have any idea where to look? I can see high waits in the OS of the VM, but no waits inside dom0 on any node.

          • olivierlambertO Offline
            olivierlambert Vates 🪐 Co-Founder CEO
            last edited by

            First, always do your storage speed test in a regular VM. The Dom0 doesn't matter: you won't run your workload in it, so test what's relevant to you: inside a dedicated VM.

            Also, it's not only a matter of network speed, but also of latency, DRBD, SSD speed and many other things. Only Optane drives or RAM are relevant to really push XOSTOR, because not many NVMe drives can sustain heavy writes without slowing down (especially on a 100 GiB file).

            But first, start to benchmark with the right fio parameters, and let's see 🙂

            • SwenS Offline
              Swen @olivierlambert
              last edited by Swen

              @olivierlambert just to be sure: we also used your recommended fio parameters, with the exact same results. We used fio from inside a VM, not from inside dom0. My comment regarding waits inside the VM and no waits in dom0 was just additional information.

              I am aware of possible bottlenecks like latency, the SSDs and others, but in our case we can rule them out. The reason is that we double our speed when switching from the 10G to the 40G interface while the rest of the configuration stays exactly the same. As far as I can see, it looks like XCP-ng is the bottleneck and is limiting the bandwidth of the interface in some way. Even the numbers you provided are not really good performance numbers. Did you get more bandwidth than 8 Gbit/s over the linstor interface?

              We are going to install Ubuntu on the same servers and install linstor on it, to test our infrastructure on bare metal without any hypervisor and see whether it is XCP-ng related or not.

              • olivierlambertO Offline
                olivierlambert Vates 🪐 Co-Founder CEO
                last edited by olivierlambert

                Those are really good numbers for a replicated block system, on top of a virtualization solution.

                The fact that you are doubling the speed isn't just about bandwidth; it is also likely latency related. XOSTOR works in sync mode, so you have to wait for blocks to be written on the destination before getting the confirmation. You might try bigger blocks to see if you can reach higher throughput.

                Also, remember that if you test in a VM on a single virtual disk, that is absolutely the bottleneck here (tapdisk). There's one process per disk; that's why I advise testing either with multiple VMs at the same time to really push XOSTOR to its limits, or with a big RAID0 built from as many virtual drives as you can (the first option is better, however, because you can have VMs on multiple hosts at the same time).
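
                For reference, a RAID0 of several virtual disks inside the test VM could be set up roughly like this (a minimal sketch, assuming four extra virtual disks appear as /dev/xvdb to /dev/xvde; adjust the device names to your VM):

                  mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
                  fio --name=bench --direct=1 --bs=1M --iodepth=128 --numjobs=4 --ioengine=io_uring --rw=write --filename=/dev/md0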

                In short, the system scales with the number of VMs, not when benchmarking with one VM and one disk only.

                Finally, don't forget that thin mode requires running the test at least twice to really see the performance. On your side, you are very likely CPU bound due to an 8-year-old Intel CPU/architecture, which is not that efficient. But on that, I'll be able to provide real results comparing my Xeon test bench vs Zen in 2 days.

                • G Offline
                  gb.123
                  last edited by

                  Is it possible to change the replication factor later on the fly, e.g. after adding a new host (without losing data)?

                  • olivierlambertO Offline
                    olivierlambert Vates 🪐 Co-Founder CEO
                    last edited by olivierlambert

                    That's a question for @ronan-a when he's back 🙂

                    But in any case, IIRC, the preferred replication number for now is 2.

                    • ronan-aR Offline
                      ronan-a Vates 🪐 XCP-ng Team @gb.123
                      last edited by

                      @gb-123 You can use this command:

                      linstor resource-group modify xcp-sr-linstor_group_thin_device --place-count <NEW_COUNT>
                      

                      You can confirm the resource group to use with:

                      linstor resource-group list
                      

                      Ignore the default group named DfltRscGrp and take the second one.

                      Note: Don't use a replication count greater than 3.

                      • G Offline
                        gb.123
                        last edited by gb.123

                        @ronan-a

                        I have installed XOSTOR with a replica count of 2.
                        I have tried running one e-mail server (Postfix) and one web server (Nginx), with their VHDs on XOSTOR. They seem to run fine. I have not done any benchmarks so far.

                        After installing, I noticed that the VxLAN (encrypted) had stopped working.

                        Update:

                        I managed to fix the VxLAN by:

                        1. Completely removing all PIFs of the VxLAN and removing the VxLAN network from the pool
                        2. Removing the SDN controller configuration from the plugin page > delete configuration
                        3. Shutting down ALL hosts in the pool
                        4. Restarting the XO VM
                        5. Enabling the SDN controller again (this time with "Override certs" ON)
                        6. Clicking "Save configuration"
                        7. Starting all hosts
                        8. Creating the VxLAN again

                        Previous Error Details (Now Solved as mentioned in the update above):

                        So this is how I was using it:

                        1 VM on Node 1, which runs the pfSense router.
                        VxLAN (encrypted) on the pool.

                        Other VMs on Node 1 & Node 2.
                        VMs on Node 1 are connecting to the pfSense, which is running on the same node.
                        VMs on Node 2 have stopped connecting.

                        I have not changed any settings. All I did to install XOSTOR was remove the previous SR (after migrating its VHDs to another SR using XO), edit the partitions (using fdisk), create 2 SRs (one for EXT4 and the other for XOSTOR) and migrate all VHDs back to EXT4 and XOSTOR (as per requirement). I did this for both nodes.
                        I am not sure whether installing XOSTOR has something to do with this or not, so this is not a bug report (yet).

                        So far I have tried:

                        1. Turning the SDN controller in XO off and on again
                        2. Overriding the certificates in the SDN controller in XO and rebooting both host nodes

                        Do you think XOSTOR could have an impact on the MTU of the host network card?
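
                        (For what it's worth, the current MTU of the host PIFs can be checked from dom0 with something like the command below; the parameter list is only an example.)

                          xe pif-list params=device,MTU,network-name-label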

                        Any direction you can give me to diagnose the problem?
                        I am planning to remove the VxLAN and re-create it. Do you think that would help?

                        Update: I checked the XO logs and I get:

                        2023-08-11T14:00:17.907Z xo:xo-server:sdn-controller:tls-connect ERROR TLS connection failed {
                        xen-orchestra-docker-orchestra-1  |   error: [Error: 58EBD01E7F7F0000:error:0A000418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca:../deps/openssl/openssl/ssl/record/rec_layer_s3.c:1586:SSL alert number 48] 
                        {
                        |     library: 'SSL routines',
                        |     reason: 'tlsv1 alert unknown ca',
                        |     code: 'ERR_SSL_TLSV1_ALERT_UNKNOWN_CA'
                        |   },
                        

                        Though earlier, the connection between Node 1 and Node 2 was still being made despite this warning, so this is not new.

                        Further Update:

                        I deleted the previous VxLAN, and now when I try to create it again, it gives me the following error:

                        sdnController.createPrivateNetwork
                        {
                          "poolIds": [
                            "b990a09e(*removed*)"
                          ],
                          "pifIds": [
                            "68fc6193(*removed*)"
                          ],
                          "name": "VxLAN",
                          "description": "Private Lan Network",
                          "encapsulation": "vxlan",
                          "encrypted": true,
                          "mtu": 1500
                        }
                        {
                          "library": "SSL routines",
                          "reason": "tlsv1 alert unknown ca",
                          "code": "ERR_SSL_TLSV1_ALERT_UNKNOWN_CA",
                          "message": "582B645BC47F0000:error:0A000418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca:../deps/openssl/openssl/ssl/record/rec_layer_s3.c:1586:SSL alert number 48
                        ",
                          "name": "Error",
                          "stack": "Error: 582B645BC47F0000:error:0A000418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca:../deps/openssl/openssl/ssl/record/rec_layer_s3.c:1586:SSL alert number 48
                        "
                        }
                        
                        • G Offline
                          gb.123 @ronan-a
                          last edited by

                          @ronan-a

                          Running the linstor command on both hosts gives the following error:

                          Traceback (most recent call last):
                            File "/usr/bin/linstor", line 21, in <module>
                              import linstor_client_main
                          ImportError: No module named linstor_client_main
                          
                          • SwenS Offline
                            Swen @olivierlambert
                            last edited by

                            @olivierlambert did you already have the chance to test your new hardware?

                            We did some more benchmarking, using only the bonded 40G interface.
                            We used the following fio command:
                            fio --name=a --direct=1 --bs=1M --iodepth=32 --ioengine=libaio --rw=write

                            1. on bare metal (OS: Ubuntu 22.04 LTS) we are able to reach 3100 MB/s
                            2. in a VM installed on XCP-ng we are able to reach 1200 MB/s
                            3. in dom0 we are able to reach 1300 MB/s when creating a new linstor volume and using the /dev/drbd device directly (see the sketch below)
                            4. in dom0, when using the LVM volume without DRBD, we are able to reach 1500 MB/s
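
                            For item 3, a run against the DRBD device could look like this (a sketch; the device path is an example, as the minor number depends on the volume LINSTOR created):

                              fio --name=a --direct=1 --bs=1M --iodepth=32 --ioengine=libaio --rw=write --filename=/dev/drbd1000 --size=100G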

                            And by the way, it looks like tapdisk is our bottleneck in dom0, as you suggested before, because it is a single-threaded process and our CPU reached its limit.

                            From the numbers above, the performance inside the VM does not look as bad as we thought at the beginning. The only question we have at this moment is why we are "losing" over half of the performance between a bare-metal installation and testing the same storage from within dom0.
                            Is this expected behavior?

                              • olivierlambertO Offline
                                olivierlambert Vates 🪐 Co-Founder CEO @Swen
                                last edited by

                                 @Swen Yes, it is expected. The Dom0 is NOT the bare metal but a VM in PV mode. Also, it doesn't have all the resources of the host (as a VM, it gets only a fraction of the CPUs/memory, and also -for now- no recent kernel and no io_uring).

                                • ronan-aR Offline
                                  ronan-a Vates 🪐 XCP-ng Team @gb.123
                                  last edited by

                                  @gb-123 said in XOSTOR hyperconvergence preview:

                                  @ronan-a

                                  running linstor command on both hosts gives the following error :

                                  Traceback (most recent call last):
                                  File "/usr/bin/linstor", line 21, in <module>
                                  import linstor_client_main
                                  ImportError: No module named linstor_client_main

                                  Regarding this error, can you tell me your XCP-ng version? And what's the output of yum list installed | grep -i linstor? Do you have XCP-ng 8.3? And did you run the beta installation script on it?

                                  Also, your SSL/VxLAN issue is not caused by LINSTOR. @BenjiReis do you have any idea?

                                  • G Offline
                                    gb.123 @ronan-a
                                    last edited by gb.123

                                    @ronan-a said in XOSTOR hyperconvergence preview:

                                    @gb-123 said in XOSTOR hyperconvergence preview:

                                    @ronan-a

                                    running linstor command on both hosts gives the following error :

                                    Traceback (most recent call last):
                                    File "/usr/bin/linstor", line 21, in <module>
                                    import linstor_client_main
                                    ImportError: No module named linstor_client_main

                                    Regarding this error, can you tell me your XCP-ng version? And what's the output of yum list installed | grep -i linstor?

                                    Output of yum list installed | grep -i linstor :

                                    drbd.x86_64                     9.22.0-1.el7               @xcp-ng-linstor      
                                    drbd-bash-completion.x86_64     9.22.0-1.el7               @xcp-ng-linstor      
                                    drbd-pacemaker.x86_64           9.22.0-1.el7               @xcp-ng-linstor      
                                    drbd-reactor.x86_64             1.0.0-1                    @xcp-ng-linstor      
                                    drbd-udev.x86_64                9.22.0-1.el7               @xcp-ng-linstor      
                                    drbd-utils.x86_64               9.22.0-1.el7               @xcp-ng-linstor      
                                    drbd-xen.x86_64                 9.22.0-1.el7               @xcp-ng-linstor      
                                    kmod-drbd.x86_64                9.2.2+ptf.1_4.19.0+1-1     @xcp-ng-linstor      
                                    linstor-client.noarch           1.18.0-1                   @xcp-ng-linstor      
                                    linstor-common.noarch           1.21.1-1.el7               @xcp-ng-linstor      
                                    linstor-controller.noarch       1.21.1-1.el7               @xcp-ng-linstor      
                                    linstor-satellite.noarch        1.21.1-1.el7               @xcp-ng-linstor      
                                    python-linstor.noarch           1.18.0-1                   @xcp-ng-linstor      
                                    xcp-ng-linstor.noarch           1.2-1.xcpng8.3             @xcp-ng-linstor      
                                    xcp-ng-release-linstor.noarch   1.4-1.xcpng8.3             @xcp-ng-base
                                    

                                    Do you have XCP-ng 8.3? And did you run the beta installation script on it?

                                    I installed XCP-ng 8.3 beta from ISO and then applied all the patches that were available. To install linstor, I followed the 1st Post of this topic.

                                    Also, your SSL/VxLAN issue is not caused by LINSTOR. @BenjiReis do you have any idea?

                                    Yeah, I thought so too, but nothing else had changed on my side before I installed LINSTOR.

                                    There is also a very peculiar thing I am noticing:
                                    A backup from a thin (EXT4) SR to another thin (EXT4) SR transfers at 120 KiB/s over the network (WAN). I thought this might be due to a 'slow' network; however, I started another backup at the same time, while the previous one was running slowly, this time from a thick (LVM) SR to a thin SR, and the speed I got was around 16 MiB/s (which seems OK). I did write a post about this but deleted it, since I need to dig deeper before reporting it to you as an issue.
                                    I will be conducting tests for a few more days just to be sure. (I am talking about Continuous Backup, by the way.)

                                    Update :
                                    After a week of continuous testing, I am getting mixed results on the backup speed as mentioned in the above paragraph. I can now say that the issue may not be related to Linstor.

                                    • Maelstrom96M Offline
                                      Maelstrom96
                                      last edited by

                                      Is there a procedure for updating our current XCP-ng 8.2 cluster to 8.3? My understanding is that if I upgrade a host using the ISO, it will effectively wipe all changes that were made to dom0, including the linstor/sm-linstor packages.

                                      • BenjiReisB Offline
                                        BenjiReis Vates 🪐 XCP-ng Team @gb.123
                                        last edited by

                                        @gb-123 are you using XOA or XO from the sources?
                                        If from the sources, the issue might come from a different openssl version being used when creating the sdn-controller certificates.
                                        You can either try with an XOA or generate your certificates manually.
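
                                        (A generic self-signed certificate can be generated with openssl along these lines; this is only a sketch, the file names and subject are placeholders, and the sdn-controller plugin documentation should be checked for the exact files it expects.)

                                          openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
                                              -subj "/CN=sdn-controller" \
                                              -keyout sdn-controller-key.pem -out sdn-controller-cert.pem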

                                        • G Offline
                                          gb.123 @BenjiReis
                                          last edited by

                                          @BenjiReis said in XOSTOR hyperconvergence preview:

                                          @gb-123 are you using XOA or XO from the sources?
                                          If from the sources, the issue might come from a different openssl version being used when creating the sdn-controller certificates.
                                          You can either try with an XOA or generate your certificates manually.

                                          Using XO from the sources. I just turned on "Override certificates" and reinstalled the whole XO virtual machine. It seems to work fine now.

                                          My only question was why it suddenly stopped working when I installed LINSTOR, as installing LINSTOR should not have any impact on this. That's why I reported it in this thread. 🙂

                                          • BenjiReisB Offline
                                            BenjiReis Vates 🪐 XCP-ng Team @gb.123
                                            last edited by

                                             @gb-123 I see, thanks 🙂

                                             Just bad timing, IMHO: LINSTOR doesn't touch this part of the host, and the openssl issue is more probably coming from the environment where you run your XO.

                                             Anyway, glad it's working now!
