XCP-ng

    XOSTOR hyperconvergence preview

    457 Posts 50 Posters 538.0k Views 53 Watching
    • SwenS Offline
      Swen @abufrejoval
      last edited by

      @abufrejoval thx for your feedback. I need to investigate this further. We are already using a different switch for the 10Gbit interfaces, with a separate IP subnet.

      • SwenS Offline
        Swen
        last edited by

        @ronan-a I was unable to find any documented limitations regarding the bandwidth of an interface. Do you know anything about it?

        • A Offline
          abufrejoval Top contributor @Swen
          last edited by abufrejoval

          @Swen

          How do you measure? Do you measure disk I/O, e.g. via Jens Axboe's wonderful fio tool, or do you measure at the network level first, e.g. via iperf3?

          I've gotten around 300MB/s write speeds inside a Windows VM using CrystalDiskMark, with 4-way LINSTOR replication on Xcp-ng running nested under VMware Workstation on Windows (Ryzen 9 5950X 16-core, plenty of RAM, all-NVMe storage).

          iperf3 between these virtual Xcp-ng hosts will only yield around 5Gbit/s, so 300MB/s is rather better than I'd expect, given that each block is replicated 4 times. Reads in CrystalDiskMark are better than 1.3GB/s, as they don't suffer from write amplification and could actually be done round-robin (and it seems they are, too).

          But that's a nested virtualization setup, which is really just meant for functional failure testing, not for meaningful benchmarking.

          I haven't gotten around to using LINSTOR yet on my physical NUC8/10/11 cluster with 10Gbit NICs, but those give me close to 10Gbit/s with iperf3, while a Xeon-D 1542 based host only reaches about 5-6Gbit/s with budget Aquantia AQC107 NICs all around, which don't support much in terms of offload capabilities.

          On oVirt I used an MTU of 9000 to reach full 10Gbit bandwidth on all machines, but I haven't found any documentation on how to increase the MTU on the physical NICs in Xcp-ng yet.
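
          Something like the following might work through the xe CLI, though I haven't tried it on Xcp-ng yet and the UUIDs are placeholders (MTU is a property of the pool network, as far as I understand):

              # set the MTU on the pool network that backs the 10Gbit PIFs
              xe network-list
              xe network-param-set uuid=<network-uuid> MTU=9000
              # re-plug the PIF on each host (or reboot) so the new MTU takes effect
              xe pif-unplug uuid=<pif-uuid>
              xe pif-plug uuid=<pif-uuid>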

          • SwenS Offline
            Swen @abufrejoval
            last edited by

            @abufrejoval I am using dd on Ubuntu 20 VMs on 3 ProLiant DL360 servers with SSDs. On each server I mounted 1 SSD directly to XCP-ng and 3 to linstor. When I do a

            dd if=/dev/zero of=benchfile bs=4k count=2000000 && sync; rm benchfile
            

            on a VM using local storage, I get around 185MB/s.

            When I do the same on 1 VM on linstor storage, I get around 125MB/s,

            but when I do the test on 2 VMs on linstor storage on the same XCP-ng host, I get around 60MB/s each.

            To me it looks like the NIC is the bottleneck, but please correct me if I am wrong.
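
            For reference, a raw network check between two hosts with iperf3 would look roughly like this (the address is just a placeholder for the storage-network IP of the first host, and iperf3 may need to be installed first):

                # on the first XCP-ng host
                iperf3 -s
                # on the second host, against the first one
                iperf3 -c <storage-net-ip> -t 30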

            • SwenS Offline
              Swen @Swen
              last edited by

              @ronan-a another thing I found is that linstor occupies more storage than expected. I created the SR with the 'thin' option and created 2 VMs, each with a 50GB disk. XCP-ng Center is showing me

              238.7 GB used of 2.6 TB total (150 GB allocated)
              

              I would not have expected that! I would have expected less than 100 GB used and allocated.

              • ronan-aR Offline
                ronan-a Vates 🪐 XCP-ng Team @Swen
                last edited by

                @Swen Could you list the VDIs of your linstor SR please? 🙂

                • SwenS Offline
                  Swen @ronan-a
                  last edited by Swen

                  @ronan-a sure, do you mean the output of xe vdi-list?

                  • ronan-aR Offline
                    ronan-a Vates 🪐 XCP-ng Team @Swen
                    last edited by

                    @Swen Yes, because this allocation value is indeed surprising.

                    • SwenS Offline
                      Swen @ronan-a
                      last edited by

                      @ronan-a

                      [16:30 xcp-test1 ~]# xe vdi-list sr-uuid=77e5097a-c971-34e4-9506-7386a1e640b8
                      uuid ( RO)                : 23876ae4-27b3-4f2f-8c8b-eb623b2dc2e4
                                name-label ( RW): base copy
                          name-description ( RW):
                                   sr-uuid ( RO): 77e5097a-c971-34e4-9506-7386a1e640b8
                              virtual-size ( RO): 53687091200
                                  sharable ( RO): false
                                 read-only ( RO): true
                      
                      
                      uuid ( RO)                : 3a2ab3da-5507-4c7e-aa07-497c65b18ec1
                                name-label ( RW): ubuntu20-linstor 0
                          name-description ( RW): Created by template provisioner
                                   sr-uuid ( RO): 77e5097a-c971-34e4-9506-7386a1e640b8
                              virtual-size ( RO): 53687091200
                                  sharable ( RO): false
                                 read-only ( RO): false
                      
                      
                      uuid ( RO)                : 13a8fa52-9aa3-490b-86e0-eedb101128f9
                                name-label ( RW): ubuntu20-linstor 0
                          name-description ( RW): Created by template provisioner
                                   sr-uuid ( RO): 77e5097a-c971-34e4-9506-7386a1e640b8
                              virtual-size ( RO): 53687091200
                                  sharable ( RO): false
                                 read-only ( RO): false
                      

                      OK, the third VDI makes sense, because I used storage-level fast disk clone to duplicate the VM. This explains the allocated value, I guess, but not the used one.

                      Did you see my other question? Are you aware of any NIC constraints regarding throughput?

                      • SwenS Offline
                        Swen @ronan-a
                        last edited by

                        @ronan-a Wait a sec, maybe I found the root cause. I created a snapshot of a VM and deleted it. That created another base copy VDI, and allocated space is now 200GB. Maybe I need to wait for the cleanup job to take care of this?

                        • ronan-aR Offline
                          ronan-a Vates 🪐 XCP-ng Team @Swen
                          last edited by ronan-a

                          @Swen The 150GiB are related to the base copy VDI, yes. 😉
                          Of course this value is just the maximum amount of data that can be used, because you use the thin LVM plugin. (It's not the real used data.)
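
                          For example, you can cross-check the real allocation on a node like this (the VG name is just a placeholder for your LINSTOR volume group):

                              # real thin allocation of each logical volume on this node
                              lvs -o lv_name,lv_size,data_percent <linstor-vg>
                              # LINSTOR's own view of per-volume allocation
                              linstor volume list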

                          Regarding NIC, I didn't encounter any problems during my tests. The best way to measure the DRBD performance is to use fio directly in a VM and also on the host with a DRBD volume.
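
                          For example (just a sketch, adapt the paths and parameters; writing directly to the DRBD device is destructive):

                              # inside a VM: random writes against a file on the LINSTOR-backed disk
                              fio --name=vm-randwrite --filename=/root/fio.test --size=2g --ioengine=libaio \
                                  --direct=1 --rw=randwrite --bs=4k --iodepth=16 --runtime=60 --time_based --group_reporting
                              # on the host: the same test directly against the DRBD volume (destroys its data!)
                              fio --name=drbd-randwrite --filename=/dev/drbd<minor> --ioengine=libaio \
                                  --direct=1 --rw=randwrite --bs=4k --iodepth=16 --runtime=60 --time_based --group_reporting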

                          The difference between local storage and DRBD is not a surprise:

                          • DRBD must sync the data between nodes
                          • DRBD is on top of LVM
                          • A Offline
                            abufrejoval Top contributor @Swen
                            last edited by

                            @Swen
                            Writing zeros should result in nothing written with thin allocation (or dedup and compression): that's why I am hesitant to use /dev/zero as a source.

                            Of course /dev/random could require too much overhead, depending on the quality and implementation, which is why I like to use fio: a bit of initial effort to learn and understand the tool, but much better control, especially when it comes to dealing with an OS that tries to be smart.

                            • SwenS Offline
                              Swen @ronan-a
                              last edited by

                              @ronan-a did you use 10Gbit interfaces for linstor traffic? I am aware that there is a difference between local storage and DRBD, but if the difference is that high, linstor is not really interesting for high-performance workloads. I need to be sure that the root cause is not related to my setup.

                              @ronan-a @abufrejoval which exact fio params are you using to test your environment, and can you share some numbers so we can compare them?

                              • olivierlambertO Online
                                olivierlambert Vates 🪐 Co-Founder CEO
                                last edited by olivierlambert

                                We mostly use those displayed in this blog post: https://smcleod.net/tech/2016/04/29/benchmarking-io/

                                edit: depending on the storage, iodepth can be increased.

                                • A Offline
                                  abufrejoval Top contributor @Swen
                                  last edited by

                                  @Swen

                                  There are obviously tons of variations...

                                  I've used this fio file a lot to quickly get an understanding of how a piece of storage performs.

                                  Basically it only uses a small 100MB file, but tells the OS to avoid buffering and then goes over it with a mix of reads and writes at increasing block sizes, essentially going from very random to almost sequential in a single run.

                                  It has helped me find issues with Gluster, identify network bandwidth problems, and even spot degraded RAIDs with a bad BBU. It creates the test file in the working directory unless you change the filename.

                                  [global]
                                  filename=fio.file
                                  ioengine=libaio
                                  rw=randrw
                                  size=100m
                                  norandommap
                                  direct=1
                                  iodepth=1
                                  time_based
                                  runtime=10
                                  [B512]
                                  bs=512
                                  stonewall
                                  [B1k]
                                  bs=1k
                                  stonewall
                                  [B2k]
                                  bs=2k
                                  stonewall
                                  [b4k]
                                  bs=4k
                                  stonewall
                                  [b8k]
                                  bs=8k
                                  stonewall
                                  [b16k]
                                  bs=16k
                                  stonewall
                                  [b32k]
                                  bs=32k
                                  stonewall
                                  [b64k]
                                  bs=64k
                                  stonewall
                                  [b512k]
                                  bs=512k
                                  stonewall
                                  [b1m]
                                  bs=1m
                                  stonewall
                                  

                                  Numbers: It should approach the network bandwidth towards the end (potentially divided by write amplification).
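
                                  To run it, save the file under any name (fio-sweep.fio below is just an example) and point fio at it:

                                      fio fio-sweep.fio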

                                  • dumarjoD Offline
                                    dumarjo @ronan-a
                                    last edited by

                                    @ronan-a Hi,

                                    I tested your branch, and newly added hosts in the pool are now attached to the XOSTOR. This is nice!

                                    I have looked at the code, but I'm not sure whether, in the current state of your branch, we can add a disk on the new host and update the replication? I think not... but just to be sure.

                                    • ronan-aR Offline
                                      ronan-a Vates 🪐 XCP-ng Team @dumarjo
                                      last edited by

                                      @dumarjo linstor resource-group modify --place-count=X should be enough to update the replication. 🙂 I'm not sure about adding a command to the plugin right now (but probably yes, for the XOA integration).
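
                                      For example (the group name is a placeholder; check the real one with the list command first):

                                          linstor resource-group list
                                          linstor resource-group modify <group-name> --place-count 3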

                                      • Maelstrom96M Offline
                                        Maelstrom96 @ronan-a
                                        last edited by

                                        @ronan-a said in XOSTOR hyperconvergence preview:

                                        For some VMs that have built-in software replication/HA, like DBs, it might be preferred to have replication=1 set for the VDI.

                                        We can authorize this behavior without having other SRs. It would suffice to pass a replication parameter for this particular VDI when it is created. So thank you for this feedback. I think we must implement this use case for the future.

                                        @ronan-a Has anything been done regarding this feature? I scanned the thread, but I couldn't really find anything related to a new VDI option.

                                        • olivierlambertO Online
                                          olivierlambert Vates 🪐 Co-Founder CEO
                                          last edited by

                                          It might be done in the future, but that's not the priority for a v1 🙂

                                          • Maelstrom96M Offline
                                            Maelstrom96 @olivierlambert
                                            last edited by

                                            @olivierlambert
                                            I just checked the sm repository, and it looks like it wouldn't be that complicated to add a new sm-config option and pass it down to the volume creation. Do you accept PRs/contributions to that repository? We're really interested in this feature, and I think I can take the time to write the code to handle it.
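
                                            Something like this is what I have in mind; the sm-config key name is purely hypothetical, nothing like it exists in sm today:

                                                # hypothetical: create a VDI on the LINSTOR SR with its own replica count
                                                xe vdi-create sr-uuid=<linstor-sr-uuid> name-label=db-data \
                                                    type=user virtual-size=50GiB sm-config:replication=1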
