    How to improve backup performance, where's my bottleneck

    Xen Orchestra
    19 Posts 4 Posters 5.0k Views 2 Watching
    • markhewitt1978 @olivierlambert

      @olivierlambert said in How to improve backup performance, where's my bottleneck:

      That's because of how streams work: back pressure is applied. I don't think we keep any buffer (or only a very tiny one) on the buffer level.

      I'll ask @julien-f about this.

      Yes indeed, I've noticed that over a long time, and the amount varies with the amount of memory XOA is given. The default of 2GB is very little; I've given it 32GB, so it stores quite a bit.
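
The back pressure mentioned in the quote can be illustrated with a bounded buffer. This is only a sketch in Python's `asyncio`, standing in for Node streams; it is not Xen Orchestra's code, and the names `run`, `high_water_mark`, and `delay` are made up for the illustration. A fast producer fills a small queue and is then suspended until the slow consumer (standing in for the backup remote) drains it, so the amount buffered never exceeds the limit.

```python
import asyncio


async def producer(queue, n_chunks, stats):
    """Pushes chunks as fast as the queue allows."""
    for i in range(n_chunks):
        # put() suspends while the queue is full: this is the back pressure.
        await queue.put(i)
        stats["max_buffered"] = max(stats["max_buffered"], queue.qsize())
    await queue.put(None)  # end-of-stream marker


async def consumer(queue, delay, stats):
    """Slow writer standing in for the backup remote."""
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        await asyncio.sleep(delay)  # pretend each write takes some time
        stats["written"] += 1


async def run(n_chunks=50, high_water_mark=4, delay=0.001):
    # high_water_mark is a hypothetical name for the buffer limit.
    queue = asyncio.Queue(maxsize=high_water_mark)
    stats = {"max_buffered": 0, "written": 0}
    await asyncio.gather(
        producer(queue, n_chunks, stats),
        consumer(queue, delay, stats),
    )
    return stats


stats = asyncio.run(run())
print(stats)
```

In this model the producer's rate is set entirely by the consumer once the buffer is full, whatever the source could deliver on its own.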

      • olivierlambert Vates 🪐 Co-Founder CEO

        That's a good question. I have only a very partial understanding of how Node streams work at a low level. This is more a question for Node experts 😛

        • markhewitt1978

          b5cf96e8-5965-4828-a808-9ee3d0586d3c-image.png

          The main question is: if the disks can take 300MByte/s in bursts, could they sustain that all the time? Perhaps not. It's just that nikade's post above makes me wonder whether there is some 1Gbit (with overheads) limit in Node itself.

          It's most likely disk speed limited, but interesting to explore all the same.
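
For reference, here is roughly what a "1Gbit with overheads" ceiling would look like in MB/s. This is a back-of-the-envelope sketch that assumes a standard 1500-byte MTU and TCP over IPv4 without options; neither is stated in the thread, and the function names are made up.

```python
# On-wire cost of one full-size Ethernet frame carrying TCP over IPv4.
MTU = 1500                      # assumed standard MTU (no jumbo frames)
ETH_OVERHEAD = 8 + 14 + 4 + 12  # preamble+SFD, header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20        # IPv4 + TCP headers, no options


def max_tcp_payload_mb_s(link_gbit):
    """Best-case TCP payload rate in MB/s for a link of the given speed."""
    raw_bytes_s = link_gbit * 1e9 / 8
    efficiency = (MTU - IP_TCP_HEADERS) / (MTU + ETH_OVERHEAD)
    return raw_bytes_s * efficiency / 1e6


def as_gbit(mbyte_s):
    """Convert a rate in MB/s to Gbit/s."""
    return mbyte_s * 8 / 1000


ceiling_1g = max_tcp_payload_mb_s(1)
burst_gbit = as_gbit(300)
print(f"1Gbit ceiling: {ceiling_1g:.1f} MB/s")
print(f"300 MB/s burst: {burst_gbit:.1f} Gbit/s")
```

Under these assumptions a 1Gbit path tops out a little under 119 MB/s, while the 300MByte/s bursts correspond to 2.4 Gbit/s, so the bursts themselves would not fit through a hard 1Gbit limit.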

          • olivierlambert Vates 🪐 Co-Founder CEO

            I don't think it's a Node limit. My bet is that the content requires a lot of random seeks everywhere, which limits the storage speed. It would be interesting to run the same backup on a very fast NVMe drive, just as a benchmark.

            • markhewitt1978

              As it happens, I do have an expiring server with NVMe that I can use for testing. So XOA stays where it is, and I set up a new CentOS 7 VM on this host to act as a remote. It's using a standard xcp-ng VDI, not a passthrough disk like the other one.

              An fio test gives: write: IOPS=135k, BW=526MiB/s (551MB/s)(1950MiB/3711msec)

              Now keep in mind my current backups are still running through XOA - it's a production backup.

              However, I kick off a backup of a different subset of servers from the one the current live backup is running on. It hits the NVMe NFS server and writes to disk at .... 30MBps :Z

              The most interesting aspect is the incoming bandwidth to XOA. You'd think that if it were a backpressure issue caused by not being able to write to the remote's disk fast enough, the incoming bandwidth would increase as it's sending data out to two hosts. In fact the opposite seems to happen, and the bandwidth being sent to my 'live' backup decreases in proportion to the bandwidth now being sent to the NVMe remote.

              2317e848-5b08-41b2-80ef-f37806e3334a-image.png

              I'm not sure what this means but it is interesting.

              • olivierlambert Vates 🪐 Co-Founder CEO

                Indeed, it's interesting. You did the fio randwrite test with 4k?

                • markhewitt1978

                  I used the command line you sent to me

                  fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --ramp_time=4
                  test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
                  fio-3.7
                  Starting 1 process
                  test: No I/O performed by libaio, perhaps try --debug=io option for details?4s]
                  
                  test: (groupid=0, jobs=1): err= 0: pid=2290: Thu Jan 30 10:24:50 2020
                    write: IOPS=148k, BW=576MiB/s (604MB/s)(1905MiB/3306msec)
                     bw (  KiB/s): min=554144, max=599104, per=99.60%, avg=587724.00, stdev=16710.46, samples=6
                     iops        : min=138536, max=149776, avg=146931.00, stdev=4177.61, samples=6
                    cpu          : usr=11.56%, sys=42.24%, ctx=30262, majf=0, minf=23
                    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=215.0%
                       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
                       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
                       issued rwts: total=0,487662,0,0 short=0,0,0,0 dropped=0,0,0,0
                       latency   : target=0, window=0, percentile=100.00%, depth=64
                  
                  Run status group 0 (all jobs):
                    WRITE: bw=576MiB/s (604MB/s), 576MiB/s-576MiB/s (604MB/s-604MB/s), io=1905MiB (1998MB), run=3306-3306msec
                  
                  Disk stats (read/write):
                    xvdb: ios=0/1037381, merge=0/2159, ticks=0/437297, in_queue=437329, util=98.68%
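
The headline figures in this output can be cross-checked from the raw counters. A small sketch, using only numbers that appear in the fio report above (the variable names are made up):

```python
MIB = 1024 * 1024

writes_issued = 487_662  # from "issued rwts: total=0,487662,0,0"
block_size = 4096        # --bs=4k
runtime_s = 3.306        # run=3306msec

total_bytes = writes_issued * block_size
iops = writes_issued / runtime_s
bw_mib_s = total_bytes / MIB / runtime_s  # binary units, as in "576MiB/s"
bw_mb_s = total_bytes / 1e6 / runtime_s   # decimal units, as in "604MB/s"

print(
    f"io={total_bytes / MIB:.0f}MiB "
    f"IOPS={iops / 1000:.0f}k "
    f"BW={bw_mib_s:.0f}MiB/s ({bw_mb_s:.0f}MB/s)"
)
```

The result agrees with fio's own summary line, and it puts the 30MBps seen during the backup at roughly 5% of what this disk sustained under 4k random writes.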
                  
                  • olivierlambert Vates 🪐 Co-Founder CEO

                    Great, so indeed, this storage should be able to deal with a lot of random write IOPS 🙂

                    • markhewitt1978

                      I think the next experiment would be to use the upcoming backup proxies to remove XOA from the equation and allow for multiple locations feeding into the NFS remote.

                      • olivierlambert Vates 🪐 Co-Founder CEO

                        Yes, but to compare apples to apples, you'll need to avoid using "bigger" network bandwidth (the combined bandwidth of the proxies). You already have 10G everywhere, right?

                        • markhewitt1978 @olivierlambert

                          @olivierlambert said in How to improve backup performance, where's my bottleneck:

                          Yes, but to compare apples to apples, you'll need to avoid using "bigger" network bandwidth (the combined bandwidth of the proxies). You already have 10G everywhere, right?

                          No. The backup server has 10G, but the servers it's backing up have a mixture of 1G and 2G.

                          At the moment things are acceptable, and disk performance is going to be a limitation anyway, so this is now purely out of personal interest!

                          • olivierlambert Vates 🪐 Co-Founder CEO

                            And it's great that you are interested in that, because it helps push things forward 🙂 Eager to have proxies coming in beta.

                            • snigy

                              The server has a 10G DAC link to a dedicated 10G switch. The disk it's writing to is a NAS iSCSI target with a single 10GbE NIC. Network traffic is low, under 30Mb/s.
                              [image: Screenshot from 2020-02-01 16-26-10.png]
                              [image: pdd.png]
                              The first peak is memory read.
