Testing ZFS with XCP-ng



  • Hi all,

    I spent some time trying to get ZFS to work.

    The main issue is that ZFS doesn't support O_DIRECT and that blktap only opens .vhd files with O_DIRECT.

    I patched blktap to remove O_DIRECT, replace it with O_DSYNC, and tried running ZFS on a file SR. Here are the changes: https://gist.github.com/nraynaud/b32bd612f5856d1efc1418233d3aeb0f

    The steps are the following:

    • remove the current blktap from your host
    • install the patched blktap
    • install the ZFS support RPMs
    • create and mount the ZFS volume
    • create the file SR pointing to the mounted directory

    Here is a ZFS test script:

    XCP_HOST_UNDER_TEST=root@yourmachine
    rsync -r zfs blktap-3.5.0-1.12test.x86_64.rpm ${XCP_HOST_UNDER_TEST}:
    ssh ${XCP_HOST_UNDER_TEST} yum remove -y blktap
    ssh ${XCP_HOST_UNDER_TEST} yum install -y blktap-3.5.0-1.12test.x86_64.rpm
    ssh ${XCP_HOST_UNDER_TEST} yum install -y zfs/*.rpm
    ssh ${XCP_HOST_UNDER_TEST} depmod -a
    ssh ${XCP_HOST_UNDER_TEST} modprobe zfs
    ssh ${XCP_HOST_UNDER_TEST} zpool create -f -m /mnt/zfs tank /dev/mydev
    ssh ${XCP_HOST_UNDER_TEST} zfs set sync=disabled tank
    ssh ${XCP_HOST_UNDER_TEST} zfs set compression=lz4 tank
    ssh ${XCP_HOST_UNDER_TEST} zfs list
    
    SR_ZFS_UUID=`ssh ${XCP_HOST_UNDER_TEST} "xe sr-create host-uuid=${XCP_HOST_UNDER_TEST_UUID} name-label=test-zfs-sr type=file other-config:o_direct=false device-config:location=/mnt/zfs/test-zfs-sr"`
    

    In detail

    ZFS options

    Obviously, you can create the ZFS volume as you need (mirror, RAID-Z, doesn't matter, it's entirely up to you!)
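
    For example, a couple of possible layouts (device names are placeholders, adapt them to your hardware):

    # simple mirror
    zpool create -f -m /mnt/zfs tank mirror /dev/sdb /dev/sdc

    # RAID-Z1 across three disks
    zpool create -f -m /mnt/zfs tank raidz /dev/sdb /dev/sdc /dev/sdd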

    Regarding options: due to the O_DIRECT removal (it is replaced by O_DSYNC), the default behaviour seems to be as if you had sync=always. To get ZFS's real performance, it's better to use sync=disabled.

    Compression (using lz4) also offers a nice boost. Finally, if you are feeling a bit kamikaze, you can try dedup, but frankly it's not worth the risk.

    Note: please extend your dom0 memory to get better performance in general! 4GiB seems to be a minimum; remember that your host will now act as a cache for your storage.
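
    For reference, a rough sketch of how you could bump the dom0 memory and cap the ZFS ARC (the values below are only examples, adjust them to your host):

    # give dom0 more memory (here 8 GiB), then reboot the host
    /opt/xensource/libexec/xen-cmdline --set-xen dom0_mem=8192M,max:8192M

    # optionally cap the ARC (here 4 GiB, value in bytes), applied on the next module load
    echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf

    # check the current ARC size and limit
    grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats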

    SR creation

    We added a specific flag to allow SR usage without O_DIRECT:

    xe sr-create host-uuid=<YOUR_HOST_UUID> name-label=ZFS-SR type=file other-config:o_direct=false device-config:location=/tank
    

    Note the required other-config:o_direct=false flag.

    Download

    The files are here: https://cloud.vates.fr/index.php/s/zDSLcsJ4l5tYHwL?path=%2F

    I would be interested in someone else reproducing my work, fiddling with ZFS, and telling us whether it's worth pursuing.

    Thanks,

    Nico.



  • 💻 I did a test 🙂

    Environment: XCP-ng 7.4 with latest yum updates

    ZFS works as expected! (Note: ashift=12 --> I have 4k discs)
    [screenshot]
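
    For reference, ashift is fixed at pool creation time; on 4k-sector discs you would typically force it like this (pool and device names are just examples):

    zpool create -f -o ashift=12 -m /mnt/zfs test mirror /dev/sdb /dev/sdc

    # verify what the pool actually uses (should report ashift: 12)
    zdb -C test | grep ashift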

    Test 1 - File SR on top of a ZFS folder

    • created a new dataset with zfs create test/sr, which is mounted at /mnt/zfs/sr
    • created a new file SR with xe sr-create host-uuid=<host-uuid> type=file shared=false name-label=<sr-label> device-config:location=/mnt/zfs/sr

    Result: could not boot my ubuntu-vm

    Test 2 - EXT SR on top of a ZFS block device

    • created a new block device (zvol) with zfs create -V 100G test/ext-sr
    • created a new EXT SR with xe sr-create content-type=user host-uuid=<host-uuid> shared=false type=ext device-config:device=/dev/zvol/test/ext-sr name-label=<sr-label>

    Result: here my ubuntu-vm also did not boot

    Test 3 - Boot ubuntu-vm on the "old" LVM SR

    Result: could not boot my ubuntu-vm here either

    Hints

    • the local copy of the vm completed without errors

    • in all three cases the vm got stuck on the following boot screen:
      [screenshot: boot screen]

    • turning off the stuck vm takes a long time

    Test 4

    (after carefully reading your script and using other-config:o_direct=false)

    Result: vm is booting 🙂 🙂 🙂



  • Feel free to do some tests (with fio for example, with bs=4k to test IOPS and bs=4M to test max throughput)
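
    For example, something along these lines run inside a test VM on the ZFS-backed SR (file name, size and runtime are just examples):

    # random 4k reads to estimate IOPS
    fio --name=randread --filename=/root/fio-test.bin --size=4G --rw=randread \
        --bs=4k --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based

    # sequential 4M reads to estimate max throughput
    fio --name=seqread --filename=/root/fio-test.bin --size=4G --rw=read \
        --bs=4M --ioengine=libaio --iodepth=8 --direct=1 --runtime=60 --time_based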



  • Right now I'm watching the soccer games in Russia, so just a little test 😉

    dd inside vm

    root@ubuntu:~# dd if=/root/test.img of=/dev/null bs=4k
    812740+0 records in
    812740+0 records out
    3328983040 bytes (3.3 GB, 3.1 GiB) copied, 17.4961 s, 190 MB/s
    

    dd on host

    [root@xen sr3]# dd if=./83377032-f3d5-411f-b049-2e76048fd4a2.vhd of=/dev/null bs=4k 
    2307479+1 records in
    2307479+1 records out
    9451434496 bytes (9.5 GB) copied, 38.5861 s, 245 MB/s
    

    A side note: my discs are happy with ZFS now, I hope you can see it 😉

    [screenshot]



  • @nraynaud Nice work! Will report on the weekend with my tests.

    @olivierlambert I love fio. bonnie++ is also worth a shot.
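
    A minimal bonnie++ run could look like this (the path is just an example; the size should be roughly twice the machine's RAM):

    # -d: test directory on the ZFS-backed disk, -s: file size in MiB, -u: user to run as
    bonnie++ -d /mnt/test -s 8192 -u root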

    @borzel you have a floppy drive!!!



  • thanks @r1

    I started a PR here: https://github.com/xapi-project/blktap/pull/253. It's not exactly the same code as the RPM above: in the RPM, O_DIRECT has been removed altogether, so that people can compare ext3 without O_DIRECT against ZFS. In the PR, I only drop O_DIRECT when an open() with it has failed.

    I'm not sure it will be accepted; I have no clue what their policy is, and no idea what their QA process is. We are really at the draft stage.

    Nico.



  • @nraynaud maybe an extra config parameter at sr-create time would be the right way. Handling a failed open may not seem appealing. My 2c.



  • Status update:

    I installed XCP-ng 7.4 on a "Storage Server" with 16 x 2TB HDDs, gave DOM0 8GB of RAM, and created a zpool (sync off, lz4 compression) with a file SR on top.
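
    Roughly, such a pool and SR could be created as follows (the raidz2 layout and device names below are just one possibility):

    # one raidz2 vdev across all 16 disks (sdb..sdq), mounted at /mnt/zfs
    zpool create -f -m /mnt/zfs tank raidz2 /dev/sd{b..q}
    zfs set sync=disabled tank
    zfs set compression=lz4 tank

    # file SR on top, with O_DIRECT disabled
    xe sr-create host-uuid=<YOUR_HOST_UUID> name-label=ZFS-SR type=file \
        other-config:o_direct=false device-config:location=/mnt/zfs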

    In combination with a local XenOrchestra VM (and configured continuous replication jobs) we now have a self-contained "All-in-One Replication Server" with no dependency on hardware RAID controllers 🙂

    In case of an emergency in my production storage/virtualisation environment I can boot every VM I need with just this one server. Or power up my DHCP VM if the other systems are down for service on Christmas/New Year's Eve 😉

    I'm more than happy with that 🙂

    Side note: this is of course not ready for "production" and doesn't replace our real backup... just in case someone asks 😉



  • How's the performance?



  • Currently hundreds of gigabytes are being transferred 🙂

    I switched XenOrchestra from the VM to our hardware backup server, so we can use the 10Gbit cards. (The VM ran iperf at about 1.6 Gbit/s, the hardware at around 6 Gbit/s.)
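
    (For reference, those numbers come from plain iperf runs, roughly like this; the host name is a placeholder:)

    # on the XCP-ng AiO host
    iperf -s

    # on the backup server (or inside the XO VM)
    iperf -c xcp-ng-aio -t 30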

    The data flow is:
    NFS storage (ZFS) -> XenServer -> XO -> XCP-ng (ZFS), all connected over 10Gbit Ethernet.

    I did not measure exact numbers, I was just impressed and happy that it runs 🙂

    I will try to present my data and conclusions in more detail later...

    Setup

    [screenshot: setup diagram]

    Settings

    • All servers are connected via 10Gbit Ethernet cards (Emulex OnConnect)
    • Xen1 and Xen2 are a pool, Xen1 is master
    • XenOrchestra connects via http:// (no SSL, to improve speed)
    • DOM0 has 8GB RAM on the XCP-ng AiO (so ZFS uses 4GB)

    Numbers

    • iperf between the backup server and the XCP-ng AiO: ~6 Gbit/s
    • Transfer speed of the continuous replication job from the Xen pool to the XCP-ng AiO (output of zpool iostat with a 30-second sample rate on the XCP-ng AiO)

    [screenshot: zpool iostat output]

    The speed varies a bit because of lz4 compression; current compression ratio: 1.64x

    Here's a quick look at the graphs (on the XCP-ng AiO)

    [screenshot: performance graphs]

    and a quick look at top (on XCP-ng AiO)

    [screenshot: top output]

    and arcstat.py (sample rate 1sec) (on XCP-ng AiO)

    [screenshot: arcstat.py output]

    and arc_summary.py (on XCP-ng AiO)

    [screenshot: arc_summary.py output]
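
    (The screenshots above roughly correspond to running the following on the XCP-ng AiO; the pool name is a placeholder:)

    zpool iostat tank 30     # pool throughput, 30 second samples
    arcstat.py 1             # ARC hit rate, 1 second samples
    arc_summary.py           # overall ARC statistics
    top                      # dom0 CPU and memory usage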

    @olivierlambert I hope you can see something with these numbers 💻



  • To avoid the problem of XenOrchestra running as a VM (and not using the full 10Gbit of the host), I would install it directly inside DOM0. It could be preinstalled in XCP-ng as a web GUI 🙂

    Maybe I'll try this later on my AiO server...



  • Not a good idea to embed XO inside Dom0… (also, it will cause issues if you have multiple XOs; it's meant to run in one place only).

    That's why the virtual appliance is better 🙂



  • @olivierlambert My biggest question: why is this not a good idea?

    How do big datacenters solve the performance issue? Just install on hardware? With a virtual appliance you are limited to Gbit speed.

    I run more than one XO: one for backups and one for user access. I don't want users on my backup server.

    Maybe this is something for a separate thread...



  • VM speed is not limited to GBit speed. XO proxies will deal with big infrastructures to avoid bottlenecks.





  • I did a little more testing with ZFS and tried to move a disk from localpool to srpool

    Setup

    • XCP-ng on a single 500GB disk, ZFS pool created on a separate partition of that disk (~420GB) -> localpool -> SR named localzfs
    • ZFS pool on 4 x 500GB disks (RAID10) -> srpool -> SR named sr1 (a rough sketch of the pool creation follows below)
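
    Something like this would produce that layout (the partition and device names are placeholders):

    # localpool on the spare partition of the system disk (default mountpoint /localpool)
    zpool create -f localpool /dev/sda4

    # srpool as striped mirrors ("RAID10") across the four 500GB disks
    zpool create -f -m /mnt/srpool srpool mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde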

    Error

    [screenshot: error message]

    Output of /var/log/SMLog

     Jul  7 15:44:07 xen SM: [28282] ['uuidgen', '-r']
     Jul  7 15:44:07 xen SM: [28282]   pread SUCCESS
     Jul  7 15:44:07 xen SM: [28282] lock: opening lock file /var/lock/sm/4c47f4b0-2504-fe56-085c-1ffe2269ddea/sr
     Jul  7 15:44:07 xen SM: [28282] lock: acquired /var/lock/sm/4c47f4b0-2504-fe56-085c-1ffe2269ddea/sr
     Jul  7 15:44:07 xen SM: [28282] vdi_create {'sr_uuid': '4c47f4b0-2504-fe56-085c-1ffe2269ddea', 'subtask_of': 'DummyRef:|62abb588-91e6-46d7-8c89-a1be48c843bd|VDI.create', 'vdi_type': 'system', 'args': ['10737418240', 'xo-os', '', '', 'false', '19700101T00:00:00Z', '', 'false'], 'o_direct': False, 'host_ref': 'OpaqueRef:ada301d3-f232-3421-527a-d511fd63f8c6', 'session_ref': 'OpaqueRef:02edf603-3cdf-444a-af75-1fb0a8d71216', 'device_config': {'SRmaster': 'true', 'location': '/mnt/srpool/sr1'}, 'command': 'vdi_create', 'sr_ref': 'OpaqueRef:0c8c4688-d402-4329-a99b-6273401246ec', 'vdi_sm_config': {}}
     Jul  7 15:44:07 xen SM: [28282] ['/usr/sbin/td-util', 'create', 'vhd', '10240', '/var/run/sr-mount/4c47f4b0-2504-fe56-085c-1ffe2269ddea/11a040ba-53fd-4119-80b3-7ea5a7e134c6.vhd']
     Jul  7 15:44:07 xen SM: [28282]   pread SUCCESS
     Jul  7 15:44:07 xen SM: [28282] ['/usr/sbin/td-util', 'query', 'vhd', '-v', '/var/run/sr-mount/4c47f4b0-2504-fe56-085c-1ffe2269ddea/11a040ba-53fd-4119-80b3-7ea5a7e134c6.vhd']
     Jul  7 15:44:07 xen SM: [28282]   pread SUCCESS
     Jul  7 15:44:07 xen SM: [28282] lock: released /var/lock/sm/4c47f4b0-2504-fe56-085c-1ffe2269ddea/sr
     Jul  7 15:44:07 xen SM: [28282] lock: closed /var/lock/sm/4c47f4b0-2504-fe56-085c-1ffe2269ddea/sr
     Jul  7 15:44:07 xen SM: [28311] lock: opening lock file /var/lock/sm/319aebee-24f0-8232-0d9e-ad42a75d8154/sr
     Jul  7 15:44:07 xen SM: [28311] lock: acquired /var/lock/sm/319aebee-24f0-8232-0d9e-ad42a75d8154/sr
     Jul  7 15:44:07 xen SM: [28311] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/319aebee-24f0-8232-0d9e-ad42a75d8154/ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0.vhd']
     Jul  7 15:44:07 xen SM: [28311]   pread SUCCESS
     Jul  7 15:44:07 xen SM: [28311] vdi_attach {'sr_uuid': '319aebee-24f0-8232-0d9e-ad42a75d8154', 'subtask_of': 'DummyRef:|b0cce7fa-75bb-4677-981a-4383e5fe1487|VDI.attach', 'vdi_ref': 'OpaqueRef:9daa47a5-1cae-40c2-b62e-88a767f38877', 'vdi_on_boot': 'persist', 'args': ['false'], 'o_direct': False, 'vdi_location': 'ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0', 'host_ref': 'OpaqueRef:ada301d3-f232-3421-527a-d511fd63f8c6', 'session_ref': 'OpaqueRef:a27b0e88-cebe-41a7-bc0c-ae0eb46dd011', 'device_config': {'SRmaster': 'true', 'location': '/localpool/sr'}, 'command': 'vdi_attach', 'vdi_allow_caching': 'false', 'sr_ref': 'OpaqueRef:e4d0225c-1737-4496-8701-86a7f7ac18c1', 'vdi_uuid': 'ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0'}
     Jul  7 15:44:07 xen SM: [28311] lock: opening lock file /var/lock/sm/ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0/vdi
     Jul  7 15:44:07 xen SM: [28311] result: {'o_direct_reason': 'SR_NOT_SUPPORTED', 'params': '/dev/sm/backend/319aebee-24f0-8232-0d9e-ad42a75d8154/ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0', 'o_direct': True, 'xenstore_data': {'scsi/0x12/0x80': 'AIAAEmNhODhmOGRjLTRhYjYtNGIgIA==', 'scsi/0x12/0x83': 'AIMAMQIBAC1YRU5TUkMgIGNhODhmOGRjLTRhYjYtNGJlNy05MDg0LTFiNDFjYmM4YzFjMCA=', 'vdi-uuid': 'ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0', 'mem-pool': '319aebee-24f0-8232-0d9e-ad42a75d8154'}}
     Jul  7 15:44:07 xen SM: [28311] lock: closed /var/lock/sm/ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0/vdi
     Jul  7 15:44:07 xen SM: [28311] lock: released /var/lock/sm/319aebee-24f0-8232-0d9e-ad42a75d8154/sr
     Jul  7 15:44:07 xen SM: [28311] lock: closed /var/lock/sm/319aebee-24f0-8232-0d9e-ad42a75d8154/sr
     Jul  7 15:44:08 xen SM: [28337] lock: opening lock file /var/lock/sm/319aebee-24f0-8232-0d9e-ad42a75d8154/sr
     Jul  7 15:44:08 xen SM: [28337] lock: acquired /var/lock/sm/319aebee-24f0-8232-0d9e-ad42a75d8154/sr
     Jul  7 15:44:08 xen SM: [28337] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/319aebee-24f0-8232-0d9e-ad42a75d8154/ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0.vhd']
     Jul  7 15:44:08 xen SM: [28337]   pread SUCCESS
     Jul  7 15:44:08 xen SM: [28337] lock: released /var/lock/sm/319aebee-24f0-8232-0d9e-ad42a75d8154/sr
     Jul  7 15:44:08 xen SM: [28337] vdi_activate {'sr_uuid': '319aebee-24f0-8232-0d9e-ad42a75d8154', 'subtask_of': 'DummyRef:|1f6f53c2-0d1e-462c-aeea-5d109e55d44f|VDI.activate', 'vdi_ref': 'OpaqueRef:9daa47a5-1cae-40c2-b62e-88a767f38877', 'vdi_on_boot': 'persist', 'args': ['false'], 'o_direct': False, 'vdi_location': 'ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0', 'host_ref': 'OpaqueRef:ada301d3-f232-3421-527a-d511fd63f8c6', 'session_ref': 'OpaqueRef:392bd2b6-2a84-48ca-96f1-fe0bdaa96f20', 'device_config': {'SRmaster': 'true', 'location': '/localpool/sr'}, 'command': 'vdi_activate', 'vdi_allow_caching': 'false', 'sr_ref': 'OpaqueRef:e4d0225c-1737-4496-8701-86a7f7ac18c1', 'vdi_uuid': 'ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0'}
     Jul  7 15:44:08 xen SM: [28337] lock: opening lock file /var/lock/sm/ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0/vdi
     Jul  7 15:44:08 xen SM: [28337] blktap2.activate
     Jul  7 15:44:08 xen SM: [28337] lock: acquired /var/lock/sm/ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0/vdi
     Jul  7 15:44:08 xen SM: [28337] Adding tag to: ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0
     Jul  7 15:44:08 xen SM: [28337] Activate lock succeeded
     Jul  7 15:44:08 xen SM: [28337] lock: opening lock file /var/lock/sm/319aebee-24f0-8232-0d9e-ad42a75d8154/sr
     Jul  7 15:44:08 xen SM: [28337] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/319aebee-24f0-8232-0d9e-ad42a75d8154/ca88f8dc-4ab6-4be7-9084-1b41cbc8c1c0.vhd']
     Jul  7 15:44:08 xen SM: [28337]   pread SUCCESS


  • @borzel said in Testing ZFS with XCP-ng:

    Jul 7 15:44:07 xen SM: [28311] result: {'o_direct_reason': 'SR_NOT_SUPPORTED',

    I found it after posting this and looking into it... is this a known thing?



  • that's weird: https://github.com/xapi-project/sm/blob/master/drivers/blktap2.py#L1002

    I don't even know how my tests worked before.



  • Maybe because it wasn't triggered without a VDI live migration?



  • I did no live migration... just a local copy of a vm
    ... ah, I should read before I answer ...