XCP-ng with ZFS
Since XCP-ng 7.5, you can use ZFS as a local storage backend. You could already do this with a ZFS box exporting a share via NFS, but this time ZFS runs directly on your host.
We'll see how it's possible and do some basic benchmarks.
WARNING: This is a first implementation, considered experimental. Do not play with fire. Or at least back up your VMs often with Xen Orchestra.
For those who don't know, ZFS is a formidable copy-on-write filesystem, allowing transparent compression, cache, tiering, deduplication and much more. You can read more about it on this page.
ZFS is not easy to run on XCP-ng/XenServer. Why's that? Because it doesn't support the
O_DIRECT flag. And the XCP-ng/XenServer storage stack relies on it quite a bit. Basically, ZFS is not compatible.
Unless you modify the storage code to allow an optional
O_DIRECT bypass for a specific storage repository. This way, none of your existing storage repositories change behavior: only ZFS does. That's why the change is safe, and it was even merged into the XenServer mainline! However, we don't know when this change will land in an official Citrix XenServer release, and it's very likely Citrix won't support it. But XCP-ng does!
Install/enable ZFS on your hosts
On each host that you want to run ZFS on, you need to follow these steps:
First, install the modified storage components:
yum install --enablerepo="xcp-ng-extras" blktap vhd-tool
Also install ZFS packages built for XCP-ng:
yum install --enablerepo="xcp-ng-extras" kmod-spl-4.4.0+10 kmod-zfs-4.4.0+10 spl zfs
Finally, enable the module:
depmod -a && modprobe zfs
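To make sure everything went well before going further, you can check that the module is actually loaded and that the userland tools see it (a quick sanity check, not part of the original steps):

```shell
# The zfs module should appear in the loaded module list
lsmod | grep zfs

# Print module details (version, dependencies) as known to the kernel
modinfo zfs
```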
Note: Again, ZFS support is still experimental. Please use it for non-critical stuff for now or use Xen Orchestra to have backups.
General ZFS rules
- More RAM is always better. ZFS uses your RAM as a read cache. It's up to you to decide how much you want to dedicate to accelerating your storage, versus using it for running your VMs.
- SSD as read cache (L2ARC) will also speed up your frequently read blocks. The bigger the SSD, the bigger your read cache. However, it is highly recommended to max out RAM first, as L2ARC also uses RAM to store its headers: the bigger the SSD, the more RAM is used for these headers.
- Want ultra-fast synchronous writes (e.g. NFS with sync=on by default)? You'll need an SLOG device. Basically, synchronous writes are written very fast asynchronously to your storage, while also being synchronously written to the SLOG device. If a problem happens with the async write to your normal disks, ZFS will read the "good" copy from your SLOG device and write it to your disks again. In normal use, an SLOG device is never read from, only written to. Be sure to choose an SSD with high write endurance, or it will wear out!
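As a sketch of how these pieces fit together, here is what a pool combining them could look like. The device names are hypothetical; adapt them to your own hardware:

```shell
# A mirrored pool made of two HDDs (hypothetical devices)
zpool create tank mirror /dev/sdc /dev/sdd

# Add an SSD partition as an L2ARC read cache
zpool add tank cache /dev/sde1

# Add a high-endurance SSD partition as an SLOG for synchronous writes
zpool add tank log /dev/sde2
```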
Before doing ZFS benchmarks, we'll run a baseline with the current hardware. First, we'll use a basic 1TiB 7200rpm drive as an SR LVM backend. Everything will run on the same hardware, a small Dell T30 with an Intel® Xeon® processor E3-1225 v5 and 32GiB of RAM.
For the sake of simplicity (to be understood by everyone, even non-Linux experts), we'll use CrystalDiskMark on Windows Server 2012 R2 64-bit, with all updates installed. We use a 4GiB dataset.
More benchmarks on Linux with FIO will probably come later.
Note: the system disk isn't running locally, but on an NFS share. This will avoid performance glitches due to Windows doing unknown stuff in the background. All benchmarks are done on a dedicated virtual disk with the same size for every run.
Also, please note this is a very simple example with a single drive, you can obviously do a LOT more with powerful RAID modes (RAID-Z, striped or mirror mode, RAID10-like etc.). You can learn more here: http://www.zfsbuild.com/2010/05/26/zfs-raid-levels/
Local LVM HDD
As you can see and as expected, we are hitting HDD limits easily:
The first line is sequential read and write speed. Thanks to read-ahead, performance is good for an HDD. But as soon as you start doing random reads and writes with small blocks, a physical spinning disk will always struggle.
Let's go create our ZFS pool. We'll start very simply, with just our free HDD partition:
zpool create -o ashift=12 -m /mnt/zfs tank /dev/sda4
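You can then verify what was created (the exact output will depend on your hardware):

```shell
# Show pool layout and health
zpool status tank

# Confirm the mountpoint we just set
zfs get mountpoint tank

# Confirm the sector-size hint (ashift=12 means 4K sectors)
zpool get ashift tank
```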
Now, let's create the XCP-ng SR (Storage Repository):
xe sr-create host-uuid=<HOST_UUID> name-label=test-zfs-sr type=file other-config:o_direct=false device-config:location=/mnt/zfs/test-zfs-sr
Then, disable synchronous writes on the pool:
zfs set sync=disabled tank
We disable sync because we suspect cache poisoning with
blktap. This is something we are still investigating, and blktap issues probably still impact write speed even without sync. It's likely not really "async" either: we observed the same speed in "sync" mode with the same hardware on a fully non-virtualized host. See the conclusion for more details.
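If you'd rather keep safe synchronous semantics (at the cost of write speed in this setup), the setting can be checked and restored at any time:

```shell
# Show the current sync policy on the pool
zfs get sync tank

# Restore the default behavior: honor synchronous write requests
zfs set sync=standard tank
```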
Note: We have a very small amount of RAM in our dom0 for ZFS (dom0 got just 2GiB in total).
So let's redo our tests, this time on ZFS:
That's interesting: we got a better sequential read speed, but a worse write speed. ZFS seems constrained by the very low amount of RAM available.
In the random read/write scenario, there is no miracle when you don't have enough cache available: it's roughly the same speed as the physical device, a bit better but nothing huge.
Okay, let's wipe this test VDI, enable compression and try the same tests again. Just running
zfs set compression=lz4 tank will activate it. All new disks created after this command is run will be compressed (not the existing ones).
Why is there no change? Because the dataset here can't be compressed efficiently: it's fully random.
Indeed, ZFS reports a compression ratio of only 1.01x, which is negligible.
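That ratio can be read directly from ZFS:

```shell
# A value near 1.00x means the data didn't compress at all
zfs get compressratio tank
```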
Let's switch to a non-randomized content and let's redo the test:
Okay, that's almost a caricature and not a realistic load, but as you can see, if you have compressible data, LZ4 compression will help you a LOT, for a very small CPU usage.
"L2ARC" means "Level 2 ARC" (Level 1 is… RAM). So it's a read cache, filled with a mix of the most recently read data and the most frequently read data. Since we have a small SSD inside the machine, we'll use it. It's a Samsung EVO 850, not a big beast. In our system, it's visible as /dev/sdb, so let's add it to the pool as a cache device:
zpool add tank cache /dev/sdb
That's it! Is there any impact? As expected, read speed is far better, especially in random reads, because the SSD provides a cache:
Note: that would be better with a real load, with some VMs reading the same blocks often, more data would have been promoted into the SSD, enhancing the read speed further.
Also, an L2ARC cache combined with a very limited amount of RAM won't help you a lot. It can even be counterproductive: ZFS needs to store an index of the cached data in RAM, so the larger your L2ARC disk, the more RAM is used for these L2ARC headers. That means less RAM for the first-level cache.
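To see whether the cache device is actually being filled and hit, a couple of read-only queries help (assuming the pool is named tank, as above):

```shell
# Per-device I/O statistics, including the cache vdev
zpool iostat -v tank

# L2ARC hit/miss counters and current size, from the kernel stats
awk '$1 == "l2_hits" || $1 == "l2_misses" || $1 == "l2_size" {print $1, $3}' \
    /proc/spl/kstat/zfs/arcstats
```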
Linux benchmarks with FIO will give us the opportunity to warm up the cache, and see the result in better conditions.
More RAM in Dom0
Okay, let's make one final change. ZFS is meant to run with a lot of RAM. I mean, a lot. 16GiB is a decent minimum for very good performance, and 2GiB is the bare minimum to operate correctly, which wasn't even the case in our first bench (dom0 had just 2GiB in total). Because I have a 32GiB RAM host, I'll extend the dom0 to 8GiB of RAM and see if it's better.
To do so:
/opt/xensource/libexec/xen-cmdline --set-xen dom0_mem=8192M,max:8192M
Then reboot the host. And let's do another benchmark:
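Before benchmarking, it's worth confirming that the new allocation took effect and seeing how much of it the ARC is using:

```shell
# Total memory visible in dom0 after the change
free -h

# Current ARC size and target size, in bytes
awk '$1 == "size" || $1 == "c" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
```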
The bottleneck for sequential read speed is… the
tapdisk process (CPU bound), the component linking the VHD file to the VM. Obviously, this is something we could improve in XCP-ng in the future.
As you can see, we are almost beating the HDD sequential read while doing random read, which is very good.
In short, if your dataset can be cached in RAM, you'll have RAM speed for your VMs' read requests. Which can be very interesting for read-intensive operations.
Regarding write speed, we can probably do better. I suspect a bottleneck in the
tapdisk process with cache poisoning, which limits how fast a write can be done inside ZFS.
Bonus: Linux benchmarks
On an up-to-date CentOS 7 VM, for read speed:
READ: bw=340MiB/s (357MB/s), 340MiB/s-340MiB/s (357MB/s-357MB/s), io=4096MiB (4295MB), run=12035-12035msec
For write speed:
WRITE: bw=185MiB/s (194MB/s), 185MiB/s-185MiB/s (194MB/s-194MB/s), io=4096MiB (4295MB), run=22087-22087msec
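The exact fio invocation isn't given in the post; runs of roughly this shape produce summary lines like the ones above (job parameters here are assumptions, for illustration only):

```shell
# Sequential read of a 4GiB file (illustrative parameters)
fio --name=seqread --rw=read --bs=1M --size=4G \
    --ioengine=libaio --direct=1 --filename=/root/fio-testfile

# Sequential write with the same file size
fio --name=seqwrite --rw=write --bs=1M --size=4G \
    --ioengine=libaio --direct=1 --filename=/root/fio-testfile
```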
This is relatively consistent with the Windows results (a bit better for writes).
Conclusion
If a lot of read speed is required for your VM infrastructure, ZFS on XCP-ng can be a really powerful solution. You basically dedicate RAM as a read cache for your VM storage, but you can keep a large SR (thin provisioned!) with classical HDDs to store your data.
We are almost sure there is still a big write bottleneck in the XCP-ng/XenServer storage path; the next step will be to investigate it. For example, we could start experimenting with
smapiv3, which no longer uses
blktap in userspace. We are confident this will unleash the full power of ZFS for your XCP-ng hosts!
In the end, if you prefer very fast write speed, an SSD in a local LVM SR would be the best choice, but the cost is not the same.