ZFS-ng: an intro on SMAPIv3

Devblog Sep 23, 2022

Following our overview of the existing storage stacks in XCP-ng, we wanted to share with you our initial work building a ZFS driver on top of the latest storage stack, called SMAPIv3. If you missed it, read our previous article for more details:

Understanding the storage stack in XCP-ng
Follow this guide to understand the existing and the future storage architecture in XCP-ng.

Nonetheless, this article is still a first episode, and we'll test more things in future articles (different drives, configurations, RAID levels and so on).

BYOD: Build Your Own Driver

What's great with SMAPIv3 are the choices you have:

  • you choose or create your own datapath (the way your data is read and written between the VM and the storage)
  • you can build from scratch a "driver" to manage your volumes (virtual disk creation, deletion, snapshots…)

This level of choice allows you to get rid of the "VHD format"-monolith you had with SMAPIv1. For our first official PoC on top of SMAPIv3, we made the choice to create an implementation of ZFS volumes, using a "raw" datapath from tapdisk. No modification is needed from the VM at all.

Tapdisk is a Dom0 user-space process that makes the link between what your VM sees as a block device via its PV drivers (like xvda for example) and what's stored inside the Dom0 (eg a ZFS volume). It can work in raw or in vhd mode.

In the future, we will probably build other datapath plugins for specific use cases or performance reasons. tapdisk is great but we imagine implementing completely custom stuff too (like… SPDK? 👼).

Getting most of the usual commands

Now that we've made our datapath choice, we have to write a volume driver. This is done by writing some mapping between VDI commands and what we'll tell the driver to execute (SR creation, volume commands and such).

In our ZFS case, a simple example is to tell it that a VDI.create will do a zfs create -V {volume_name} {volume_size} behind the scenes. The same applies for VDI.snapshot (yay! native ZFS snapshots!) and others, like VDI.destroy and such.

Native storage capabilities

It also means that our ZFS SMAPIv3 implementation won't have to deal with snapshot chains, which are basically copy-on-write chains that you have to coalesce yourself, since it will be entirely managed by ZFS itself (and its Zvols). Clearly, ZFS will be far more efficient than a custom program running in Dom0 to merge blocks in Copy on Write file format (vhd , qcow2 or whatever similar), and even read/write penalty to use a chain will be reduced since it will be entirely delegated by ZFS.

Wrapping everything

When you get all these commands, it's time to wrap it up and deploy the driver on your host. Then, it's up to you to create your SR without any specific ZFS knowledge:

xe sr-create type=zfs-ng name-label=zfsng device-config:file-uri=file:///dev/sdb

This will create the Zpool for you, while being ready to create new volumes:

zfsng  19,5G   422K  19,5G        -         -     0%     0%  1.00x    ONLINE  -

Now, you can use Xen Orchestra to create your new disks, as if it was a normal Storage Repository.


So how does the ZFS-ng driver compare vs the "legacy" (SMAPIv1) ZFS driver? If you don't remember, the SMAPIv1 ZFS driver is pretty basic:

  • you need to create your ZFS pool manually
  • then you tell the SMAPIv1 ZFS driver to use the mount point
  • after that, it will create flat VHD files inside it

So you still rely on tapdisk in VHD mode, and all the coalesce work will be handled by SMAPIv1 (garbage collecting, coalescing and so on). All of this is avoided with the new ZFS driver.

Right now, our PoC is pretty basic and creates the Zpool without compression. But we'll probably enable lz4 by default since it's always a win.

🐄 Requiem for a CoW

ZFS in "file system" mode is using CoW (in fact, it's actually a RoW: Redirect on Write, see this article). Adding a pseudo-filesystem on top of it is REALLY BAD. CoW-on-RoW (or Cow-on-CoW) will cause various amplifications in the datapath. We'll see how catastrophic it is in most situations, unlike using a "plain"/raw block storage for the VM disk (and leave the RoW doing the heavy lifting). Let's test!

Test setup

On the hardware side, we have an HPE ProLiant DL385 Gen10 with an AMD EPYC 7302, with XCP-ng 8.2.1 (8GiB RAM for the Dom0). The test VM is Debian 11, with 4 vCPUs and 4GiB RAM. The single test drive is a cheap Samsung QVO 1TiB SSD.

We obviously rely on fio to do our benchmarks. In random read/write, we use 4k blocksize. 4M is used for sequential. IOdepth is set to 128 and the size to bench is 10GiB on a 50GiB virtual disk. To compare apples to apples, we use lz4 compression on both drivers (legacy ZFS vs ZFS-ng).

Unless specified, we use fio directly on the guest device (eg /dev/xvdb) instead of using a filesystem on top of it.

We are doing our tests on a single Samsung QVO 1TiB (using cheap QLC nand). Results might be different on an NVMe drive but also on a spinning HDD (especially where ZFS cache will matter a lot). Expect more benchmarks in the future!

Sequential throughput

First benchmark is without snapshot chain whatsoever, just one active disk:

Speed is in MiB/s, more is better.

In sequential write, ZFS-ng is simply twice as fast vs storing VHDs on a ZFS filesystem. In read, it's even better: almost tripling the speed. But that's even not the most visible gap!

Random IOPS

Let's do the same benchmark with random reads and writes:

IOPS, more is better

I should probably chose a log scale, because the difference is HUGE. In fact, in random read it's roughly 13 times faster, while being simply 12 times faster in random write.

And with a snapshot chain?

Let's see if having a snapshot will change something in both scenarios. We first create the disk, we snapshot it and finally then we add 10GiB with fio:

IOPS, more is better

The impact is non-negligible in ZFS on SMAPIv1. Few % in random write, but larger in random read!

I only compared in terms of IOPS because sequential read/write penalty is too small to be visible, giving very similar results. Why? Because on a CoW filesystem, every read or write will need an extra… read (on the whole chain). It's very visible in a 4k random read result.

So what about the same test in SMAPIv3 with ZFS-ng?

IOPS, more is better

I will consider this an "overall" draw. So it means our current ZFS-ng driver doesn't suffer from read or write penalty when using a snapshot, at least at this small scale. I triple-checked the results, and yes, the snapshot is here:

# zfs list
tank     4,14M   899G      108K  /mnt/zfs
zfsng    51,6G   847G     47,5K  /zfsng
zfsng/1  51,6G   889G     10,1G  -
zfsng/2     0B   847G       12K  -

Apples to apples?

Okay so we compared a storage using tapdisk with a copy-on-write file format inside (vhd mode) vs tapdisk in raw format leveraging ZFS storage capabilities (snapshots and so on). What if we use a raw format on top of SMAPIv1 ZFS storage (instead of VHD)? We can, thanks to a "custom" VDI create command:

xe vdi-create type=user sm-config:type=raw virtual-size=50GiB sr-uuid=<SR_UUID> name-label=rawdrive

And the result is… the same! Regardless of tapdisk mode, ZFS on SMAPIv1 is slower because it relies on the ZFS filesystem layer, unlike ZFS-ng writing blocks all along the data path to the ZFS volume.

In short, using a file system like ZFS, with RoW/CoW capabilities without actually using them (because we rely on VHD format), is just a waste of resources.

Comparing against non-CoW filesystem

Okay but what about a local ext SR against our new ZFS-ng driver?

MiB/s, more is better
First run vs second run: when you are using a VHD file format on top of an existing filesystem (like ext), the performance will be reduced for your first write benchmark. This is explained easily: when your VHD file grow with the writes, your local filesystem has to manage a growing file, using more resources. The overall penalty can grow up to 50% of capable write speed. Luckily, as soon the space is used, you'll enjoy the maximum speed. In production use-cases, this is barely visible. In our benchmarks here, we always do at least 3 runs and use the average between the last 2.

Those results are interesting. ZFS-ng is a bit better in sequential read, which could be caused by the ZFS lz4 volume compression that we do not have with ext (so the result is more variable depending on the data set). In sequential write, the speed with local ext is still better. In the end, that's likely because ext is far more simple (ie "dumb") so there's less overhead.

But we can say that using a CoW file based storage (VHD) on top of a "non-CoW" filesystem (like ext4) works very well, since we can even reach 50k IOPS. At least for one virtual disk and one physical disk. But keep in mind that in VHD mode, the Dom0 will have to deal with the whole chain logic by itself (coalesce, garbage collecting removed snapshots and so on). The toll of doing this is in fact bigger than we can imagine:

  • when you have tons of VHD files, each rescan/cleanup will take longer
  • coalescing multiple chains is also taking a lot of resources
  • race conditions are very dangerous if your code isn't perfect (eg: one missing parent and the whole chain is dead!)

All of this can be dealt with directly by your favorite CoW/RoW filesystem, like ZFS, leaving the Dom0 to focus on better things: less complexity, less risks, more mature dedicated solutions for your data, and so on.

But also… what about having ZFS managing your RAID? Mirrors, RAIDz and so on: ZFS is offering you a very powerful storage solution. In a coming article, we'll test up to 4 drives in RAIDz1, RAIDz2 and 2x mirrored vdev against a equivalent mdadm solution with ext4 on top, and see the results.

Does it scale?

One usual known bottleneck of SMAPIv1 is the performance per disk (which is the worst case scenario). Luckily, when you use multiple disks with multiple VMs, SMAPIv1 in VHD actually scales. We regularly see SMAPIv1 pools saturating very fast shared storage. Expect some benchmarks soon with multiple VMs/disks with this driver to see if it scales, but since we use a very similar datapath, I'm pretty sure it will. Stay tuned!

Another datapath?

It's also possible to test another datapath, since it's relatively simple in that case, using raw blocks. However, tapdisk is pretty mature and efficient. Another alternative is to tweak tapdisk to use faster code to read/write on a device, like io_uring. And yes, there's also plans for it! Damien, our PhD at Vates, is also working on getting an SPDK datapath for NVMe drives, which should bring very powerful storage performances! The future of storage in XCP-ng is VERY exciting!

What's missing then?

This first PoC ZFS-ng driver will be obviously available for those who want to test it (look for a announcement on our forum soon). We are integrating cool features pretty fast like the ability to create multiple kinds of ZFS setups (mirrors, RAIDZ1, Z2…) without using any ZFS commands. So your xe sr-create command will do everything for you when you pass the right parameters.

So what's left? Right now, the elephant in the room is the inability to migrate a VDI from SMAPIv1 to SMAPIv3, at all. Right now, you simply can't migrate any VDI at all in the SMAPIv3 world. And obviously, no live storage migration either. These are the things we are exploring. Another detail: you don't have any IOPS/throughput reporting view for SMAPIv3 SRs yet (no RRDs data exported so nothing visible in XO).

What about backup and delta capabilities? Some tests need to be done, but we'll clearly need to adapt Xen Orchestra to it: for that, we'll likely rely on NBD and that will be very interesting vs VHD export. Indeed, our initial tests are showing a LOT better performance in NBD than in VHD. But again, some pieces are missing, more news on this later 😊


I hope you liked this "tour" on the current ZFS PoC we are working on. By the way, if you are passionate about low-level storage solutions, we are hiring 😉


Olivier Lambert

Vates CEO & co-founder, Xen Orchestra and XCP-ng project creator. Enthusiast entrepreneur and Open Source advocate. A very happy Finnish Lapphund owner.