Since December, our two teams have been working together to integrate LINSTOR, LINBIT's flagship solution, into XCP-ng.
We faced more challenges than we initially expected along the way, but we also learned a lot about the Xen storage stack, the DRBD technology and the XAPI itself. Time to share!
Where are we now?
We have made a lot of progress over the past few months. DRBD works inside XCP-ng, and we can now install the LINSTOR RPMs to manage DRBD disks on top of LVM devices. In short, the biggest part of the solution is working as expected.
Some features still require more testing, HA for instance. It remains a challenging issue, because the way the XAPI handles HA is incompatible with the way LINSTOR handles devices. To summarize: in XCP-ng, a file descriptor is kept open on every host and used to write the host's status via the heartbeat mechanism of the xha daemon (XenServer HA). This is how HA works in XCP-ng. We cannot use this method with DRBD, because a DRBD block device does not allow multiple open file descriptors at the same time. The good news is that we implemented a patch in the xha daemon that closes and reopens the DRBD device only when needed, working around this limitation. We still need to improve it and test it at scale, though!
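The idea behind the workaround can be sketched as follows. This is a minimal illustration in Python, not the actual xha patch (which is not written in Python); the device path, status format and fixed offset are assumptions for the sake of the example:

```python
import os
import time

def write_heartbeat(device_path: str, status: bytes,
                    interval: float, rounds: int) -> None:
    """Write the host status periodically, opening and closing the
    device around each write, so the block device never sees a
    long-lived extra file descriptor (DRBD only tolerates a single
    opener on a device)."""
    for _ in range(rounds):
        # Open just-in-time instead of keeping a descriptor open for
        # the daemon's whole lifetime, as the stock heartbeat does.
        fd = os.open(device_path, os.O_WRONLY)
        try:
            os.pwrite(fd, status, 0)   # status block at a fixed offset
        finally:
            os.close(fd)               # release the device immediately
        time.sleep(interval)
```

The design trade-off is extra `open()`/`close()` syscalls on every heartbeat in exchange for never holding the DRBD device open between writes.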
Finally, everything is currently handled via the CLI; we have not yet started working on the GUI for this solution. This is not a very challenging step, but UI work usually comes late in a project and we are not there yet. We also still need to discuss the best way to display the numerous deployment options LINSTOR offers.
We have run a lot of benchmarks on this new solution (we may share the full data in a very technical blog post for those who care), but here are some of the most interesting scenarios.
As a baseline, we compare against a "normal" VM using a local SSD SR (`vm-ext`). The first DRBD test is a VM with a non-replicated resource. There is a small difference, but it is marginal. This case was tested with fio, using 4M blocks in a sequential pattern.
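For reference, this kind of workload roughly corresponds to a fio job file like the one below. This is only a sketch: the exact parameters, target device and job names are assumptions, not our actual benchmark configuration.

```ini
[global]
bs=4M            ; large blocks for sequential throughput
ioengine=libaio
direct=1         ; bypass the page cache
iodepth=16
size=10G
filename=/dev/xvdb   ; assumed: the VM's test disk

[seq-write]
rw=write         ; sequential write pass

[seq-read]
rw=read          ; sequential read pass
stonewall        ; run only after the write job has finished
```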
With a VM whose disk is replicated live on another host, one thing is immediately visible: we completely saturate the 10G link! That is the limiting factor for sequential writes. Reads are unaffected and stay very high, because they can be served from the local SSD. This matters a lot: it means you can get very high read speeds, and likely better write speeds too if your network can keep up. We will probably try to get some 40G hardware at some point. Note that SSDs behind a 1G link would be almost pointless.
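As a quick sanity check on why the network is the bottleneck, the theoretical ceiling of a link is easy to work out (ignoring protocol overhead):

```python
def link_ceiling_gib_s(gbit_per_s: float) -> float:
    """Theoretical payload ceiling of a network link:
    Gbit/s -> bytes/s -> GiB/s, ignoring protocol overhead."""
    bytes_per_s = gbit_per_s * 1e9 / 8
    return bytes_per_s / 2**30

# A 10G link tops out around 1.16 GiB/s, below what a good local
# SSD can sustain in sequential writes, hence the saturation.
print(round(link_ceiling_gib_s(10), 2))   # → 1.16
```

The same function gives roughly 4.66 GiB/s for a 40G link, which is why faster hardware should raise the replicated write ceiling.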
The diskless scenario is also interesting: you saturate the 10G link for both reads and writes, because the VM's disk is not local to the host it runs on. This is also very powerful, since it lets a host run VMs without any local SR at all.
In this case, what matters is network latency. We are using very small blocks (4k) and relatively cheap switches, so the results might be much better with higher-end hardware. Our rather old CPUs likely have an impact as well: this is visible in the fact that the diskless VM currently gets better performance, since the I/O processing load stays on the other host.
We are using ioping, and the unit is µs. As you can see, replicating the data raises the latency a bit, but the price is higher for diskless VMs, which is rather logical. In the end, the extra latency is very moderate!
Our work with LINBIT continues: we report every bug and issue we find as we move toward the private beta stage. We are truly happy with this collaboration, impressed by the technology, and eager to bring it into XCP-ng!
Stay tuned for the next blog post