Need Help Understanding the VM Suspend Process
-
Hi,
Suspend means all of the VM's memory (RAM) has to be written to the Suspend SR. More RAM == more time to suspend.
-
@olivierlambert Aaah, I've always wondered what the Suspend SR on the Advanced tab of the Pool meant......now I know.
So, the NIC capacity/pipe between the hosts and the SR does really matter here.
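To put rough numbers on that, here is a minimal sketch that estimates a lower bound on suspend time from the VM's RAM size and the link speed. The function name and the sample RAM sizes are made up for illustration, and it assumes the network link is the only bottleneck (no protocol overhead, no storage-side limits), so real suspends will take longer.

```python
def estimate_transfer_seconds(ram_gib: float, link_mbit_per_s: float) -> float:
    """Lower bound on the time to move `ram_gib` of VM memory over the link.

    Assumption: the link is the only bottleneck and runs at full speed.
    """
    ram_bits = ram_gib * 1024**3 * 8           # GiB -> bits
    link_bits_per_s = link_mbit_per_s * 1_000_000  # Mb/s -> bits per second
    return ram_bits / link_bits_per_s


if __name__ == "__main__":
    # Hypothetical VM sizes on a 1 Gb/s link.
    for ram_gib in (4, 16, 64):
        seconds = estimate_transfer_seconds(ram_gib, 1000)
        print(f"{ram_gib} GiB over 1 Gb/s: at least {seconds:.0f} s")
```

Doubling the RAM doubles the estimate, and a faster pipe between the hosts and the SR shrinks it proportionally.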
-
Sharing this so others might benefit from what I'm learning.
So, I looked at the network performance on TrueNAS during the Smart Reboot of the second XCP-ng host (screenshot below). What I saw seems to suggest that I'm getting near wire speed during READ operations. However, WRITE operations seem to be hitting a ceiling and I have a feeling it might be due to me having SYNC enabled on the dataset.
-
Indeed, sync is likely the cause of that. And since you have to write the RAM, it can be a lot of GiB for big VMs.
You can try disabling sync temporarily and test again.
Note that a future improvement will be to save the VM RAM while the VM is running (instead of pausing it), reducing the "downtime". But this won't change the fact you must write the RAM somewhere, and this takes time.
-
@olivierlambert Yes, I do plan on testing with SYNC disabled and then again with several permutations of dataset changes on the TrueNAS side (like compression on/off, etc.).
Do you guys have a best practices document for setting up an NFS SR using TrueNAS? I browsed through the published XCP-ng documentation site but didn't find anything specific to TrueNAS or maybe I missed it.
-
Nothing very specific. Sync vs. async should make the biggest difference.
-
@olivierlambert Oh okay, thanks for responding.
So I turned off SYNC and COMPRESSION on the dataset and retested (by suspending 11 VMs). I immediately noticed a whopping performance improvement (essentially sustained wire speeds AND 50% faster completion time*):
- Roughly 984 Mb/s sustained WRITE speeds (during VM suspension)
- Roughly 984 Mb/s sustained READ speeds (during VM resumption)
- Transfer time for both READ and WRITE is about 20 minutes (down from 40-45 mins)
Gonna retest with SYNC disabled and COMPRESSION re-enabled and see if it degrades performance; stand by for another report.
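As a sanity check on those figures, here is a quick back-of-the-envelope calculation. It assumes the 984 Mb/s rate held for the full 20 minutes and that the earlier 40-45 minute runs moved the same amount of data; neither was measured directly, so treat the results as rough.

```python
WRITE_MBIT_PER_S = 984    # reported sustained rate with sync disabled
FAST_MINUTES = 20         # reported transfer time with sync disabled
SLOW_MINUTES = (40, 45)   # reported transfer time with sync enabled


def data_moved_gb(rate_mbit_per_s: float, minutes: float) -> float:
    """Data volume in GB (decimal) moved at a constant rate for `minutes`."""
    return rate_mbit_per_s * 1e6 * minutes * 60 / 8 / 1e9


def implied_rate_mbit_per_s(data_gb: float, minutes: float) -> float:
    """Average rate in Mb/s needed to move `data_gb` in `minutes`."""
    return data_gb * 1e9 * 8 / (minutes * 60) / 1e6


total_gb = data_moved_gb(WRITE_MBIT_PER_S, FAST_MINUTES)
print(f"Data moved with sync disabled: about {total_gb:.1f} GB")

for minutes in SLOW_MINUTES:
    rate = implied_rate_mbit_per_s(total_gb, minutes)
    print(f"Same data in {minutes} min implies about {rate:.0f} Mb/s")
```

Under those assumptions, the 11 VMs add up to roughly 148 GB of RAM, and the sync-enabled runs averaged somewhere between 437 and 492 Mb/s, about half of wire speed.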
-
Here's the latest, and probably last, test. Disabling compression had no appreciable impact on performance. I am now fully convinced that SYNC is the major player here.
-
Yes it is, and it's pretty logical: with sync disabled, you don't have to wait for confirmation that each block has actually been written to the drive.
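For anyone who wants to see the effect in isolation, here is a small local-disk analogy, not a model of NFS or TrueNAS. It writes the same blocks twice: once waiting for `os.fsync` after every block, once letting the OS buffer the writes. The block size, block count, and function name are arbitrary choices for illustration, and the actual timings depend entirely on your hardware and filesystem.

```python
import os
import tempfile
import time

BLOCK = b"\0" * (1024 * 1024)  # 1 MiB per write (arbitrary size)


def write_blocks(path: str, count: int, sync_each: bool) -> float:
    """Write `count` blocks to `path` and return the elapsed seconds."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(BLOCK)
            if sync_each:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # block until the OS reports the data is on stable storage
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blocks.bin")
        for sync_each in (True, False):
            elapsed = write_blocks(path, 64, sync_each)
            print(f"sync_each={sync_each}: {elapsed:.3f} s")
```

The synced run pays a wait on every block, which is the same kind of per-write confirmation cost that the dataset's sync setting removes.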
-
@olivierlambert Yes sir, it is and I'm glad I confirmed this for myself. Thanks also for helping me understand how the VM Suspend process works. Hopefully this post helps other newbies with the same understanding in the future.