Switching to XCP-NG, want to hear your problems

rtjdamen

@flakpyro Are you using SSD storage? If so, I would recommend increasing the leaf coalesce parameters. This will prevent these loops from occurring. We had issues with this prior to changing them, and now we no longer see it.

flakpyro

@rtjdamen The Pure storage array is all flash. I did see your comments about increasing those values. I did do that, and while it did help, it can still be hit or miss for us. For example, our SharePoint server took 2-3 hours the other night to coalesce when using full CBT with snapshot delete enabled. Running it without snapshot delete, the traditional snapshot coalesce only takes maybe 10-15 minutes.

With that said, we are also having issues with Pure's NFS implementation and how it interacts with XCP-NG, causing storage timeouts for us. According to them, the array, when under load, is disconnecting hosts due to "expired NFS leases". We are currently working with them to stabilize that; perhaps then I can revisit full CBT backups. Weirdly enough, the disconnects do not appear to happen under regular VM operations, even under high load. We mostly run into these during backup runs, with CBT-enabled runs increasing the chance of it happening.

What we ideally will end up with is local backups, offsite backups and archival backups. This is what we had with Veeam prior. Doing this with traditional XOA backups would result in 3 snapshots per VM (1 for each job), which led me to look into Commvault, since it does things a bit differently. CBT would also solve this, I assume.

rtjdamen

@flakpyro Ah, I understand. I have seen issues with NFS as well; doing it with iSCSI gave us much better results.

Also, I would recommend you look into Alike A3 backup. We use it with high-I/O VMs with 2 TB disks, no issues there, and you need only one snapshot per VM.

nikade

@flakpyro What NFS version are you using?
We're backing up 50-60 VMs every night (without CBT) and the coalesce is pretty quick. I've tried NFS 4 but we had some timeout issues, so we went back to NFS 3, which seems stable.

Have you tried to experiment with the mount options?
Just curious, since we didn't have any luck with NFS 4 on either TrueNAS or our Dell PowerStore. The latter is NVMe all-flash and the TrueNAS boxes are 10K SAS with SSD as cache.

flakpyro

@nikade Interesting to hear we are not the only ones experiencing timeouts with v4. We are on NFS 4.1 with the Pure array on the latest firmware. I initially wanted to use v4 as it is a stateful protocol and thought it might handle controller failovers better because of that. Perhaps I should try v3 and see if it fares better. We had okay luck with v4 on TrueNAS but never really ran it under any extreme load. It will run for 2 or 3 days without issue, then suddenly NFS drops appear in dmesg on the hosts.

For example:

dmesg -T from the host shows the following, with 10.174.199.25 being the array:

[Thu Aug 15 01:28:30 2024] nfs: server 10.174.199.25 not responding, still trying
[Thu Aug 15 01:28:30 2024] nfs: server 10.174.199.25 not responding, still trying
[Thu Aug 15 01:28:30 2024] nfs: server 10.174.199.25 not responding, still trying
[Thu Aug 15 01:28:30 2024] nfs: server 10.174.199.25 not responding, still trying

Followed by recovery after some time:

[Thu Aug 15 01:29:49 2024] nfs: server 10.174.199.25 OK
[Thu Aug 15 01:29:49 2024] nfs: server 10.174.199.25 OK
[Thu Aug 15 01:29:49 2024] nfs: server 10.174.199.25 OK
[Thu Aug 15 01:29:49 2024] nfs: server 10.174.199.25 OK
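
For anyone who wants to put a number on how long the mount was unavailable, here is a minimal sketch that pairs the first "not responding" line with the next "OK" line per server. It assumes the `dmesg -T` line format shown above and an English locale for the day and month names; the function name `nfs_outages` is made up for illustration.

```python
import re
from datetime import datetime

# Matches lines in the `dmesg -T` format shown above, e.g.
# [Thu Aug 15 01:28:30 2024] nfs: server 10.174.199.25 not responding, still trying
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] nfs: server (?P<server>\S+) (?P<state>not responding|OK)"
)


def nfs_outages(lines):
    """Return (server, start, end, seconds) for each not-responding -> OK pair."""
    down_since = {}  # server -> timestamp of the first "not responding" line
    outages = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
        server = m["server"]
        if m["state"] == "not responding":
            # Keep only the first timestamp; repeated messages are ignored.
            down_since.setdefault(server, ts)
        elif server in down_since:
            start = down_since.pop(server)
            outages.append((server, start, ts, (ts - start).total_seconds()))
    return outages


sample = """\
[Thu Aug 15 01:28:30 2024] nfs: server 10.174.199.25 not responding, still trying
[Thu Aug 15 01:28:30 2024] nfs: server 10.174.199.25 not responding, still trying
[Thu Aug 15 01:29:49 2024] nfs: server 10.174.199.25 OK
[Thu Aug 15 01:29:49 2024] nfs: server 10.174.199.25 OK
"""

for server, start, end, seconds in nfs_outages(sample.splitlines()):
    print(f"{server}: unavailable for {seconds:.0f}s ({start} -> {end})")
```

For the excerpt above this works out to a 79-second stall, which is long enough for guest I/O to hang noticeably.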

Is that what you were seeing on your PowerStore with v4 as well?

nikade

@flakpyro Thanks for replying!
Yeah, we've not had much luck with NFS 4, but we've been using NFS 3 for years (since 2017, I think) and it has been rock solid. Failovers between controllers don't even cause a single timeout in dmesg on the hosts.

I've tried to tweak the mount options, but I didn't have much luck. That's the reason I asked if you had played around with them.

nikade

We actually had some timeouts like these on our "slower" TrueNAS. We got help from @yannik, I think, to tweak the mount options when our backups failed.
Ever since they tweaked the mount options it has been rock solid and we haven't seen any timeouts in dmesg.

flakpyro

@nikade I am not using any custom mount options other than "hard", to do a hard NFS mount and prevent data loss when drops like these happen.

Did you have to use custom mount options with V3 as well then or just with V4? I may try moving VMs over to a V3 mount from V4 to see if that helps stabilize things.
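
When comparing v3 and v4 mounts it helps to confirm what the kernel actually negotiated, since the effective options can differ from what was requested. Below is a minimal sketch that reads text in the `/proc/mounts` format and reports the NFS version and hard/soft mode for each NFS mount. The sample line is illustrative only: the export path and mount point in it are hypothetical, not taken from a real host.

```python
def nfs_mounts(proc_mounts_text):
    """Return (mount_point, nfs_version, hard_or_soft) for each NFS mount."""
    result = []
    for line in proc_mounts_text.splitlines():
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = fields[3].split(",")
        version = next(
            (o.split("=", 1)[1] for o in opts if o.startswith("vers=")),
            "unknown",
        )
        if "soft" in opts:
            mode = "soft"
        elif "hard" in opts:
            mode = "hard"
        else:
            mode = "unknown"
        result.append((fields[1], version, mode))
    return result


# Illustrative sample; export path and mount point are hypothetical.
SAMPLE = (
    "10.174.199.25:/example-export /run/sr-mount/example-sr nfs4 "
    "rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,proto=tcp 0 0\n"
    "/dev/sda1 / ext3 rw,relatime 0 0\n"
)

for mount_point, version, mode in nfs_mounts(SAMPLE):
    print(f"{mount_point}: NFS v{version}, {mode} mount")
```

On a real host the input would come from `open("/proc/mounts").read()` instead of the sample string.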

rfx77

@flakpyro
To be honest, we gave up on Xen for our systems where we have shared storage. We switched to Hyper-V.

The lack of CBT support in CommVault was a major problem. We cannot use XO for backups because it lacks 90% of the features we now have with CommVault (dedup, tape, agents, IntelliSnap of VM disks, Aux Copies, ...), and we don't want to combine it with CommVault agents since this complicates our backups and does not make real sense.

Hyper-V with clustering is free in our setup, so it was not a big discussion there. We can also utilize the full potential (performance- and feature-wise) of our SAN and network, which was always a major issue with Xen.

We are keeping XCP-NG for now on some specialized installs where we need to map physical hardware into a VM.

flakpyro

@rfx77 Makes sense. I was thinking of giving the Commvault trial a try with XCP-NG, since it looked like, while they don't use Xen's CBT, they still track changed blocks within their helper VM while still doing dedupe.

Backups have been the biggest setback in our move from VMware. I knew going in I would miss Veeam more than I'd miss VMware itself.

As for NFS 4.1 vs 3, if the timeouts return this week I think I will give v3 a try, since it worked more reliably for you.

nikade

In XOA we got some mount options from @yannik but in XCP we have not had to use any special options when mounting the NFS SR (as long as we're using NFS 3).

nikade

Yeah, Veeam really is the king of backups. We're backing up about 50 VMs with Veeam from our VMware clusters and man, it is sooo fast and reliable. I've seen backups go at 7 Gbit/s, which is incredible.

flakpyro

@nikade Yeah, Veeam was very set and forget. We backed up around 100 VMs a night locally and to our DR site, and it just reliably worked and I never really had to think about it. Once CBT stabilizes in XCP-NG I think that will go a long way in helping, but I don't think it's quite production ready yet.

I'm hoping to eventually get to the same point with XCP-NG, be it with XOA backups or with something like Commvault. Our NFS mounts last dropped Thursday night during backups and have been fine since, so it's VERY intermittent. I think if it happens again I will begin moving to NFS 3 backed SRs and hope that solves it.

What issues with Commvault did you run into due to its lack of native Xen CBT? Was their own internal changed block tracking not reliable? I have a feeling the Xen portion of the product does not see a lot of development attention, from reading their documentation.

nikade

@flakpyro I never tried Commvault so I can't really tell, but I've tried Acronis and Quadric and performance wasn't too great, at least not better than XOA, so there was no point.

rfx77

@flakpyro

Our issue with CommVault was that although it mounts the snapshots in the proxy VM, it has to read them as a whole. So incremental backups take nearly as long as full backups. When you want to back up 20 TB+ of incrementals every night, that's not acceptable. So the lack of CBT is really a big deal when doing VM backups.

Performance with CommVault would not be that big of a problem, since you can do multiple streams and multiple VMs simultaneously, but we could not get past 600 MB/s, which according to our testing is a Xen problem. With Hyper-V we see performance in excess of 1.6 GB/s in the same scenarios. But even this performance would be much too slow to do incrementals every night.
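
To make the throughput figures concrete, here is a quick back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a sustained rate for the whole run, and that without CBT the full 20 TB has to be read each night. The function name is made up for illustration.

```python
def hours_to_read(terabytes, rate_mb_per_s):
    """Hours needed to read `terabytes` at a sustained `rate_mb_per_s` (decimal units)."""
    total_bytes = terabytes * 1e12
    seconds = total_bytes / (rate_mb_per_s * 1e6)
    return seconds / 3600


xen_hours = hours_to_read(20, 600)      # rate observed on Xen
hyperv_hours = hours_to_read(20, 1600)  # rate observed on Hyper-V

print(f"20 TB at 600 MB/s:  {xen_hours:.1f} h")
print(f"20 TB at 1600 MB/s: {hyperv_hours:.1f} h")
```

That is roughly 9.3 hours at 600 MB/s versus about 3.5 hours at 1.6 GB/s, which illustrates why reading every disk in full each night is a problem at either rate.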

We tested XO CBT but it was not stable, and from what I read in the posts there seem to be problems with CBT and live migration, so that the CBT state seems to be lost. That is also a big problem for us.

When we used CommVault to back up multiple VMs, we ran into blue screens of the Xen toolstack in the Windows VMs when attaching or detaching snapshot VDIs to the proxy VMs. There clearly is a bug in the Xen Windows drivers. So we had to reduce concurrency, which reduced backup speed.

We didn't just do a short test with Xen. We migrated our internal production cluster to it from VMware and have used it for about 4 months now (30+ VMs on iSCSI storage with 3 nodes). After we ran into more and more problems, we had many discussions with our team and had to ask ourselves what Xen brings to the table that is worth the drawbacks. The only scenario where it fits for us is where we have to map physical hardware into VMs.

We decided to come back in some time to see how the SMAPIv3 drivers work out and whether there is better support for shared storage.

flakpyro

@rfx77 As a follow-up, v4 did cause more downtime for us. I switched everything over to v3, which has been much better so far. I'm curious to see how a controller failover goes during a firmware update, as v3 is stateless whereas v4 and iSCSI are stateful protocols.

nikade

We're also using v3, and failover on our Dell PowerStores is seamless.
We haven't tried v4 again since we had A LOT of "nfs server not responding" issues with it and immediately went back to v3.

CodeMercenary

Interesting to know that v3 seems to be more reliable than v4. I had repeated problems using NFS for a backup remote, and those problems only went away when I changed the remotes to use SMB. I know NFS would be better to use, but a backup that happens through an inferior protocol is way better than one that fails using a better protocol.

Maybe I should give NFS another chance but force it to use v3.

In my case, I'd get backups working on NFS and then several days later a backup would fail. Then backups fail every day until I intervene, usually by rebooting the XO VM. Sometimes I'd then have to do cleanup, like releasing a VDI or something. Then it may or may not start working again but if it did start working I'd have another failure a few days later. It's been 2.5 weeks since I switched it to SMB and have had no failures. That's definitely the longest I've gone without a failure from a delta backup to a networked drive.

Note, I also have backups going to a local drive mounted in XO so with all those remote failures I always had a clean backup somewhere. This was in the process of trying to decide if I could trust sending delta backups to a network remote rather than using full backups to a local remote. My initial feelings were that the delta backups didn't work reliably but now I believe the issue was with NFS, not with deltas specifically.

flakpyro

@CodeMercenary I am using V4 on the XO-Server to our backup remotes and it seems to work just fine. However using V4 as a storage SR was nothing but problems, as @nikade mentioned we had tons of NFS Server not responding issues which would lock up hosts and VMs causing downtime. Since moving to v3 that hasn't happened.

Checking a host's NFS retransmission stats after 9 days of uptime, I see we have had some retransmissions, but they have not caused any downtime or even any timeout messages in dmesg on the host.

[xcpng-prd-02 ~]# nfsstat -rc
Client rpc stats:
calls      retrans    authrefrsh
268513028   169        268537542

From what I gather from this Red Hat blog post (https://www.redhat.com/sysadmin/using-nfsstat-nfsiostat), that amount of retransmissions is VERY low and not an issue.
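
To put the counters in proportion, here is a minimal sketch that computes the retransmission ratio from `nfsstat -rc` output. It assumes the three-line layout shown above (a title line, a header line starting with `calls` and `retrans`, then the values); the function name is made up for illustration.

```python
def retrans_ratio(nfsstat_output):
    """Return retrans / calls from `nfsstat -rc` client RPC stats."""
    rows = [line.split() for line in nfsstat_output.splitlines() if line.strip()]
    # Find the header row and read the numbers from the row right after it.
    for header, values in zip(rows, rows[1:]):
        if header[:2] == ["calls", "retrans"]:
            calls, retrans = int(values[0]), int(values[1])
            return retrans / calls
    raise ValueError("no client rpc stats found")


output = """\
Client rpc stats:
calls      retrans    authrefrsh
268513028   169        268537542
"""

ratio = retrans_ratio(output)
print(f"retransmission ratio: {ratio:.2e} ({ratio * 100:.5f}% of calls)")
```

With the numbers above, that is 169 retransmissions out of about 268.5 million calls, or roughly one in every 1.6 million.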

nikade

That's fine, we've got a lot more and I haven't seen any "nfs server not responding" in dmesg yet.
We've been using NFS v3 for both SR and backups for a couple of years now and it's been great. I think I had issues once or twice in like 5-6 years on the backup SR, where the VHD file got locked by dom0. Vates helped out there as always and it was resolved quickly.