Switching to XCP-NG, want to hear your problems
-
@rfx77 Makes sense. I was thinking of giving the Commvault trial a try with XCP-NG since it looked like while they don't use Xen's CBT, they still track changed blocks within their helper VM while still doing dedupe.
Backups have been the biggest setback in our move from VMware, I knew going in I would miss Veeam more than I'd miss VMware itself.
As for NFS 4.1 vs 3, if the timeouts return this week I think I will give v3 a try since it worked more reliably for you.
-
@flakpyro said in Switching to XCP-NG, want to hear your problems:
@nikade I am not using any custom mount options other than "hard" to do a hard NFS mount to prevent data loss when drops like these happen.
Did you have to use custom mount options with V3 as well then or just with V4? I may try moving VMs over to a V3 mount from V4 to see if that helps stabilize things.
In XOA we got some mount options from @yannik but in XCP-ng we have not had to use any special options when mounting the NFS SR (as long as we're using NFS 3).
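For anyone following along, a sketch of what the hard mount discussed above looks like on a Linux NFS client (the hostname, export path, and mount point here are hypothetical placeholders, not values from this thread):

```shell
# Hard-mount an NFS export with v3: if the server drops, I/O blocks and
# retries indefinitely instead of returning errors to the client, which
# avoids the data loss mentioned above.
mount -t nfs -o vers=3,hard,proto=tcp \
    nas01.example.com:/export/vm-storage /mnt/vm-storage
```

Note that "hard" is the default behavior on most Linux clients; the trade-off versus "soft" is that processes hang until the server comes back rather than seeing I/O errors.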
-
@flakpyro said in Switching to XCP-NG, want to hear your problems:
@rfx77 Makes sense. I was thinking of giving the Commvault trial a try with XCP-NG since it looked like while they don't use Xen's CBT, they still track changed blocks within their helper VM while still doing dedupe.
Backups have been the biggest setback in our move from VMware, I knew going in I would miss Veeam more than I'd miss VMware itself.
As for NFS 4.1 vs 3, if the timeouts return this week I think I will give v3 a try since it worked more reliably for you.
Yeah, Veeam really is the king of backups. We're backing up about 50 VMs with Veeam from our VMware clusters and man it is sooo fast and reliable, I've seen backups go at 7Gbit/s which is incredible.
-
@nikade Yeah Veeam was very set-and-forget, we backed up around 100 VMs a night locally and to our DR site and it just reliably worked and I never really had to think about it. Once CBT stabilizes in XCP-NG I think that will go a long way in helping but I don't think it's quite production ready yet.
I'm hoping to eventually get to the same point with XCP-NG, be it with XOA backups or with something like Commvault. Our NFS mounts last dropped Thursday night during backups and have been fine since, so it's VERY intermittent. I think if it happens again I will begin moving to NFSv3-backed SRs and hope that solves it.
What issues with Commvault did you run into due to its lack of native Xen CBT? Was their own internal changed-block tracking not reliable? I have a feeling the Xen portion of the product does not see a lot of development attention, from reading their documentation.
-
@flakpyro I never tried Commvault so I can't really tell, but I've tried Acronis and Quadric and performance wasn't too great, at least not better than XOA, so there was no point.
-
Our issue with Commvault was that even though it mounts the snapshots in the proxy VM, it has to read them in full, so incremental backups take nearly as long as full backups. When you need to back up 20TB+ of incrementals every night, that's not acceptable. So the lack of CBT is really a big deal when doing VM backups.
Performance with Commvault itself would not be that big of a problem, since you can run multiple streams and multiple VMs simultaneously, but we could not get past 600MB/s, which is a Xen problem as far as our testing showed. With Hyper-V we see performance in excess of 1.6GB/s in the same scenarios. But even that would be much too slow to do incrementals every night.
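To put those numbers in perspective, a rough back-of-the-envelope calculation (using the 20TB and 600MB/s figures from the post above) shows why reading incrementals in full doesn't scale:

```shell
# Rough backup-window math: reading 20 TB in full at a ~600 MB/s ceiling.
awk 'BEGIN {
    mb   = 20 * 1024 * 1024   # 20 TB expressed in MB
    secs = mb / 600           # seconds at 600 MB/s
    printf "%.1f hours\n", secs / 3600
}'
# prints: 9.7 hours
```

So every "incremental" would consume a nearly ten-hour window, which is why the lack of CBT is a dealbreaker in this scenario.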
We tested XO CBT but it was not stable, and from what I read in the posts there seem to be problems with CBT and live migration where the CBT state gets lost. Also a big problem for us.
When we used Commvault to back up multiple VMs we ran into blue screens from the Xen toolstack in the Windows VMs when attaching or detaching snapshot VDIs to the proxy VMs. There clearly is a bug in the Xen Windows drivers. So we had to reduce concurrency, which reduced backup speed.
We didn't just do a short test with Xen: we migrated our internal production cluster to it from VMware and used it for about four months (30+ VMs on iSCSI storage with 3 nodes), and after we ran into more and more problems we had many discussions with our team and had to ask ourselves what Xen brings to the table that is worth the drawbacks. The only scenario where it fits for us is where we have to map physical hardware into VMs.
We decided to come back in some time to see how the SMAPIv3 drivers work out and if there is a better support for shared storage.
-
@rfx77 As a follow up, v4 did cause more downtime for us; I switched everything over to v3, which has been much better so far. Going to be curious to see how a controller failover goes during a firmware update, since v3 is stateless whereas v4 and iSCSI are stateful protocols.
-
@flakpyro said in Switching to XCP-NG, want to hear your problems:
@rfx77 As a follow up, v4 did cause more downtime for us; I switched everything over to v3, which has been much better so far. Going to be curious to see how a controller failover goes during a firmware update, since v3 is stateless whereas v4 and iSCSI are stateful protocols.
We're also using v3 and failovers on our Dell PowerStores are seamless.
Haven't tried v4 since we had A LOT of "nfs server not responding" issues with it and immediately went back to v3.
-
Interesting to know that v3 seems to be more reliable than v4. I had repeated problems using NFS for a backup remote, and those problems only went away when I changed the remotes to use SMB. I know NFS would be better to use, but a backup that happens through an inferior protocol is way better than one that fails using a better protocol.
Maybe I should give NFS another chance but force it to use v3.
In my case, I'd get backups working on NFS and then several days later a backup would fail. Then backups fail every day until I intervene, usually by rebooting the XO VM. Sometimes I'd then have to do cleanup, like releasing a VDI or something. Then it may or may not start working again but if it did start working I'd have another failure a few days later. It's been 2.5 weeks since I switched it to SMB and have had no failures. That's definitely the longest I've gone without a failure from a delta backup to a networked drive.
Note, I also have backups going to a local drive mounted in XO so with all those remote failures I always had a clean backup somewhere. This was in the process of trying to decide if I could trust sending delta backups to a network remote rather than using full backups to a local remote. My initial feelings were that the delta backups didn't work reliably but now I believe the issue was with NFS, not with deltas specifically.
-
@CodeMercenary I am using v4 on the XO server to our backup remotes and it seems to work just fine. However, using v4 as a storage SR was nothing but problems; as @nikade mentioned, we had tons of "NFS server not responding" issues which would lock up hosts and VMs, causing downtime. Since moving to v3 that hasn't happened.
Checking a host's NFS retransmission stats after 9 days of uptime, I see we have had some retransmissions but they have not caused any downtime or even any timeout messages in dmesg on the host.
[xcpng-prd-02 ~]# nfsstat -rc
Client rpc stats:
calls        retrans      authrefrsh
268513028    169          268537542
From what I gather from this blog post from Red Hat (https://www.redhat.com/sysadmin/using-nfsstat-nfsiostat), that amount of retransmissions is VERY low and not an issue.
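For context, the retransmission rate can be computed directly from those two counters (the values here are the ones from the nfsstat output above):

```shell
# Retransmission rate from the nfsstat -rc counters:
# 169 retransmissions out of 268513028 RPC calls.
awk 'BEGIN { printf "%.6f%%\n", (169 / 268513028) * 100 }'
# prints: 0.000063%
```

Far below even a conservative 1% threshold, so it backs up the "not an issue" conclusion.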
-
@flakpyro said in Switching to XCP-NG, want to hear your problems:
@CodeMercenary I am using v4 on the XO server to our backup remotes and it seems to work just fine. However, using v4 as a storage SR was nothing but problems; as @nikade mentioned, we had tons of "NFS server not responding" issues which would lock up hosts and VMs, causing downtime. Since moving to v3 that hasn't happened.
Checking a host's NFS retransmission stats after 9 days of uptime, I see we have had some retransmissions but they have not caused any downtime or even any timeout messages in dmesg on the host.
[xcpng-prd-02 ~]# nfsstat -rc
Client rpc stats:
calls        retrans      authrefrsh
268513028    169          268537542
From what I gather from this blog post from Red Hat (https://www.redhat.com/sysadmin/using-nfsstat-nfsiostat), that amount of retransmissions is VERY low and not an issue.
That's fine, we've got a lot more and I haven't seen any "nfs server not responding" in dmesg yet.
We've been using NFS v3 for both SR and backups for a couple of years now and it's been great. I think I had issues once or twice in like 5-6 years on the backup SR where the vhd file got locked by dom0; Vates helped out there as always and it was resolved quickly.