Shared Storage Redundancy Testing
-
We're busy with redundancy testing in our test bench for the network storage we plan to add. We plan to move our local storage to shared storage on TrueNAS servers, with NFS shares connected to our pool. We will have 2 TrueNAS servers with identical pool names and NFS sharing (so ultimately only a single IP address changes). We will then add to the pool a single NFS share pointing to the primary TrueNAS server. In the event this server were to fail (for whatever reason), we would like to simply change the IP address of the share in the pool so that it connects to the secondary TrueNAS server, which in theory should hold an identical dataset to the primary share.
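For reference, the IP we'd be changing lives in the device-config of the SR's PBDs; something along these lines (the SR name, UUIDs and values below are placeholders):

```
# Placeholder SR name / UUIDs - find the SR, then inspect its PBDs
xe sr-list name-label="TrueNAS NFS" params=uuid --minimal
xe pbd-list sr-uuid=<sr-uuid> params=uuid,host-uuid,device-config,currently-attached
# device-config holds the NFS details, e.g. server: 10.0.0.10; serverpath: /mnt/tank/vms
```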
In our test bench we have set up a pool with 2 hosts. We also have 2 TrueNAS servers already configured, which are replicating a set of test VMs to each other.
Our initial experience has been a bit strange. Even after shutting down the TrueNAS server, the storage still shows as "connected" in the pool - the VMs keep running (albeit with no storage attached, so only what is left in RAM). We forcefully shut down all of the VMs (this is a test bench, so we want to replicate a real-world scenario where we need to switch to the failover storage). Going to the storage in the pool, it stays connected, but we're unable to disconnect it even with no VMs running on it. I suspect this is because it is unable to "disconnect" from the NFS mount point as the actual server is offline.
This leaves us with a bit of a problem, and we're hoping others can help here:
- As both TrueNAS servers should be identical in storage (the NFS mount points and pools are named exactly the same), we figured it would be as simple as changing the NFS shared storage IP to point to the new server, but this seems to be problematic. What would be the best way to simply update the IP address of the storage?
-
@mauzilla it's unfortunately not possible to change the IP or DNS name of an existing SR. It has to be dropped and recreated. I suppose you could swap the IP addresses on the NFS servers, but even that may not help.
-
@Forza I assume we should recreate the PBD, or is there another way to achieve the above? It seems like a real-world issue someone may be faced with if you replicate your NAS to a secondary as a failover.
-
PBD remove and recreate will do the trick, no need to remove the SR.
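Roughly, for each host in the pool, something like this (UUIDs are placeholders, and the device-config keys/values should be copied from what your current PBDs report):

```
# Placeholder UUIDs/values - repeat per host in the pool
xe pbd-unplug uuid=<pbd-uuid>
xe pbd-destroy uuid=<pbd-uuid>
new_pbd=$(xe pbd-create sr-uuid=<sr-uuid> host-uuid=<host-uuid> \
  device-config:server=<secondary-nas-ip> \
  device-config:serverpath=/mnt/tank/vms)
xe pbd-plug uuid=$new_pbd
```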
-
@olivierlambert stupid question, but would that then be "business as usual"? I.e. if the storage has the replicated data on it (or some version / snapshot), in theory after the PBD recreation the VMs will automatically pick up their individual VHDs?
-
You can't remove a PBD as long as you have one running disk on it. So you'll need to migrate or shutdown any VM using a disk on it, then delete the PBD and recreate it. You can use the SR maintenance mode button in XO to make it easier.
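If you want to see which VMs still have disks attached on the SR before unplugging, a quick sketch (placeholder SR UUID):

```
# Placeholder SR UUID - list VDIs on the SR and show VMs with attached VBDs
for vdi in $(xe vdi-list sr-uuid=<sr-uuid> params=uuid --minimal | tr ',' ' '); do
  xe vbd-list vdi-uuid="$vdi" currently-attached=true params=vm-name-label --minimal
done
```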
-
@olivierlambert said in Shared Storage Redundancy Testing:
You can't remove a PBD as long as you have one running disk on it. So you'll need to migrate or shutdown any VM using a disk on it, then delete the PBD and recreate it. You can use the SR maintenance mode button in XO to make it easier.
Does this mean the mapping between the disk, snapshots and VM is preserved?
It would be great if this procedure was implemented as an easy-to-use tool in XO/XOA.
-
We will test this today and let you know. Ultimately the use case here is to be able to make use of a failover NAS (which is replicated at NAS level), so that switching to a failover in the event of failure is a simpler process (else there is no practical point in replicating the VHDs between external storage servers if we cannot "switch" to another NAS).
I will let you know the outcome, but I agree with @Forza that if this does work it would be a great addition to the GUI to allow for a "switch to failover" scenario.
-
@Forza said in Shared Storage Redundancy Testing:
@olivierlambert said in Shared Storage Redundancy Testing:
You can't remove a PBD as long as you have one running disk on it. So you'll need to migrate or shutdown any VM using a disk on it, then delete the PBD and recreate it. You can use the SR maintenance mode button in XO to make it easier.
Does this mean the mapping between the disk, snapshots and VM is preserved?
It would be great if this procedure was implemented as an easy-to-use tool in XO/XOA.
If the storage is 100% replicated and looks the same, the snapshots should map to the VMs correctly since the paths would be identical.
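For what it's worth, a file-based NFS SR stores its VHDs under a directory named after the SR UUID, so if replication preserves that path the VDIs should be found again once the PBDs are re-plugged (placeholder path/UUIDs):

```
# On the NAS, the SR layout looks roughly like:
#   /mnt/tank/vms/<sr-uuid>/<vdi-uuid>.vhd
# After recreating and plugging the PBDs, rescan so XAPI picks the VDIs back up:
xe sr-scan uuid=<sr-uuid>
```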
-
@mauzilla said in Shared Storage Redundancy Testing:
We will test this today and let you know. Ultimately the use case here is to be able to make use of a failover NAS (which is replicated at NAS level), so that switching to a failover in the event of failure is a simpler process (else there is no practical point in replicating the VHDs between external storage servers if we cannot "switch" to another NAS).
I will let you know the outcome, but I agree with @Forza that if this does work it would be a great addition to the GUI to allow for a "switch to failover" scenario.
We're using NFS and failover on a Dell PowerStore 1000T, works pretty well, and the NAS presents just 1 IP so we don't have to reconfigure or unplug/plug any VBDs.
When a node fails, the VMs just continue running and the secondary node takes over within seconds, so there is really nothing happening except a really short hiccup.
-
@nikade said in Shared Storage Redundancy Testing:
@Forza said in Shared Storage Redundancy Testing:
@olivierlambert said in Shared Storage Redundancy Testing:
You can't remove a PBD as long as you have one running disk on it. So you'll need to migrate or shutdown any VM using a disk on it, then delete the PBD and recreate it. You can use the SR maintenance mode button in XO to make it easier.
Does this mean the mapping between the disk, snapshots and VM is preserved?
It would be great if this procedure was implemented as an easy-to-use tool in XO/XOA.
If the storage is 100% replicated and looks the same, the snapshots should map to the VMs correctly since the paths would be identical.
I meant with the procedure to remove and recreate the PBD as @olivierlambert mentioned.
-
@olivierlambert, we're simulating the PBD disconnect to see what would happen in production. The NAS was shut down (while the VMs were still running), and we then force shut down the VMs.
Running xe pbd-unplug gets stuck (and I assume this is likely due to dom0 being unable to umount the now-stale NFS mount point). This could normally be resolved with a lazy unmount (if one has access to dom0), but obviously we only interact with it through XAPI, so I'm not sure if there is an option to achieve this?
What we're trying to do is avoid a reboot if a NAS fails (as it may affect the entire pool and not just one host). Any ideas?
-
You can lazy umount a failed network share, then PBD unplug will work.
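From dom0 on the affected host, something like this (the mount path is typical for XCP-ng, but verify it with `mount`; UUIDs are placeholders):

```
# Lazy-unmount the stale NFS mount (path is an assumption - check `mount` output)
umount -l /run/sr-mount/<sr-uuid>
# The unplug should then complete
xe pbd-unplug uuid=<pbd-uuid>
```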