Best posts made by Midget
-
RE: XOSTOR Creation Issues
So I burnt it all down to ashes. Completely redid the storage. Reinstalled XCP-ng. Let's see what happens...
Latest posts made by Midget
-
RE: Let's Test the HA
@olivierlambert
I guess I could build a TrueNAS box quickly. Maybe after my vacation.
-
RE: Let's Test the HA
Well, it appears the SSD I was using for the hypervisor died. So now I’m reinstalling XCP onto what was the Master on a “new” SSD. Good thing we have no shortage of hardware in our lab lol.
-
RE: Let's Test the HA
I let the environment calm down and let things get back to normal. Gave it a few minutes, then pulled out the Master, which was XCP-HOST2.
It's been about 5 minutes, just checked XOA, and the cluster is gone. None of the VMs, nothing. How long should master selection take? I'll give it another 10 or so minutes before slotting the server back in place.
EDIT
I just noticed the XOSTOR no longer exists either...
-
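For future reference: if the master is permanently gone and HA does not fail over on its own, the surviving hosts will not promote themselves automatically. A minimal recovery sketch using the stock xe CLI (run on one of the surviving pool members; this is a hedged outline, not a tested procedure for this exact setup):

```
# Check whether the pool still reports a reachable master
xe pool-list params=master --minimal

# If the master is gone for good, promote this slave to master
# (run on the slave you want to become the new master)
xe pool-emergency-transition-to-master

# Then point the remaining slaves at the new master
xe pool-recover-slaves
```

Note that with HA enabled the pool is supposed to handle this election itself; the emergency transition is only for when that doesn't happen.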
RE: Let's Test the HA
@john-c Oh, you mean literally pull the power on the entire lab? I guess I could do that. Although our DC has dual 16kVA UPSes, dual 600-amp DC plants, and dual generators, so it would take a lot for that building to go dark. But it's a valid test.
-
RE: Let's Test the HA
When I stated power failure, it was a reference to a small-scale test to simulate what would happen if the data centre were to lose power.
I was already in the process of pulling a sled when you posted. BUT, the chassis only has 2 power supplies; the individual servers don't have their own. So that wouldn't work. I mean, I guess I could power a host down individually. I'll add that to the tests as well.
-
RE: Let's Test the HA
So I pulled one of the sleds, one of the servers from the chassis, I mean. I have 3 hosts in the cluster and one standalone.
Standalone
- XCP-HOST1
Cluster
- XCP-HOST2 (Master)
- XCP-HOST3
- XCP-HOST4
Each host has a Debian VM on it. I pulled the sled for Host 4. And it was, from what I can tell, a success. The Debian VM that was on Host 4 moved to Host 3 on its own. And I noticed the XOSTOR dropped down to roughly 10TB, so it noticed the drives were gone.
After checking everything, I then slotted the server back in place, and it rejoined the pool. I even migrated the VM back to its home server after it was part of the pool again.
I think the next trick will be to pull the master and see what happens. In theory it should elect a new master and then spin up the VM someplace else. I'm going to give it about 10 more minutes to soak after doing all that and then pull the master. I will report back.
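While waiting for the election, the state can also be watched from the CLI. A sketch assuming the standard xe and linstor tools (the linstor commands run wherever the LINSTOR controller lives, since XOSTOR is LINSTOR underneath):

```
# Which host does the pool currently consider master?
xe pool-list params=master --minimal
xe host-list params=uuid,name-label,enabled

# Check XOSTOR node and storage-pool health via LINSTOR
linstor node list
linstor storage-pool list
```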
-
RE: Dell cancels VMWare contract after Broadcom purchase
@Midget The XCP-ng 8.2 LTS (for Current Production) along with Xen Orchestra has lots of features. Including at least some that can't be found at all in VMware products and also in Proxmox. Plus Vates is very responsive and willing to work on additions either alone or with your employer's Development team.
Oh I know. The team here has been phenomenal in helping me set up my lab environment for XCP-ng with XOA.
But we have a lot of needs we need to make sure work, which I plan on discussing here after I put it through its paces.
-
RE: Dell cancels VMWare contract after Broadcom purchase
@john-c Well, when I get my quote after jumping through their hoops, I'll let y'all know roughly how expensive it is. We have to license 576 CPU cores. We're getting quotes for Standard and Cloud Foundation.
I have yet to seriously try Proxmox, but that is next after I am done testing XCP-ng. I have until September of 2025, which seems like a long time, but it goes quickly when trying to find a replacement for vSphere.
-
Let's Test the HA
I finally have my environment built. Again lol. I want to test the HA. First, a little about my setup...
- Chassis - This is a Supermicro "Fat Twin". Think of it like a mini blade server. There are 4 hosts inside this single chassis. All 4 hosts are powered by the chassis's dual PSUs, so pulling the power will kill the entire chassis, not just a single host.
- Hosts - One host is standalone. I will be keeping XOA on that host. The remaining 3 hosts are in a cluster. All hosts are identical: dual Xeon L5630s I think, with a single NIC and 48GB of memory. I have a test Debian VM on each host, set up identically.
- Storage - Each host has a single SSD, which is where XCP-ng is installed, and two 6TB HDDs that are in an XOSTOR totaling 16TB. Except the standalone. HA is enabled on the cluster as well as on the test VMs.
How shall we test the HA on this? Last time I pulled a network cable and the entire thing went haywire: the storage across all nodes became inaccessible. We also pulled drives out of the XOSTOR, but that did nothing, so that was good. But there were no alerts for drive failures or loss of storage space. The only other thing I can imagine doing is pulling one of the hosts out while it's running. That should simulate an entire node becoming unreachable. I'll await anyone's suggestions.
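For anyone reproducing this setup, enabling HA on the pool and protecting the test VMs was done roughly like this. A hedged sketch using the stock xe CLI; the SR and VM UUIDs are placeholders, and the XOSTOR (LINSTOR-type) SR is assumed to be usable as the heartbeat SR:

```
# Find the shared SR to use for the HA statefile/heartbeat
xe sr-list type=linstor params=uuid,name-label

# Enable HA on the pool, using the shared SR for heartbeats
xe pool-ha-enable heartbeat-sr-uuids=<sr-uuid>

# Mark each test VM as protected so HA restarts it elsewhere on host failure
xe vm-param-set uuid=<vm-uuid> ha-restart-priority=restart order=1
```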
-
RE: HA Operation Would Break Failover Plan
@olivierlambert Correct, sir. I apologize for wasting your time.