@olivierlambert
I guess I could build a TrueNAS quick. Maybe after my vacation.
Posts
-
RE: Let's Test the HA
Well, it appears the SSD I was using for the hypervisor died. So now I'm reinstalling XCP onto what was the Master on a "new" SSD. Good thing we have no shortage of hardware in our lab lol.
-
RE: Let's Test the HA
I let the environment calm down. And let things get back to normal. Gave it a few minutes and pulled out the Master. Which was XCP-HOST2.
It's been about 5 minutes, just checked XOA, and the cluster is gone. None of the VMs, nothing. How long should master selection take? I'll give it another 10 or so minutes before slotting the server back in place.
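For reference, here's roughly what I'd try from one of the surviving hosts if a new master never gets elected on its own. These are standard xe commands, but I'd use them carefully, since forcing a new master while the old one might come back can leave the pool in a confused state:
# From dom0 on a surviving host: has it noticed the master is gone?
xe host-is-in-emergency-mode
# If no new master is ever elected, force this host to take over...
xe pool-emergency-transition-to-master
# ...then point the remaining hosts at the new master
xe pool-recover-slaves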
EDIT
I just noticed the XOSTOR no longer exists either...
-
RE: Let's Test the HA
@john-c Oh you mean literally pull the power on the entire lab? I guess I could do that. Although our DC has dual 16kVA UPSes, dual 600-amp DC plants, and dual generators, so it would take a lot for that building to go dark. But it's a valid test.
-
RE: Let's Test the HA
When I mentioned a power failure, it was a reference to a small-scale test to simulate what would happen if the data centre were to lose power.
I was already in the process of pulling a sled when you posted. BUT, the chassis only has 2 power supplies; the individual servers don't have their own. So that wouldn't work. I mean, I guess I could power a host down individually. I'll add that to the tests as well.
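If I do power a host down individually, a graceful shutdown probably wouldn't prove much, since HA is really about unplanned failures. Assuming the BMCs on these Supermicro nodes are reachable over IPMI, a rough sketch of how I could fake a hard failure instead (the BMC address and credentials here are just placeholders for my lab):
# Hard power-off of a single node via IPMI, so XCP-ng gets no chance to shut down cleanly
ipmitool -I lanplus -H 192.168.1.24 -U ADMIN -P <bmc-password> chassis power off
# Bring it back once the test is done
ipmitool -I lanplus -H 192.168.1.24 -U ADMIN -P <bmc-password> chassis power on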
-
RE: Let's Test the HA
So I pulled one of the sleds, meaning one of the servers from the chassis. I have 3 hosts in the cluster and one standalone.
Standalone
- XCP-HOST1
Cluster
- XCP-HOST2 (Master)
- XCP-HOST3
- XCP-HOST4
Each host has a Debian VM on it. I pulled the sled for Host 4, and it was, from what I can tell, a success. The Debian VM that was on Host 4 moved to Host 3 on its own. And I noticed the XOSTOR dropped down to roughly 10TB, so it noticed the drives were gone.
After checking everything, I then slotted the server back in place, and it rejoined the pool. I even migrated the VM back to its home server after it was part of the pool again.
I think the next trick will be to pull the master and see what happens. In theory it should elect a new master and then spin up the VM someplace else. I'm going to give it about 10 more minutes to soak after doing all that and then pull the master. I will report back.
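Before pulling the master I'll probably sanity-check the HA plan from dom0 first, just so I know what xapi thinks it can survive. Something like this (the pool UUID is a placeholder):
# Confirm which host is currently the master
xe pool-list params=master --minimal
xe host-list params=uuid,name-label
# Confirm HA is on and how many host failures the plan is set to tolerate
xe pool-param-get uuid=<pool-uuid> param-name=ha-enabled
xe pool-param-get uuid=<pool-uuid> param-name=ha-host-failures-to-tolerate
# Ask xapi how many host failures the current plan could actually absorb
xe pool-ha-compute-max-host-failures-to-tolerate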
-
RE: Dell cancels VMWare contract after Broadcom purchase
@Midget The XCP-ng 8.2 LTS (for Current Production) along with Xen Orchestra has lots of features, including at least some that can't be found in VMware products or in Proxmox. Plus Vates is very responsive and willing to work on additions, either alone or with your employer's development team.
Oh I know. The team here has been phenomenal in helping me setup my lab environment for XCP-ng with XOA.
But we have a lot of needs we need to make sure work, which I plan on discussing here after I put it through its paces.
-
RE: Dell cancels VMWare contract after Broadcom purchase
@john-c Well, when I get my quote after jumping through their hoops, I'll let y'all know roughly how expensive it is. We have to license 576 CPU cores. We're getting quotes for Standard and Cloud Foundation.
I have yet to seriously try Proxmox, but that is next after I am done testing XCP-ng. I have till September of 2025, which seems like a long time, but it goes quick when you're trying to find a replacement for vSphere.
-
Let's Test the HA
I finally have my environment built. Again lol. I want to test the HA. First a little about my setup...
-
Chassis - This is a Supermicro "Fat Twin". Think of it like a mini blade server. There are 4 hosts inside this single chassis. All 4 hosts are powered by the same dual PSUs. So pulling the power will kill the entire chassis, not just a single host.
-
Hosts - One host is standalone; I will be keeping XOA on that host. The remaining 3 hosts are in a cluster. All hosts are identical: dual Xeon L5630s, I think, with a single NIC and 48GB of memory. I have a test Debian VM on each host, set up identically.
-
Storage - Each host has a single SSD, which is where XCP-ng is installed, and two 6TB HDDs that are part of an XOSTOR totaling 16TB (except the standalone). HA is enabled on the cluster as well as on the test VMs.
How shall we test the HA on this? Last time I pulled a network cable and the entire thing went haywire; the storage across all nodes became inaccessible. We also pulled drives out of the XOSTOR, but that did nothing, so that was good. But there were no alerts about drive failures or loss of storage space. The only other thing I can imagine doing is pulling one of the hosts out while it's running. That should simulate an entire node becoming unreachable. I'll await anyone's suggestions.
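In the meantime, here's the rough baseline check I'm planning to run before each test so I have something to compare against afterwards (UUIDs are placeholders, and the linstor commands need to run on whichever host currently has the controller):
# HA state of the pool
xe pool-param-get uuid=<pool-uuid> param-name=ha-enabled
xe pool-param-get uuid=<pool-uuid> param-name=ha-host-failures-to-tolerate
# Which VMs are actually protected, and where they currently live
xe vm-list is-control-domain=false params=name-label,ha-restart-priority,resident-on
# XOSTOR / LINSTOR health
linstor node list
linstor storage-pool list
linstor resource list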
-
RE: HA Operation Would Break Failover Plan
@olivierlambert Correct Sir. I apologize for wasting your time.
-
RE: HA Operation Would Break Failover Plan
It's been a long week. I have wasted your time. I never installed the guest tools.../facepalm
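For anyone else who makes the same mistake: the usual way to get the tools into a Debian guest is to insert the guest-tools ISO that ships with XCP-ng (from XOA or xe vm-cd-insert) and then, roughly, inside the VM:
# As root in the Debian VM: mount the guest-tools ISO and run the installer
# (Linux/install.sh is where it normally lives on the ISO)
mount /dev/cdrom /mnt
/mnt/Linux/install.sh
umount /mnt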
EDIT
I realized it right after I ran that command. HA is now enabled.
-
HA Operation Would Break Failover Plan
I got my environment back up and running. Reloaded all 3 hosts, wiped the drives, rebuilt the XOSTOR, and enabled HA. I built a simple Debian 12 VM on one of the hosts using the XOSTOR. But when I try to enable HA in the advanced options tab, I get the output I've posted below. What am I doing wrong now?
vm.set { "high_availability": "restart", "id": "62dde87b-5dbf-119f-07aa-94434ca348b3" } { "code": "HA_OPERATION_WOULD_BREAK_FAILOVER_PLAN", "params": [], "call": { "method": "VM.set_ha_restart_priority", "params": [ "OpaqueRef:207f8ee7-3b24-405f-a021-0a7a35b3a7d5", "restart" ] }, "message": "HA_OPERATION_WOULD_BREAK_FAILOVER_PLAN()", "name": "XapiError", "stack": "XapiError: HA_OPERATION_WOULD_BREAK_FAILOVER_PLAN() at Function.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12) at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/transports/json-rpc.mjs:35:21 at runNextTicks (node:internal/process/task_queues:60:5) at processImmediate (node:internal/timers:447:9) at process.callbackTrampoline (node:internal/async_hooks:130:17)" }
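As it turned out (see the posts above), the real cause was the missing guest tools, but for the record, here are a couple of things I'd poke at from the master's dom0 when xapi refuses a restart priority like this (UUIDs are placeholders):
# How many host failures the current HA plan can actually absorb right now
xe pool-ha-compute-max-host-failures-to-tolerate
# What the pool is configured to tolerate
xe pool-param-get uuid=<pool-uuid> param-name=ha-host-failures-to-tolerate
# Fallback: best-effort protection doesn't have to fit the strict failover plan
xe vm-param-set uuid=<vm-uuid> ha-restart-priority=best-effort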
-
RE: XOSTOR Creation Issues
Quick update. I ran this command for each drive on each host...
wipefs --all --force /dev/sdX
Then tried building the XOSTOR again. This time I got an error on the XOSTOR page that some random UUID already had an XOSTOR on it, but it built the XOSTOR anyway? I have no idea how or what happened, but it did.
So I have my XOSTOR back.
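If wipefs alone hadn't cleared it, the next thing I would have tried is scrubbing the leftover LVM metadata from the earlier failed attempts, per host and only on the XOSTOR data disks. Roughly:
# See which PVs / VGs are still hanging around from the old attempt
pvs
vgs
# Drop the old volume group the logs complained about, then the PV labels
vgremove -f linstor_group
pvremove -ff /dev/sdb /dev/sdc
# Finally clear any remaining signatures
wipefs --all --force /dev/sdb /dev/sdc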
-
RE: XOSTOR Creation Issues
So I burnt it all down. I thought it was going to go through, but it didn't create the XOSTOR. I do have this log, though...
xostor.create { "description": "Test Virtual SAN Part 2", "disksByHost": { "e9b5aa92-660c-4dad-98c7-97de52556f22": [ "/dev/sdb", "/dev/sdc" ], "eb4cab8c-2234-4c7f-af84-d1b1494da60e": [ "/dev/sdb", "/dev/sdc" ], "68b9dc54-0bf3-4dc0-854f-d4cdabb47c23": [ "/dev/sdb", "/dev/sdc" ] }, "name": "XCP Storage 2", "provisioning": "thick", "replication": 2 } { "code": "SR_UNKNOWN_DRIVER", "params": [ "linstor" ], "call": { "method": "SR.create", "params": [ "e9b5aa92-660c-4dad-98c7-97de52556f22", { "group-name": "linstor_group/thin_device", "redundancy": "2", "provisioning": "thick" }, 0, "XCP Storage 2", "Test Virtual SAN Part 2", "linstor", "user", true, {} ] }, "message": "SR_UNKNOWN_DRIVER(linstor)", "name": "XapiError", "stack": "XapiError: SR_UNKNOWN_DRIVER(linstor) at Function.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12) at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/transports/json-rpc.mjs:35:21 at runNextTicks (node:internal/process/task_queues:60:5) at processImmediate (node:internal/timers:447:9) at process.callbackTrampoline (node:internal/async_hooks:130:17)" }
And after this I got an alert that the pool needed to be updated again. So I did the updates, rebooted the hosts, and tried to make the XOSTOR again. This time I got this...
xostor.create { "description": "Test Virtual SAN Part 2", "disksByHost": { "e9b5aa92-660c-4dad-98c7-97de52556f22": [ "/dev/sdb", "/dev/sdc" ], "eb4cab8c-2234-4c7f-af84-d1b1494da60e": [ "/dev/sdb", "/dev/sdc" ], "68b9dc54-0bf3-4dc0-854f-d4cdabb47c23": [ "/dev/sdb", "/dev/sdc" ] }, "name": "XCP Storage 2", "provisioning": "thick", "replication": 2 } { "errors": [ { "code": "LVM_ERROR(5)", "params": [ "File descriptor 3 (/var/log/lvm-plugin.log) leaked on pvcreate invocation. Parent PID 5262: python File descriptor 9 (/dev/urandom) leaked on pvcreate invocation. Parent PID 5262: python Can't initialize physical volume \"/dev/sdb\" of volume group \"linstor_group\" without -ff /dev/sdb: physical volume not initialized. Can't initialize physical volume \"/dev/sdc\" of volume group \"linstor_group\" without -ff /dev/sdc: physical volume not initialized. ", "", "", "[XO] This error can be triggered if one of the disks is a 'tapdevs' disk.", "[XO] This error can be triggered if one of the disks have children" ], "call": { "method": "host.call_plugin", "params": [ "OpaqueRef:fd2fcfdf-576b-4ea9-b4ac-20e91e1b4bbd", "lvm.py", "create_physical_volume", { "devices": "/dev/sdb,/dev/sdc", "ignore_existing_filesystems": "false", "force": "false" } ] } }, { "code": "LVM_ERROR(5)", "params": [ "File descriptor 3 (/var/log/lvm-plugin.log) leaked on pvcreate invocation. Parent PID 4884: python File descriptor 9 (/dev/urandom) leaked on pvcreate invocation. Parent PID 4884: python Can't initialize physical volume \"/dev/sdb\" of volume group \"linstor_group\" without -ff /dev/sdb: physical volume not initialized. Can't initialize physical volume \"/dev/sdc\" of volume group \"linstor_group\" without -ff /dev/sdc: physical volume not initialized. ", "", "", "[XO] This error can be triggered if one of the disks is a 'tapdevs' disk.", "[XO] This error can be triggered if one of the disks have children" ], "call": { "method": "host.call_plugin", "params": [ "OpaqueRef:057c701d-7d4a-4d59-8a36-db0a0ef65960", "lvm.py", "create_physical_volume", { "devices": "/dev/sdb,/dev/sdc", "ignore_existing_filesystems": "false", "force": "false" } ] } }, { "code": "LVM_ERROR(5)", "params": [ "File descriptor 3 (/var/log/lvm-plugin.log) leaked on pvcreate invocation. Parent PID 4623: python File descriptor 9 (/dev/urandom) leaked on pvcreate invocation. Parent PID 4623: python Can't initialize physical volume \"/dev/sdb\" of volume group \"linstor_group\" without -ff /dev/sdb: physical volume not initialized. Can't initialize physical volume \"/dev/sdc\" of volume group \"linstor_group\" without -ff /dev/sdc: physical volume not initialized. ", "", "", "[XO] This error can be triggered if one of the disks is a 'tapdevs' disk.", "[XO] This error can be triggered if one of the disks have children" ], "call": { "method": "host.call_plugin", "params": [ "OpaqueRef:48af9637-fc0f-402b-94da-64eac63d31f8", "lvm.py", "create_physical_volume", { "devices": "/dev/sdb,/dev/sdc", "ignore_existing_filesystems": "false", "force": "false" } ] } } ], "message": "", "name": "Error", "stack": "Error: at next (/usr/local/lib/node_modules/xo-server/node_modules/@vates/async-each/index.js:83:24) at onRejected (/usr/local/lib/node_modules/xo-server/node_modules/@vates/async-each/index.js:65:11) at onRejectedWrapper (/usr/local/lib/node_modules/xo-server/node_modules/@vates/async-each/index.js:67:41)" }
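The SR_UNKNOWN_DRIVER(linstor) part makes me suspect the LINSTOR SR driver packages didn't get reinstalled on every host after the rebuild. Something like this on each host should confirm it (the package names are the ones I remember from the XOSTOR docs, so double-check them):
# Is the LINSTOR SR driver actually installed on this host?
rpm -qa | grep -i linstor
# If not, install it and restart the toolstack so xapi picks up the driver
yum install -y xcp-ng-release-linstor
yum install -y xcp-ng-linstor
xe-toolstack-restart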
-
RE: XOSTOR Creation Issues
So I burnt it all down to ashes. Completely redid the storage. Reinstalled XCP-ng. Let's see what happens...
-
RE: XOSTOR Creation Issues
@learningdaily I have to believe there is a way to fix this. Maybe once I get the time I will reload everything. It won't take long I guess.
-
RE: XOSTOR Creation Issues
@learningdaily said in XOSTOR Creation Issues:
@Midget I believe the linstor manager only runs on one XCP-ng Host at a time. So if you ssh to each of your XCP-ng hosts, and run the command:
linstor resource list
The XCP-ng Host running the linstor manager would display the expected results. The other XCP-ng Hosts will display an error similar to what you saw.
Prior to implementing your fix, did you attempt the command from each XCP-ng host and what were the results?
I'd recommend undoing the 127.0.0.1 change and attempting from each host.
I haven't implemented any fix. It was just something I read.
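Rather than guessing which host has the controller at any given moment, I figure I can just check the service on each host and run the listing wherever it reports active. Something like:
# On each XCP-ng host: is the LINSTOR controller running here?
systemctl is-active linstor-controller
# On the host that reports "active":
linstor node list
linstor resource list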