We have XOA (Enterprise) installed. At some point, after updating on the latest channel, it started throwing a warning about Node and npm versions. The Check XOA output says:

Node version: v16.13.2 does not satisfies ^14.15.4
xo-server config syntax
Disk space for /
Disk space for /var
Native SMB support
Fetching VM UUID
XOA version
Appliance registration
local SSH server
Internet connectivity
npm version: 8.4.0 does not satisfies ^6.14.9
XOA status

We've not noticed any issues, so just wondered if we can ignore this?
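For context, the caret in those ranges means "stay on the same major version", so Node v16 can never satisfy ^14.15.4 no matter the minor/patch. A quick way to see what a range accepts, assuming the semver npm package (a guess at what the check uses internally):

import * as semver from 'semver'

// ^14.15.4 is equivalent to ">=14.15.4 <15.0.0"
console.log(semver.satisfies('16.13.2', '^14.15.4')) // false: different major
console.log(semver.satisfies('14.18.3', '^14.15.4')) // true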
Details of current versions
Looks interesting!
We started using the XO Terraform provider around 12 months ago, and then built a small HTTP service (Node/TypeScript) that talks to the XO API to generate our Ansible inventory. We've been using both in production since then, and I'll share some of the details here.
We took the approach of implementing this as a service on our network and then leveraging ansible's ability to execute a shell script to retrieve the inventory.
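To give an idea of scale, the whole thing boils down to one endpoint. A minimal sketch, assuming Express (the framework and the buildInventory placeholder are illustrative, not necessarily what we actually run):

import express from 'express'

// Placeholder: the real implementation queries the XO API and applies the
// filtering and mapping described below.
async function buildInventory(filter: string): Promise<unknown> {
  return { all: { hosts: [] }, _meta: { hostvars: {} } }
}

const app = express()

app.get('/inventory', async (req, res) => {
  // ?filter= uses the same syntax we would type into the XO search box
  const filter = typeof req.query.filter === 'string' ? req.query.filter : ''
  res.json(await buildInventory(filter))
})

app.listen(3000)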
In our environment, we decided it was OK for the inventory to only include VMs (or hosts) that have an IP address; if they don't, Ansible can't really work with them anyway, so that's fine for us. The inventory service therefore has a couple of env vars that provide a filter for which entities and IPs to pick:
// config for the inventory service, assuming the env-var package
import * as env from 'env-var'
const config = {
  // no tag required by default
  required_tag: env.get('REQUIRED_TAG').default('').asString(),
  // any IP is valid for the inventory
  management_subnet: env.get('MANAGEMENT_SUBNETS').default('0.0.0.0/0').asArray(),
}
First off, we can require any VM or host to have a tag, e.g. ansible_managed:true, to appear in the inventory. Then it must have an IP in one of our management subnets; if more than one IP is available (e.g. management and public), the service will filter them.
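As a rough sketch of those two checks (not our exact code; IPv4 only, and it reuses the config object from the env-var snippet above):

// IPv4-only CIDR test, kept dependency-free for the sketch
const inSubnet = (ip: string, cidr: string): boolean => {
  const [net, bitsStr] = cidr.split('/')
  const bits = Number(bitsStr)
  const toInt = (a: string): number =>
    a.split('.').reduce((n, o) => (n << 8) + Number(o), 0) >>> 0
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0
  return (toInt(ip) & mask) === (toInt(net) & mask)
}

// only include a VM/host if it carries the required tag (when one is configured)
const hasRequiredTag = (tags: string[]): boolean =>
  config.required_tag === '' || tags.includes(config.required_tag)

// when several IPs are reported, keep the first one inside a management subnet
const pickManagementIp = (ips: string[]): string | undefined =>
  ips.find(ip => config.management_subnet.some(cidr => inSubnet(ip, cidr)))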
The http api for the inventory service uses the same filtering as xen-orchestra, so we can construct urls to retrieve partial inventories. This is useful for example as we have dev, production, etc, pools, and it gives us an easy way to target
https://inventory.internal/inventory?filter=env:monitoring%20mytag:foo
The response for the above request would look like this:

{
  "all": {
    "hosts": ["monitoring-1.internal"]
  },
  "_meta": {
    "hostvars": {
      "monitoring-1.internal": {
        "mytag": "foo",
        "ansible_group": "prometheus",
        "env": "monitoring",
        "inventory_name": "monitoring-1.internal",
        "ansible_host": "10.0.12.51",
        "xo_pool": "monitoring-pool",
        "xo_type": "VM",
        "xo_id": "033f8b6d-88e2-92e4-3c3e-bcaa01213772"
      }
    }
  },
  "prometheus": {
    "hosts": ["monitoring-1.internal"]
  }
}
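For completeness: however the ?filter= value gets evaluated (forwarded to XO or applied in the service), the matching amounts to checking terms against an object's tags and fields. XO's real filter grammar is richer than this, so treat the following as a deliberately simplified stand-in that only handles space-separated key:value terms matched against tags:

// Simplified stand-in for XO's filter syntax: every space-separated term must
// appear verbatim among the object's tags.
const matchesFilter = (tags: string[], filter: string): boolean =>
  filter
    .split(/\s+/)
    .filter(term => term !== '')
    .every(term => tags.includes(term))

With the example above, matchesFilter(['env:monitoring', 'mytag:foo', 'ansible_group:prometheus'], 'env:monitoring mytag:foo') is true, so monitoring-1.internal lands in the partial inventory.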
This VM has these tags in Xen Orchestra:

ansible_group: can be repeated, and places the VM/host into this group in the inventory. Other tags get split into key=value and placed into the host vars.
xo_*: added from the info in the API.
ansible_host: will be our management IP.
inventory_name: a slugified version of the VM name, but by convention our names are sane.
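To make those conventions concrete, here is a sketch of the per-object mapping (the field names and the slugify helper are illustrative; xo_pool, xo_type, xo_id and the other xo_* vars are copied straight from the API object and only hinted at here):

// hypothetical helper: lower-case the name and collapse anything odd into "-"
const slugify = (s: string): string => s.toLowerCase().replace(/[^a-z0-9.-]+/g, '-')

type Inventory = {
  all: { hosts: string[] }
  _meta: { hostvars: Record<string, Record<string, string>> }
  [group: string]: any // ansible groups created from ansible_group tags
}

// Adds one XO object (VM or host) to the inventory; `ip` is the management IP
// picked by the filtering shown earlier.
function addToInventory(inv: Inventory, name: string, tags: string[], ip: string): void {
  const inventoryName = slugify(name)
  const hostvars: Record<string, string> = {
    inventory_name: inventoryName,
    ansible_host: ip,
    // xo_pool, xo_type, xo_id, ... would be copied from the API object here
  }
  for (const tag of tags) {
    if (!tag.includes(':')) continue // plain tags are ignored in this sketch
    const [key, value] = tag.split(':')
    if (key === 'ansible_group') {
      // ansible_group can be repeated: each occurrence adds the host to a group
      inv[value] ??= { hosts: [] }
      inv[value].hosts.push(inventoryName)
    }
    // every key:value tag (ansible_group included) also becomes a host var
    hostvars[key] = value
  }
  inv.all.hosts.push(inventoryName)
  inv._meta.hostvars[inventoryName] = hostvars
}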
We also include hosts in the inventory, as we have various playbooks to run against them. All the same tagging and grouping applies to hosts as it does to VMs:
{
  ...
  "hostvars": {
    "xcp-001": {
      "ansible_group": "xen-host",
      "inventory_name": "xcp-001",
      "ansible_host": "10.0.55.123",
      "xo_pool": "monitoring-pool",
      "xo_type": "host",
      "xo_id": "92c1c2ab-fd1e-46e9-85f7-70868f1e9106",
      "xo_version": "8.2.0",
      "xo_product": "XCP-ng"
    }
  }
  ...
}
When we set up some infra for management by Terraform/Ansible, we'll typically use a combination of the shell-script inventory, static grouping, and extra group_vars if needed. For example, our /inventory directory contains:
01_inventory.sh
#!/bin/bash
curl -k https://inventory.internal/inventory?filter=k8s-cluster:admin 2>/dev/null
02_kubespray, which has its own group-name convention, so we map between our tags and their group names:
[kube-master:children]
k8s-master
[etcd:children]
k8s-master
[kube-node:children]
k8s-node
k8s-monitoring
[k8s-cluster:children]
kube-master
kube-node
Executing ansible-playbook -i /inventory, where /inventory is a directory, will then combine all the shell scripts and INI files to make the final inventory. Nice!
I did think about trying to package this API directly as a plugin for XO, but haven't had time to look into that yet. Let me know if any of this looks interesting.
Setting static max = dynamic min = dynamic max did not fix it. A toolstack restart on all hosts in the cluster seems to have resolved the issue; I have not had a chance to dig in to find a root cause yet.
A colleague was having issues with a VM and rebooted it via the XO UI.
Now it fails to start with the error message below.
I have tried a few things, e.g.:
xe vm-reset-powerstate force=true uuid=...
But so far, the same error every time I try to start the VM.
Has anyone seen this error before and can point towards a possible cause?
vm.start
{
"id": "d0ede814-9dd5-a932-c540-3070a03f8c72",
"bypassMacAddressesCheck": false,
"force": false
}
{
"code": "INTERNAL_ERROR",
"params": [
"xenopsd internal error: Memory_interface.Memory_error([S(Internal_error);S((Sys_error \"Broken pipe\"))])"
],
"call": {
"method": "VM.start",
"params": [
"OpaqueRef:a4169698-c3e4-4bd8-ac0b-e4d826d9ce3b",
false,
false
]
},
"message": "INTERNAL_ERROR(xenopsd internal error: Memory_interface.Memory_error([S(Internal_error);S((Sys_error \"Broken pipe\"))]))",
"name": "XapiError",
"stack": "XapiError: INTERNAL_ERROR(xenopsd internal error: Memory_interface.Memory_error([S(Internal_error);S((Sys_error \"Broken pipe\"))]))
at Function.wrap (/usr/local/lib/node_modules/xo-server/node_modules/xen-api/src/_XapiError.js:16:12)
at /usr/local/lib/node_modules/xo-server/node_modules/xen-api/src/transports/json-rpc.js:35:27
at AsyncResource.runInAsyncScope (async_hooks.js:197:9)
at cb (/usr/local/lib/node_modules/xo-server/node_modules/bluebird/js/release/util.js:355:42)
at tryCatcher (/usr/local/lib/node_modules/xo-server/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/usr/local/lib/node_modules/xo-server/node_modules/bluebird/js/release/promise.js:547:31)
at Promise._settlePromise (/usr/local/lib/node_modules/xo-server/node_modules/bluebird/js/release/promise.js:604:18)
at Promise._settlePromise0 (/usr/local/lib/node_modules/xo-server/node_modules/bluebird/js/release/promise.js:649:10)
at Promise._settlePromises (/usr/local/lib/node_modules/xo-server/node_modules/bluebird/js/release/promise.js:729:18)
at _drainQueueStep (/usr/local/lib/node_modules/xo-server/node_modules/bluebird/js/release/async.js:93:12)
at _drainQueue (/usr/local/lib/node_modules/xo-server/node_modules/bluebird/js/release/async.js:86:9)
at Async._drainQueues (/usr/local/lib/node_modules/xo-server/node_modules/bluebird/js/release/async.js:102:5)
at Immediate.Async.drainQueues [as _onImmediate] (/usr/local/lib/node_modules/xo-server/node_modules/bluebird/js/release/async.js:15:14)
at processImmediate (internal/timers.js:464:21)
at process.topLevelDomainCallback (domain.js:147:15)
at process.callbackTrampoline (internal/async_hooks.js:129:24)"
}
This looks like it was a problem with the VDIs on network shared storage.
Unpausing the VMs would "start" them, but they never actually booted; no console would appear.
In the end, migrating all the other VMs off the host and then rebooting the host cleared whatever was causing these VMs to be stuck, and we were able to delete them.
We're currently trialing a new storage provider, so this is definitely something we'll be looking into more with their support.
I've tried that; I think this is the correct way:
[12:54 LP1-XS-002 log]# xe vm-reset-powerstate uuid=4f81b4ce-c681-dec2-e147-090036de1a47 force=true
This operation cannot be completed because the server is still live.
host: b72027de-5c53-4ebe-a324-60c1af946d52 (LP1-XS-002)
@danp said in Unable remove VM:
I have a couple of VMs that have got stuck in a paused state, on different hosts.
- How did you confirm that the VM is in the wrong power state? Could be VDI instead.
By looking at the UI and seeing the power state as paused, and it failing to remove. I'm not sure how to check the state of a VDI; the VDIs are there, but the VMs won't boot or delete.
- Have you checked the logs for more details?
Yes, though I'm not entirely sure what might indicate a root cause. These entries relate to one of the VMs in question:
Jul 29 10:52:09 LP1-XS-002 xenopsd-xc: [ info||22 ||xenops_server] Caught Xenops_interface.Xenopsd_error([S(Cancelled);S(4397606)]) executing ["VM_reboot",["4f81b4ce-c681-dec2-e147-090036de1a47",[]]]: triggering cleanup actions
Jul 29 11:18:22 LP1-XS-002 xenopsd-xc: [ info||16 ||xenops_server] Caught Xenops_interface.Xenopsd_error([S(Cancelled);S(4398131)]) executing ["VM_poweroff",["4f81b4ce-c681-dec2-e147-090036de1a47",[]]]: triggering cleanup actions
Jul 29 11:53:55 LP1-XS-002 xenopsd-xc: [ info||31 ||xenops_server] Caught Xenops_interface.Xenopsd_error([S(Cancelled);S(4398834)]) executing ["VM_poweroff",["4f81b4ce-c681-dec2-e147-090036de1a47",[]]]: triggering cleanup actions
Jul 29 12:17:34 LP1-XS-002 xapi: [ warn||7833402 INET :::80|Async.VM.unpause R:c20f65c0d932|xenops] Potential problem: VM 4f81b4ce-c681-dec2-e147-090036de1a47 in power state 'paused' when expecting 'running'
- Have you checked under Dashboard > Health to ensure there aren't any VDIs attached to the Control Domain?
No VDIs attached to the control domain.
P.S. I previously used the method shown here to clear up this issue with a VDI.
I have a couple of VMs that have got stuck in a paused state, on different hosts.
Force shutdown in XO eventually times out.
I've tried following the instructions here:
https://support.citrix.com/article/CTX220777
HA is not enabled on the pool.
[11:33 LP1-XS-002 ~]# xe vm-shutdown uuid=4f81b4ce-c681-dec2-e147-090036de1a47 force=true
^C[11:34 LP1-XS-002 ~]# xe vm-reset-powerstate uuid=4f81b4ce-c681-dec2-e147-090036de1a47 force=true
This operation cannot be completed because the server is still live.
host: b72027de-5c53-4ebe-a324-60c1af946d52 (LP1-XS-002)
[11:34 LP1-XS-002 ~]# list_domains | grep 4f8
73 | 4f81b4ce-c681-dec2-e147-090036de1a47 | D P H
[11:34 LP1-XS-002 ~]# xl destroy 73
libxl: error: libxl_xshelp.c:201:libxl__xs_read_mandatory: xenstore read failed: `/libxl/73/type': No such file or directory
libxl: warning: libxl_dom.c:54:libxl__domain_type: unable to get domain type for domid=73, assuming HVM
Anyone got any clues how to resolve this? I don't need these VMs; they are just getting deleted, so a hard kill on them is fine.
We'd love to see the ability to configure the default columns when viewing lists of VMs, hosts or any object, but mainly the first two.
For us, Description is never used, but tags are important. To view tags we have to expand all the hosts, which causes the UI to slow down and constantly reflow while it loads all the stats charts.
It'd be great to have the ability to set the default columns to something like:
Name | IPs | Tags | Pool