VM Boot Order via XO?

ThasianXi

@cichy This can be accomplished using vApps via the CLI.
Check out this forum post in which a member shared the process: XCP-ng Pool Boot Order

cichy

@ThasianXi this is a solution, not the one I was hoping for. I was explicitly hoping for an XO guy-led solution. Something similar to the way VMWare allows one to drag vm's in boot pref. @olivierlambert if this is not a feature request already, how do I make it one?

sid

@cichy I think there's more to it; namely that after a VM has booted, it might still be a while until its services are available. For example a ceph cluster, or an Elasticsearch cluster can take a while to come up, or what if there's an issue where something which should be quick to start fails to start?

So I think not only do you want to boot VMs in a certain order, but you also want a health check to pass before proceeding to the next VM.

If that is the case, then this might be out of scope of the hypervisor, and instead this could be a script that runs on a VM which is set to auto-start, and doesn't itself depend on other VMs, where it uses the API to start VMs and perform health checks before continuing to the next.

cichy

@sid this is how my K8S/K3S and Stack yml configs work. They leverage interdependencies and health checks across clusters (ping via IP and listen for reply etc) work. API or potentially webhooks could work. Though, what I’m asking/looking for is a way that XO could facilitate. Even if the “script” were added via XO-UI and I could expose apps/services/vm’s dynamically via said UI. Proxmox has something similar to this, it’s buried with VM config options within the UI.

Greg_E

@cichy

Best way to implement this from XO would be to wait until the management agent comes up and then move on to the next VMs in the boot order.

I'll have to check out the cli method and see what's involved, there are a few VMs I want to boot in order if the power goes out, things like get the AD VMs up and running so DHCP and DNS are working before other stuff gets booted.

My system is small so I don't have things like the "traditional" LAMP stack with separate application, DB, other VMs so I've never looked into this aspect. But it's right on the surface for VMware and a lot of people are going to be looking for this functionality in one way or another. The VMware drag and drop makes it easy for a Junior level admin to configure a vApp to handle this type of requirement.

cichy

@Greg_E exactly! I'm thinking of my junior/int sys-admins who are very comfortable with VMWare process (vCenter). It's dead simple, yet, even vCenter doesn't have a facility that can scan for app/service feedback as a health check. So, this would be a first.

sid

@cichy I know this isn't as easy as what you're asking for, but I wrote some terrible python code.

It relies on health checks being defined as VM tags, or at least the management agent being detected. For example in my terraform code I have these tags on a test postgres instance and test nginx instances respectively:

# postgres
  tags = [
    "bootOrder/agent-detect-timeout=45",
    "bootOrder/ip=${jsonencode("auto")}",
    "bootOrder/healtcheck/tcp=${jsonencode({
      "port" : 5432,
    })}",
  ]

# nginx
  tags = [
    "bootOrder/agent-detect-timeout=45",
    "bootOrder/ip=${jsonencode("auto")}",
    "bootOrder/healtcheck/http=${jsonencode({
      "port" : 80,
      "scheme" : "http",
      "path" : "/"
    })}",
  ]

Then the actual python:

#!/usr/bin/env python3
import urllib3
import json
import os
import sys
import socket
import time
import logging

logging.basicConfig(level=logging.INFO)

BOOT_ORDER = [
    # Postgres
    ["55e88cb4-0c50-8384-2149-cf73e40b8c8e"],
    # nginx
    ["ba620f01-69d1-ddd8-b1d4-c256abe07e05", "bbe333bd-380a-1f94-4052-881c763b6177"],
]

DEFAULT_AGENT_DETECT_TIMEOUT_SECONDS = 60

class HealthCheck:
    def __init__(self, target: str, config: dict) -> None:
        self.type = "base"
        self.target = target
        self.config = config
        self.timeout = 3
        self.retry_max_count = 5
        self.retry_cur_count = 0
        self.retry_sleep = 10

    def _retry(self):
        if self.retry_cur_count == 0:
            logging.info("Starting %s healtcheck against %s", self.type, self.target)
            self.retry_cur_count += 1
            return True
        if self.retry_cur_count == self.retry_max_count:
            logging.warning('Failed Healtcheck of type %s for %s', self.type, self.target) 
            return False
        time.sleep(self.retry_sleep)
        self.retry_cur_count += 1
        return True


class TCPHealthCheck(HealthCheck):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.type = "TCP"

    def run(self):
        port = self.config.get("port")
        while self._retry():
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                sock.settimeout(self.timeout)
                success = sock.connect_ex((self.target, port)) == 0
                if success:
                    return True
        return False


class HttpHealthCheck(HealthCheck):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.type = "HTTP"

    def run(self):
        while self._retry():
            assert_hostname = self.config.get("tls_verification", True)
            http = urllib3.PoolManager(
                cert_reqs="CERT_REQUIRED" if assert_hostname else "CERT_NONE",
            )
            scheme = self.config.get("scheme", "http")
            port = self.config.get("port", 80)
            path = self.config.get("path", "").lstrip("/")
            url = f"{scheme}://{self.target}:{port}/{path}"
            response = http.request('GET', url, timeout=self.timeout)
            if response.status >= 200 and response.status < 300:
                return True
        return False

class XoaClient:
    def __init__(self, base_url: str, token: str) -> None:
        self.base_url = base_url.rstrip("/")
        self.tags_prefix = "bootOrder/"
        self.token = token
        self.http = urllib3.PoolManager()
        self.headers = {
            "Content-Type": "application/json",
            "Cookie": f"token={self.token}",
        }
        self._vm_cache = {}

    def vm_ip(self, uuid):
        vm_tags = self._extract_vm_tags(uuid)
        ip = vm_tags.get("ip", "auto")
        if ip != "auto":
            return ip
        return self._get_vm(uuid).get("mainIpAddress")

    def vm_healthcheck(self, uuid):
        vm_tags = self._extract_vm_tags(uuid)
        tcp = vm_tags.get("healtcheck/tcp")
        http = vm_tags.get("healtcheck/http")
        return tcp, http


    def _get_vm(self, uuid: str):
        url = f"{self.base_url}/rest/v0/vms/{uuid}"
        # if url in self._vm_cache:
        #     return self._vm_cache[url]
        response = self.http.request("GET", url, headers=self.headers)
        result = self._handle_json_response(response)

        self._vm_cache[url] = result
        return result

    def _extract_vm_tags(self, uuid: str) -> dict:
        dict_tags = {}
        tags = self._get_vm(uuid).get("tags")
        for tag in tags:
            if tag.startswith(self.tags_prefix):
                k,v = tag.split("=", 1)
                k = k[len(self.tags_prefix):]
                dict_tags[k] = json.loads(v)
        return dict_tags

    def start_vm(self, uuid: str):
        if self._get_vm(uuid).get("power_state") == "Running":
            return
        url = f"{self.base_url}/rest/v0/vms/{uuid}/actions/start?sync=true"
        response = self.http.request("POST", url, headers=self.headers)
        if response.status != 204:
            raise Exception(f"HTTP {response.status}: {response.data.decode('utf-8')}")
        return

    def management_agent_detected(self, uuid: str) -> bool:
        return self._get_vm(uuid).get("managementAgentDetected")

    def vm_agent_detection_timeout(self, uuid: str, default_seconds: int = 60) -> bool:
        tags = self._extract_vm_tags(uuid)
        return tags.get("agent-detect-timeout", default_seconds)

    def _handle_json_response(self, response):
        if response.status >= 200 and response.status < 300:
            return json.loads(response.data.decode("utf-8"))
        else:
            raise Exception(f"HTTP {response.status}: {response.data.decode('utf-8')}")



if __name__ == "__main__":
    xoa_url = os.getenv("XOA_URL")
    xoa_token = os.getenv("XOA_TOKEN")
    if not xoa_url:
        logging.fatal("Missing XOA_URL environment variable")
        sys.exit(1)
    if not xoa_token:
        logging.fatal("Missing XOA_TOKEN environment variable")
        sys.exit(1)
    client = XoaClient(xoa_url, xoa_token)

    group_number = 1
    for boot_group in BOOT_ORDER:
        logging.info("Starting to boot group %s, length %s", group_number, len(boot_group))
        # These should be booted in parallel, but aren't
        for uuid in boot_group:
            client.start_vm(uuid)
            timeout = client.vm_agent_detection_timeout(
                uuid=uuid,
                default_seconds=DEFAULT_AGENT_DETECT_TIMEOUT_SECONDS,
            )
            mad = False
            for n in range(timeout):
                mad = client.management_agent_detected(uuid)
                if mad:
                    break
                time.sleep(1)
            if not mad:
                raise Exception(f"No management agent detected in host {uuid}")
            target = client.vm_ip(uuid)
            tcp, http = client.vm_healthcheck(uuid)
            if tcp:
                hc = TCPHealthCheck(target=target, config=tcp)
                hc.run()
            if http:
                hc = HttpHealthCheck(target=target, config=http)
                hc.run()
            logging.info("All healthchecks passed for %s", target)
        group_number += 1

It'll boot each VM in order and wait for its agent to be detected, then wait for all its health checks to pass before moving on to the next VM.

This is by no means production ready code, but it might be a decent solution.

Finally a systemd timer would be set up on the XOA instance to auto-run this script on boot.

Greg_E

@cichy

I mentioned waiting for the management agent because that's already a function of checking your backups, so they have part of that puzzle in place.

cichy

@sid thanks for this! Not what I was looking for, as you stated, but useful nonetheless.

cichy

@Greg_E thanks for the clarification.