    VM Boot Order via XO?

      cichy

      Hello ~

      I found this article within the forums. Is it only possible to prioritize the boot order of VMs via the CLI, or is there a facility within XO to do so? Thanks in advance for your help.

      USE CASE

      • Multiple DB types and clusters of DBs
      • K3S cluster
      • K8S cluster
      • Swarm cluster
      • DBs are required to be up prior to the app clusters (as listed), and in some cases there are interdependencies between various apps across K3S, K8S, and Swarm

      All I am looking to do at the moment is ensure the DB VMs are up and operational prior to initializing my clusters. I can create yml configs to ensure the interdependencies are met within my app infra at a later date. With ESXi this was super easy, and even with Proxmox to a degree.

      Still very much new to XCP-ng and need some assistance finding this within the XO GUI.

        ThasianXi @cichy

        @cichy This can be accomplished using vApps via the CLI.
        Check out this forum post in which a member shared the process: XCP-ng Pool Boot Order
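
For reference, a rough sketch of what the CLI route involves, written as a Python helper that only builds the `xe` command lines and does not run them. It assumes the `xe` subcommands `vm-param-set` and `appliance-start` and the VM parameters `appliance`, `order` and `start-delay` behave as I understand them, and that the vApp already exists. The function name, the placeholder UUIDs and the 30-second delay are made up for illustration, so check everything against the linked post before relying on it.

```python
from typing import List


def build_vapp_commands(appliance_uuid: str, boot_groups: List[List[str]],
                        start_delay_s: int = 30) -> List[List[str]]:
    """Return xe argument lists that place VMs in a vApp with a start order."""
    commands = []
    for order, group in enumerate(boot_groups):
        for vm_uuid in group:
            # Attach the VM to the vApp (assumed parameter name: appliance).
            commands.append(["xe", "vm-param-set", f"uuid={vm_uuid}",
                             f"appliance={appliance_uuid}"])
            # Lower order values start first (assumed parameter name: order).
            commands.append(["xe", "vm-param-set", f"uuid={vm_uuid}",
                             f"order={order}"])
            # Pause after this VM starts (assumed parameter name: start-delay).
            commands.append(["xe", "vm-param-set", f"uuid={vm_uuid}",
                             f"start-delay={start_delay_s}"])
    # Starting the vApp is what applies the order.
    commands.append(["xe", "appliance-start", f"uuid={appliance_uuid}"])
    return commands


if __name__ == "__main__":
    for argv in build_vapp_commands("APPLIANCE-UUID", [["DB-UUID"], ["APP-UUID"]]):
        print(" ".join(argv))
```

Note that a fixed delay only spaces the starts out; it does not confirm that the services inside the VM are actually ready.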

          cichy @ThasianXi

          @ThasianXi this is a solution, but not the one I was hoping for. I was explicitly hoping for an XO GUI-led solution, something similar to the way VMware allows one to drag VMs into boot preference order. @olivierlambert if this is not a feature request already, how do I make it one?

            sid @cichy

            @cichy I think there's more to it: after a VM has booted, it might still be a while until its services are available. For example, a Ceph cluster or an Elasticsearch cluster can take a while to come up. And what if something that should be quick to start fails to start?

            So I think not only do you want to boot VMs in a certain order, but you also want a health check to pass before proceeding to the next VM.

            If that is the case, then this might be out of scope of the hypervisor, and instead this could be a script that runs on a VM which is set to auto-start, and doesn't itself depend on other VMs, where it uses the API to start VMs and perform health checks before continuing to the next.
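
The idea above could be sketched roughly like this. It is only an outline under assumptions: `start` and `is_healthy` are hypothetical callables standing in for "start this VM through the API" and "run this VM's health check", and the VM names are placeholders.

```python
from typing import Callable, List


def boot_in_order(groups: List[List[str]],
                  start: Callable[[str], None],
                  is_healthy: Callable[[str], bool]) -> List[str]:
    """Start each group in turn; stop at the first VM whose check fails.

    Returns the VMs that were started and passed their check.
    """
    ready = []
    for group in groups:
        for vm in group:
            start(vm)
            if not is_healthy(vm):
                # Do not continue to dependants of a VM that is not healthy.
                raise RuntimeError(
                    f"{vm} failed its health check; ready so far: {ready}"
                )
            ready.append(vm)
    return ready


if __name__ == "__main__":
    started = []
    print(boot_in_order([["db"], ["app-1", "app-2"]],
                        start=started.append,
                        is_healthy=lambda vm: True))
```

Raising on the first failed check is a deliberate choice here: later groups depend on earlier ones, so carrying on would only start VMs whose dependencies are missing.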

              cichy @sid

              @sid this is how my K8S/K3S and Stack yml configs work. They leverage interdependencies and health checks across clusters (ping via IP and listen for a reply, etc.). An API or potentially webhooks could work. Though what I’m asking/looking for is a way that XO could facilitate this, even if the “script” were added via the XO UI and I could expose apps/services/VMs dynamically via said UI. Proxmox has something similar to this; it’s buried within the VM config options in the UI.

                Greg_E @cichy

                @cichy

                Best way to implement this from XO would be to wait until the management agent comes up and then move on to the next VMs in the boot order.
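
Waiting for the agent is essentially a poll with a deadline. A minimal sketch, assuming some `predicate` callable that answers "has the management agent been detected yet?" (the predicate here is a stand-in, not a real XO call):

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout_s: float,
               interval_s: float = 1.0) -> bool:
    """Poll predicate until it returns True or timeout_s elapses."""
    # A monotonic clock keeps the deadline correct even if wall time jumps.
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Stand-in for the agent check: becomes true on the third poll.
    polls = iter([False, False, True])
    print(wait_until(lambda: next(polls), timeout_s=5, interval_s=0.01))
```

Using a deadline instead of counting iterations means slow API calls eat into the timeout and do not silently extend it.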

                I'll have to check out the CLI method and see what's involved. There are a few VMs I want to boot in order if the power goes out, such as getting the AD VMs up and running so DHCP and DNS are working before other stuff gets booted.

                My system is small, so I don't have things like the "traditional" LAMP stack with separate application, DB, and other VMs, and I've never looked into this aspect. But it's right on the surface for VMware, and a lot of people are going to be looking for this functionality in one way or another. The VMware drag and drop makes it easy for a junior-level admin to configure a vApp to handle this type of requirement.

                  cichy @Greg_E

                  @Greg_E exactly! I'm thinking of my junior/int sysadmins who are very comfortable with the VMware process (vCenter). It's dead simple, yet even vCenter doesn't have a facility that can scan for app/service feedback as a health check. So this would be a first.

                    sid @cichy

                    @cichy I know this isn't as easy as what you're asking for, but I wrote some terrible Python code.

                    It relies on health checks being defined as VM tags, or at least on the management agent being detected. For example, in my Terraform code I have these tags on a test Postgres instance and on test nginx instances, respectively:

                    # postgres
                      tags = [
                        "bootOrder/agent-detect-timeout=45",
                        "bootOrder/ip=${jsonencode("auto")}",
                        "bootOrder/healtcheck/tcp=${jsonencode({
                          "port" : 5432,
                        })}",
                      ]
                    
                    # nginx
                      tags = [
                        "bootOrder/agent-detect-timeout=45",
                        "bootOrder/ip=${jsonencode("auto")}",
                        "bootOrder/healtcheck/http=${jsonencode({
                          "port" : 80,
                          "scheme" : "http",
                          "path" : "/"
                        })}",
                      ]
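
To make the tag format concrete, here is a small standalone sketch of how such tags decode once Terraform has rendered them. It assumes `jsonencode` produces compact JSON, so the tags arrive as `bootOrder/<key>=<json>` strings; the `parse_boot_tags` name and the sample `backup-daily` tag are made up for the example.

```python
import json
from typing import Dict, List

PREFIX = "bootOrder/"


def parse_boot_tags(tags: List[str]) -> Dict[str, object]:
    """Decode "bootOrder/<key>=<json>" tags into a {key: value} dict."""
    parsed = {}
    for tag in tags:
        if not tag.startswith(PREFIX) or "=" not in tag:
            continue  # unrelated or malformed tag
        # Split on the first "=" only; the JSON value may contain more.
        key, raw = tag[len(PREFIX):].split("=", 1)
        parsed[key] = json.loads(raw)
    return parsed


if __name__ == "__main__":
    rendered = [
        "bootOrder/agent-detect-timeout=45",
        'bootOrder/ip="auto"',
        'bootOrder/healtcheck/tcp={"port":5432}',
        "backup-daily",
    ]
    print(parse_boot_tags(rendered))
```

The key is spelled `healtcheck` here on purpose, because that is the spelling the tags and the script both use.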
                    

                    Then the actual Python:

                    #!/usr/bin/env python3
                    import urllib3
                    import json
                    import os
                    import sys
                    import socket
                    import time
                    import logging
                    
                    logging.basicConfig(level=logging.INFO)
                    
                    BOOT_ORDER = [
                        # Postgres
                        ["55e88cb4-0c50-8384-2149-cf73e40b8c8e"],
                        # nginx
                        ["ba620f01-69d1-ddd8-b1d4-c256abe07e05", "bbe333bd-380a-1f94-4052-881c763b6177"],
                    ]
                    
                    DEFAULT_AGENT_DETECT_TIMEOUT_SECONDS = 60
                    
                    class HealthCheck:
                        def __init__(self, target: str, config: dict) -> None:
                            self.type = "base"
                            self.target = target
                            self.config = config
                            self.timeout = 3
                            self.retry_max_count = 5
                            self.retry_cur_count = 0
                            self.retry_sleep = 10
                    
                        def _retry(self):
                            if self.retry_cur_count == 0:
                                logging.info("Starting %s health check against %s", self.type, self.target)
                                self.retry_cur_count += 1
                                return True
                            if self.retry_cur_count == self.retry_max_count:
                                logging.warning("Failed health check of type %s for %s", self.type, self.target)
                                return False
                            time.sleep(self.retry_sleep)
                            self.retry_cur_count += 1
                            return True
                    
                    
                    class TCPHealthCheck(HealthCheck):
                        def __init__(self, **kwargs):
                            super().__init__(**kwargs)
                            self.type = "TCP"
                    
                        def run(self):
                            port = self.config.get("port")
                            while self._retry():
                                with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                                    sock.settimeout(self.timeout)
                                    success = sock.connect_ex((self.target, port)) == 0
                                    if success:
                                        return True
                            return False
                    
                    
                    class HttpHealthCheck(HealthCheck):
                        def __init__(self, **kwargs):
                            super().__init__(**kwargs)
                            self.type = "HTTP"
                    
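                        def run(self):
                            # Build the pool and URL once; certificate checks are skipped only
                            # when the tag sets "tls_verification" to false.
                            verify_tls = self.config.get("tls_verification", True)
                            http = urllib3.PoolManager(
                                cert_reqs="CERT_REQUIRED" if verify_tls else "CERT_NONE",
                            )
                            scheme = self.config.get("scheme", "http")
                            port = self.config.get("port", 80)
                            path = self.config.get("path", "").lstrip("/")
                            url = f"{scheme}://{self.target}:{port}/{path}"
                            while self._retry():
                                try:
                                    response = http.request('GET', url, timeout=self.timeout)
                                except urllib3.exceptions.HTTPError as exc:
                                    # Refused or timed-out connections mean "not up yet": retry.
                                    logging.info("HTTP check against %s not ready: %s", url, exc)
                                    continue
                                if 200 <= response.status < 300:
                                    return True
                            return False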
                    
                    class XoaClient:
                        def __init__(self, base_url: str, token: str) -> None:
                            self.base_url = base_url.rstrip("/")
                            self.tags_prefix = "bootOrder/"
                            self.token = token
                            self.http = urllib3.PoolManager()
                            self.headers = {
                                "Content-Type": "application/json",
                                "Cookie": f"token={self.token}",
                            }
                            self._vm_cache = {}
                    
                        def vm_ip(self, uuid):
                            vm_tags = self._extract_vm_tags(uuid)
                            ip = vm_tags.get("ip", "auto")
                            if ip != "auto":
                                return ip
                            return self._get_vm(uuid).get("mainIpAddress")
                    
                        def vm_healthcheck(self, uuid):
                            vm_tags = self._extract_vm_tags(uuid)
                            tcp = vm_tags.get("healtcheck/tcp")
                            http = vm_tags.get("healtcheck/http")
                            return tcp, http
                    
                    
                        def _get_vm(self, uuid: str):
                            # Always fetch fresh data: power_state and managementAgentDetected
                            # change while a VM boots, so a cached copy would stall the polling.
                            url = f"{self.base_url}/rest/v0/vms/{uuid}"
                            response = self.http.request("GET", url, headers=self.headers)
                            return self._handle_json_response(response)
                    
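                        def _extract_vm_tags(self, uuid: str) -> dict:
                            # Parse "bootOrder/<key>=<json>" tags into {key: value}.
                            dict_tags = {}
                            tags = self._get_vm(uuid).get("tags") or []
                            for tag in tags:
                                # Skip unrelated tags and malformed ones without "=".
                                if tag.startswith(self.tags_prefix) and "=" in tag:
                                    k, v = tag.split("=", 1)
                                    k = k[len(self.tags_prefix):]
                                    dict_tags[k] = json.loads(v)
                            return dict_tags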
                    
                        def start_vm(self, uuid: str):
                            if self._get_vm(uuid).get("power_state") == "Running":
                                return
                            url = f"{self.base_url}/rest/v0/vms/{uuid}/actions/start?sync=true"
                            response = self.http.request("POST", url, headers=self.headers)
                            if response.status != 204:
                                raise Exception(f"HTTP {response.status}: {response.data.decode('utf-8')}")
                            return
                    
                        def management_agent_detected(self, uuid: str) -> bool:
                            return bool(self._get_vm(uuid).get("managementAgentDetected"))

                        def vm_agent_detection_timeout(self, uuid: str, default_seconds: int = 60) -> int:
                            tags = self._extract_vm_tags(uuid)
                            return int(tags.get("agent-detect-timeout", default_seconds))
                    
                        def _handle_json_response(self, response):
                            if response.status >= 200 and response.status < 300:
                                return json.loads(response.data.decode("utf-8"))
                            else:
                                raise Exception(f"HTTP {response.status}: {response.data.decode('utf-8')}")
                    
                    
                    
                    if __name__ == "__main__":
                        xoa_url = os.getenv("XOA_URL")
                        xoa_token = os.getenv("XOA_TOKEN")
                        if not xoa_url:
                            logging.fatal("Missing XOA_URL environment variable")
                            sys.exit(1)
                        if not xoa_token:
                            logging.fatal("Missing XOA_TOKEN environment variable")
                            sys.exit(1)
                        client = XoaClient(xoa_url, xoa_token)
                    
                        group_number = 1
                        for boot_group in BOOT_ORDER:
                            logging.info("Starting to boot group %s, length %s", group_number, len(boot_group))
                            # These should be booted in parallel, but aren't
                            for uuid in boot_group:
                                client.start_vm(uuid)
                                timeout = client.vm_agent_detection_timeout(
                                    uuid=uuid,
                                    default_seconds=DEFAULT_AGENT_DETECT_TIMEOUT_SECONDS,
                                )
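                                # Poll roughly once per second until the management agent reports in.
                                mad = False
                                for _ in range(timeout):
                                    mad = client.management_agent_detected(uuid)
                                    if mad:
                                        break
                                    time.sleep(1)
                                if not mad:
                                    raise Exception(f"No management agent detected in VM {uuid}")
                                target = client.vm_ip(uuid)
                                tcp, http = client.vm_healthcheck(uuid)
                                # Stop if a check never passes, so dependent groups are not started.
                                if tcp and not TCPHealthCheck(target=target, config=tcp).run():
                                    raise Exception(f"TCP health check failed for VM {uuid}")
                                if http and not HttpHealthCheck(target=target, config=http).run():
                                    raise Exception(f"HTTP health check failed for VM {uuid}")
                                logging.info("All healthchecks passed for %s", target)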
                            group_number += 1
                    

                    It'll boot each VM in order and wait for its agent to be detected, then wait for all its health checks to pass before moving on to the next VM.
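
The script's own comment notes that VMs within a group should be booted in parallel but aren't. One possible way to do that, sketched with the standard library only: `bring_up` is a hypothetical callable that would wrap the start, agent wait and health checks for a single VM, and it is not part of the script above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def bring_up_group(uuids: List[str], bring_up: Callable[[str], None]) -> None:
    """Run bring_up for every VM in the group concurrently.

    Returns once all have finished; the first failure in list order
    is re-raised to the caller.
    """
    if not uuids:
        return
    with ThreadPoolExecutor(max_workers=len(uuids)) as pool:
        futures = [pool.submit(bring_up, uuid) for uuid in uuids]
        for future in futures:
            future.result()  # re-raises any exception from bring_up


if __name__ == "__main__":
    done = []
    bring_up_group(["nginx-1", "nginx-2"], done.append)
    print(sorted(done))
```

Groups would still be processed one after another, so the ordering between the database and the web servers is preserved.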

                    This is by no means production-ready code, but it might be a decent solution.

                    Finally, a systemd timer would be set up on the XOA instance to run this script automatically on boot.
