Sid Karunaratne

sid

I also went troubleshooting and found the same as @MajorP93. Specifically, I saw this in the kernel logs (viewable with either dmesg or journalctl -k):

Freezing of tasks failed after 20.005 seconds (1 task refusing to freeze, wq_busy=1)

Quoting askubuntu.com:

Before going into suspend (or hibernate for that matter), user space processes and (some) kernel threads get frozen. If the freezing fails, it will either be due to a user space process or a kernel thread failing to freeze.

To freeze a user space process, the kernel sends it a signal that is handled automatically and, once received, cannot be ignored. If, however, the process is in the uninterruptible sleep state (e.g. waiting for I/O that cannot complete due to the device being unavailable), it will not receive the signal straight away. If this delay lasts longer than 20s (=default freeze timeout, see /sys/power/pm_freeze_timeout (in milliseconds)), the freezing will fail.

NFS, CIFS and FUSE amongst others have been historically known for causing issues like that.

Also from that post:

You can grep the problematic task like this # dmesg |grep "task.*pid"

In my case it was Prometheus Docker containers.
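
If you'd rather script that check than eyeball the log, here is a minimal sketch. It assumes the kernel prints the stuck task in a `task:<name> ... pid:<n>` form (that is what the grep above matches on newer kernels; the exact format may differ on yours), and the function names are made up for illustration:

```python
import re
from pathlib import Path

# Path quoted in the askubuntu answer above; the value is in milliseconds.
FREEZE_TIMEOUT_PATH = Path("/sys/power/pm_freeze_timeout")

# Both patterns are assumptions derived from the log line quoted above and
# from the `task.*pid` grep suggestion; adjust them to your kernel's output.
FREEZE_FAILED = re.compile(r"Freezing of tasks failed after ([\d.]+) seconds")
TASK_LINE = re.compile(r"task:(\S+).*?pid:\s*(\d+)")


def read_freeze_timeout_ms(path=FREEZE_TIMEOUT_PATH):
    """Return the freeze timeout in ms, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def find_freeze_failures(kernel_log):
    """Return (seconds waited per failed freeze, [(task name, pid), ...])."""
    failures = []
    tasks = []
    for line in kernel_log.splitlines():
        failed = FREEZE_FAILED.search(line)
        if failed:
            failures.append(float(failed.group(1)))
        task = TASK_LINE.search(line)
        if task:
            tasks.append((task.group(1), int(task.group(2))))
    return failures, tasks


if __name__ == "__main__":
    # Sample text only; in practice feed in the output of `journalctl -k`.
    sample = (
        "Freezing of tasks failed after 20.005 seconds "
        "(1 task refusing to freeze, wq_busy=1)\n"
        "task:prometheus      state:D stack:0     pid:4242  ppid:1\n"
    )
    print(find_freeze_failures(sample))
    print(read_freeze_timeout_ms())
```

The second print shows the configured timeout if the sysfs file is readable on that machine, and None otherwise.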

sid

@cichy I know this isn't as easy as what you're asking for, but I wrote some terrible Python code.

It relies on health checks being defined as VM tags, or at least the management agent being detected. For example, in my Terraform code I have these tags on a test postgres instance and test nginx instances respectively:

# postgres
  tags = [
    "bootOrder/agent-detect-timeout=45",
    "bootOrder/ip=${jsonencode("auto")}",
    "bootOrder/healtcheck/tcp=${jsonencode({
      "port" : 5432,
    })}",
  ]

# nginx
  tags = [
    "bootOrder/agent-detect-timeout=45",
    "bootOrder/ip=${jsonencode("auto")}",
    "bootOrder/healtcheck/http=${jsonencode({
      "port" : 80,
      "scheme" : "http",
      "path" : "/"
    })}",
  ]

Then the actual Python:

#!/usr/bin/env python3
import urllib3
import json
import os
import sys
import socket
import time
import logging

logging.basicConfig(level=logging.INFO)

BOOT_ORDER = [
    # Postgres
    ["55e88cb4-0c50-8384-2149-cf73e40b8c8e"],
    # nginx
    ["ba620f01-69d1-ddd8-b1d4-c256abe07e05", "bbe333bd-380a-1f94-4052-881c763b6177"],
]

DEFAULT_AGENT_DETECT_TIMEOUT_SECONDS = 60

class HealthCheck:
    def __init__(self, target: str, config: dict) -> None:
        self.type = "base"
        self.target = target
        self.config = config
        self.timeout = 3
        self.retry_max_count = 5
        self.retry_cur_count = 0
        self.retry_sleep = 10

    def _retry(self):
        if self.retry_cur_count == 0:
            logging.info("Starting %s healthcheck against %s", self.type, self.target)
            self.retry_cur_count += 1
            return True
        if self.retry_cur_count == self.retry_max_count:
            logging.warning("Failed healthcheck of type %s for %s", self.type, self.target)
            return False
        time.sleep(self.retry_sleep)
        self.retry_cur_count += 1
        return True


class TCPHealthCheck(HealthCheck):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.type = "TCP"

    def run(self):
        port = self.config.get("port")
        while self._retry():
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                sock.settimeout(self.timeout)
                success = sock.connect_ex((self.target, port)) == 0
                if success:
                    return True
        return False


class HttpHealthCheck(HealthCheck):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.type = "HTTP"

    def run(self):
        verify_tls = self.config.get("tls_verification", True)
        http = urllib3.PoolManager(
            cert_reqs="CERT_REQUIRED" if verify_tls else "CERT_NONE",
        )
        scheme = self.config.get("scheme", "http")
        port = self.config.get("port", 80)
        path = self.config.get("path", "").lstrip("/")
        url = f"{scheme}://{self.target}:{port}/{path}"
        while self._retry():
            try:
                response = http.request("GET", url, timeout=self.timeout)
            except urllib3.exceptions.HTTPError as err:
                # Connection refused, timeouts, etc.: log and retry instead of crashing.
                logging.info("HTTP healthcheck against %s failed: %s", url, err)
                continue
            if 200 <= response.status < 300:
                return True
        return False

class XoaClient:
    def __init__(self, base_url: str, token: str) -> None:
        self.base_url = base_url.rstrip("/")
        self.tags_prefix = "bootOrder/"
        self.token = token
        self.http = urllib3.PoolManager()
        self.headers = {
            "Content-Type": "application/json",
            "Cookie": f"token={self.token}",
        }
        self._vm_cache = {}

    def vm_ip(self, uuid):
        vm_tags = self._extract_vm_tags(uuid)
        ip = vm_tags.get("ip", "auto")
        if ip != "auto":
            return ip
        return self._get_vm(uuid).get("mainIpAddress")

    def vm_healthcheck(self, uuid):
        vm_tags = self._extract_vm_tags(uuid)
        tcp = vm_tags.get("healtcheck/tcp")
        http = vm_tags.get("healtcheck/http")
        return tcp, http


    def _get_vm(self, uuid: str):
        url = f"{self.base_url}/rest/v0/vms/{uuid}"
        # if url in self._vm_cache:
        #     return self._vm_cache[url]
        response = self.http.request("GET", url, headers=self.headers)
        result = self._handle_json_response(response)

        self._vm_cache[url] = result
        return result

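    def _extract_vm_tags(self, uuid: str) -> dict:
        dict_tags = {}
        # A VM without tags may have no "tags" value; treat that as an empty list.
        tags = self._get_vm(uuid).get("tags") or []
        for tag in tags:
            # Skip unrelated tags and bootOrder/ tags that lack a "=value" part.
            if not tag.startswith(self.tags_prefix) or "=" not in tag:
                continue
            k, v = tag.split("=", 1)
            k = k[len(self.tags_prefix):]
            dict_tags[k] = json.loads(v)
        return dict_tags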

    def start_vm(self, uuid: str):
        if self._get_vm(uuid).get("power_state") == "Running":
            return
        url = f"{self.base_url}/rest/v0/vms/{uuid}/actions/start?sync=true"
        response = self.http.request("POST", url, headers=self.headers)
        if response.status != 204:
            raise Exception(f"HTTP {response.status}: {response.data.decode('utf-8')}")
        return

    def management_agent_detected(self, uuid: str) -> bool:
        return self._get_vm(uuid).get("managementAgentDetected")

    def vm_agent_detection_timeout(self, uuid: str, default_seconds: int = 60) -> int:
        tags = self._extract_vm_tags(uuid)
        return tags.get("agent-detect-timeout", default_seconds)

    def _handle_json_response(self, response):
        if response.status >= 200 and response.status < 300:
            return json.loads(response.data.decode("utf-8"))
        else:
            raise Exception(f"HTTP {response.status}: {response.data.decode('utf-8')}")



if __name__ == "__main__":
    xoa_url = os.getenv("XOA_URL")
    xoa_token = os.getenv("XOA_TOKEN")
    if not xoa_url:
        logging.fatal("Missing XOA_URL environment variable")
        sys.exit(1)
    if not xoa_token:
        logging.fatal("Missing XOA_TOKEN environment variable")
        sys.exit(1)
    client = XoaClient(xoa_url, xoa_token)

    group_number = 1
    for boot_group in BOOT_ORDER:
        logging.info("Starting to boot group %s, length %s", group_number, len(boot_group))
        # These should be booted in parallel, but aren't
        for uuid in boot_group:
            client.start_vm(uuid)
            timeout = client.vm_agent_detection_timeout(
                uuid=uuid,
                default_seconds=DEFAULT_AGENT_DETECT_TIMEOUT_SECONDS,
            )
            mad = False
            for n in range(timeout):
                mad = client.management_agent_detected(uuid)
                if mad:
                    break
                time.sleep(1)
            if not mad:
                raise Exception(f"No management agent detected in host {uuid}")
            target = client.vm_ip(uuid)
            tcp, http = client.vm_healthcheck(uuid)
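            # run() returns False once its retries are exhausted, so stop here
            # rather than reporting success and booting the next VM.
            if tcp and not TCPHealthCheck(target=target, config=tcp).run():
                raise Exception(f"TCP healthcheck failed for {target}")
            if http and not HttpHealthCheck(target=target, config=http).run():
                raise Exception(f"HTTP healthcheck failed for {target}")
            logging.info("All healthchecks passed for %s", target)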
        group_number += 1

It'll boot each VM in order and wait for its agent to be detected, then wait for all its health checks to pass before moving on to the next VM.
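
The script's own comment notes that VMs within a group should be booted in parallel but aren't. A rough sketch of one way to do that with a thread pool follows. `boot_one` is a hypothetical callable that would wrap the per-VM steps (start, wait for the agent, run the health checks) and raise on failure; neither it nor the two helper functions exist in the script above.

```python
from concurrent.futures import ThreadPoolExecutor


def boot_group_in_parallel(boot_group, boot_one, max_workers=4):
    """Run boot_one(uuid) concurrently for every UUID in the group.

    Returns a dict mapping each failed UUID to the exception it raised;
    an empty dict means the whole group came up.
    """
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {uuid: pool.submit(boot_one, uuid) for uuid in boot_group}
        for uuid, future in futures.items():
            error = future.exception()  # blocks until this VM's task is done
            if error is not None:
                failures[uuid] = error
    return failures


def boot_in_order(boot_order, boot_one):
    """Boot groups one after another, stopping at the first failed group."""
    for number, group in enumerate(boot_order, start=1):
        failures = boot_group_in_parallel(group, boot_one)
        if failures:
            raise RuntimeError(f"Boot group {number} failed: {failures}")
```

Groups still run strictly one after another, so the ordering guarantee is unchanged; only the VMs inside a group overlap. Whether one client object can safely be shared between the threads is something to check before relying on this.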

This is by no means production-ready code, but it might be a decent solution.

Finally, a systemd timer would be set up on the XOA instance to auto-run this script on boot.

sid

I'd like the terraform provider to have a xenorchestra_backup resource.

For me, part of the process of spinning up a new set of VMs is to create backup jobs for those new VMs.

Today I can manually make a backup job which applies to VMs with a certain tag, then later, via TF, make VMs with that tag. However, I'd prefer being able to make a xenorchestra_backup resource, with settings specific to that VM (or set of VMs).

Furthermore, if the idea with backup schedules is that they can be used across backup jobs, then that would mean a new xenorchestra_backup_schedule resource type too, which would be referenced in the xenorchestra_backup. Also, this might require creating a xenorchestra_remote data-source.

Having said that, I am not a paying customer, so I understand this is a low priority request, and I do have a workaround.

sid

@HolgiB said in Better / more flexible way to add and edit CloudInit templates in XO ?:

All this Terraform / Open Tofu stuff is nice but I guess generating VMs via Cloud Init and XO will be the entry level for everyone before trying out a much bigger infrastructure as code solution, right ?

Starting to use Terraform is not a big step, and it's how I manage even a small setup with fewer than 20 VMs. Not everything can be managed that way, though: for example, SRs cannot currently be created through Terraform, so XO is still needed too.

those init files often are technically correct but still fail for some strange reason

I agree, debugging cloud-init is not a fun task. I don't know if it is an approach you could take, but in my case I keep my cloud-init extremely simple. It only sets up networking and a single fixed user account with an SSH key. Then other tooling, for example Ansible, takes over from there to configure the VM.

sid

@HolgiB via Settings → Cloud configs you can already store some pre-made configs, for example the DHCP ones.

However it might not be a good solution for your fixed-IP configs, as you might forget to replace the fixed network settings and end up launching a VM with incorrect settings.

And yes, I also wish those textareas used all the available horizontal space.

As @vmpr mentioned, keeping them in version control sounds like a good plan, and also means you can make your own web tool which can validate / prompt for input based on your templates.
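
As a rough idea of what such a tool's core could look like, here is a small sketch that refuses to render a template until every placeholder has a value. The template text and the placeholder names (`address`, `gateway`) are made up for illustration and are not taken from XO:

```python
from string import Template

# Hypothetical fixed-IP network template kept in version control.
NETWORK_TEMPLATE = """\
network:
  version: 1
  config:
    - type: physical
      name: eth0
      subnets:
        - type: static
          address: ${address}
          gateway: ${gateway}
"""


def placeholders(template_text):
    """Return the sorted placeholder names used in the template."""
    names = set()
    for match in Template.pattern.finditer(template_text):
        name = match.group("named") or match.group("braced")
        if name:
            names.add(name)
    return sorted(names)


def render(template_text, values):
    """Fill in the template, failing loudly if any value is missing or empty."""
    missing = [name for name in placeholders(template_text) if not values.get(name)]
    if missing:
        raise ValueError(f"Missing values for: {', '.join(missing)}")
    return Template(template_text).substitute(values)


if __name__ == "__main__":
    print(placeholders(NETWORK_TEMPLATE))
    print(render(NETWORK_TEMPLATE, {"address": "192.0.2.10/24", "gateway": "192.0.2.1"}))
```

A web form or CLI prompt could be generated from the `placeholders()` result, which is what guards against launching a VM with a forgotten fixed IP.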

sid

@DustinB Thank you. I understand, it's absolutely a cosmetic request.

sid

See attached screenshot of 4 running tasks:

In it, the tasks all have a % number (3%, 3%, 50% and 96%) and a progress bar. The number is shown as part of the task name, rather than near its corresponding progress bar.

Perhaps I'm missing the reason why it is as it is, but I think it makes more sense to have the number near its corresponding bar, probably either on its left or right. Does that make sense, or did I miss something?

sid

@cichy I think there's more to it; namely that after a VM has booted, it might still be a while until its services are available. For example, a Ceph cluster or an Elasticsearch cluster can take a while to come up. And what if something which should be quick to start fails to start?

So I think not only do you want to boot VMs in a certain order, but you also want a health check to pass before proceeding to the next VM.

If that is the case, then this might be out of scope for the hypervisor. Instead, it could be a script running on a VM which is set to auto-start and doesn't itself depend on other VMs. The script would use the API to start each VM and perform its health checks before continuing to the next.

sid

@Cyrille Aah, I didn't know about the branches. I had started my own attempt to implement the feature, so it's good to know I can abandon that work. Oh boy, discovering the settings map uses an empty key was a moment.

OK, I will wait. Thanks to your team for the work on the Terraform provider.
