Jonathon

I have amazing news!

After the upgrade to XCP-ng 8.3, I retested Velero backup, and it all just works.

Completed Backup

jonathon@jonathon-framework:~$ velero --kubeconfig k8s_configs/production.yaml backup describe grafana-test
Name:         grafana-test
Namespace:    velero
Labels:       objectset.rio.cattle.io/hash=c2b5f500ab5d9b8ffe14f2c70bf3742291df565c
              velero.io/storage-location=default
Annotations:  objectset.rio.cattle.io/applied=H4sIAAAAAAAA/4SSQW/bPgzFvwvPtv9OajeJj/8N22HdBqxFL0MPlEQlWmTRkOhgQ5HvPsixE2yH7iji8ffIJ74CDu6ZYnIcoIMTeYpcOf7vtIICji4Y6OB/1MdxgAJ6EjQoCN0rYAgsKI5Dyk9WP0hLIqmi40qjiKfMcRlAq7pBY+py26qmbEi15a5p78vtaqe0oqbVVsO5AI+K/Ju4A6YDdKDXqrVtXaNqzU5traVVY9d6Uyt7t2nW693K2Pa+naABe4IO9hEtBiyFksClmgbUdN06a9NAOtvr5B4DDunA8uR64lGgg7u6rxMUYMji6OWZ/dhTeuIPaQ6os+gTFUA/tR8NmXd+TELxUfNA5hslHqOmBN13OF16ZwvNQShIqpZClYQj7qk6blPlGF5uzC/L3P+kvok7MB9z0OcCXPiLPLHmuLLWCfVfB4rTZ9/iaA5zHovNZz7R++k6JI50q89BXcuXYR5YT0DolkChABEPHWzW9cK+rPQx8jgsH/KQj+QT/frzXCdduc/Ca9u1Y7aaFvMu5Ang5Xz+HQAA//8X7Fu+/QIAAA
              objectset.rio.cattle.io/id=e104add0-85b4-4eb5-9456-819bcbe45cfc
              velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.33.4+rke2r1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=33

Phase:  Completed


Namespaces:
  Included:  grafana
  Excluded:  <none>

Resources:
  Included cluster-scoped:    <none>
  Excluded cluster-scoped:    volumesnapshotcontents.snapshot.storage.k8s.io
  Included namespace-scoped:  *
  Excluded namespace-scoped:  volumesnapshots.snapshot.storage.k8s.io

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  true
Snapshot Move Data:          true
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    30m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2025-10-15 15:29:52 -0700 PDT
Completed:  2025-10-15 15:31:25 -0700 PDT

Expiration:  2025-11-14 14:29:52 -0800 PST

Total items to be backed up:  35
Items backed up:              35

Backup Item Operations:  1 of 1 completed successfully, 0 failed (specify --details for more information)
Backup Volumes:
  Velero-Native Snapshots: <none included>

  CSI Snapshots:
    grafana/central-grafana:
      Data Movement: included, specify --details for more information

  Pod Volume Backups: <none included>

HooksAttempted:  0
HooksFailed:     0

Completed Restore

jonathon@jonathon-framework:~$ velero --kubeconfig k8s_configs/production.yaml restore describe restore-grafana-test --details
Name:         restore-grafana-test
Namespace:    velero
Labels:       objectset.rio.cattle.io/hash=252addb3ed156c52d9fa9b8c045b47a55d66c0af
Annotations:  objectset.rio.cattle.io/applied=H4sIAAAAAAAA/3yRTW7zIBBA7zJrO5/j35gzfE2rtsomymIM45jGBgTjbKLcvaKJm6qL7kDwnt7ABdDpHfmgrQEBZxrJ25W2/85rSOCkjQIBrxTYeoIEJmJUyAjiAmiMZWRtTYhb232Q5EC88tquJDKPFEU6GlpUG5UVZdpUdZ6WZZ+niOtNWtR1SypvqC8buCYwYkfjn7oBwwAC8ipHpbqC1LqqZZWrtse228isrLqywapSdS0z7KPU4EQgwN+mSI8eezSYMgWG22lwKOl7/MgERzJmdChPs9veDL9IGfSbQRcGy+96IjszCCiyCRLQRo6zIrVd5AHEfuHhkIBmmp4d+a/3e9Dl8LPoCZ3T5hg7FvQRcR8nxt6XL7sAgv1MCZztOE+01P23cvmnPYzaxNtwuF4/AwAA//8k6OwC/QEAAA
              objectset.rio.cattle.io/id=9ad8d034-7562-44f2-aa18-3669ed27ef47

Phase:                       Completed
Total items to be restored:  33
Items restored:              33

Started:    2025-10-15 15:35:26 -0700 PDT
Completed:  2025-10-15 15:36:34 -0700 PDT

Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    grafana-restore:  could not restore, ConfigMap:elasticsearch-es-transport-ca-internal already exists. Warning: the in-cluster version is different than the backed-up version
                      could not restore, ConfigMap:kube-root-ca.crt already exists. Warning: the in-cluster version is different than the backed-up version

Backup:  grafana-test

Namespaces:
  Included:  grafana
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  grafana=grafana-restore

Label selector:  <none>

Or label selector:  <none>

Restore PVs:  true

CSI Snapshot Restores:
  grafana-restore/central-grafana:
    Data Movement:
      Operation ID: dd-ffa56e1c-9fd0-44b4-a8bb-8163f40a49e9.330b82fc-ca6a-423217ee5
      Data Mover: velero
      Uploader Type: kopia

Existing Resource Policy:   <none>
ItemOperationTimeout:       4h0m0s

Preserve Service NodePorts:  auto

Restore Item Operations:
  Operation for persistentvolumeclaims grafana-restore/central-grafana:
    Restore Item Action Plugin:  velero.io/csi-pvc-restorer
    Operation ID:                dd-ffa56e1c-9fd0-44b4-a8bb-8163f40a49e9.330b82fc-ca6a-423217ee5
    Phase:                       Completed
    Progress:                    856284762 of 856284762 complete (Bytes)
    Progress description:        Completed
    Created:                     2025-10-15 15:35:28 -0700 PDT
    Started:                     2025-10-15 15:36:06 -0700 PDT
    Updated:                     2025-10-15 15:36:26 -0700 PDT

HooksAttempted:   0
HooksFailed:      0

Resource List:
  apps/v1/Deployment:
    - grafana-restore/central-grafana(created)
    - grafana-restore/grafana-debug(created)
  apps/v1/ReplicaSet:
    - grafana-restore/central-grafana-5448b9f65(created)
    - grafana-restore/central-grafana-56887c6cb6(created)
    - grafana-restore/central-grafana-56ddd4f497(created)
    - grafana-restore/central-grafana-5f4757844b(created)
    - grafana-restore/central-grafana-5f69f86c85(created)
    - grafana-restore/central-grafana-64545dcdc(created)
    - grafana-restore/central-grafana-69c66c54d9(created)
    - grafana-restore/central-grafana-6c8d6f65b8(created)
    - grafana-restore/central-grafana-7b479f79ff(created)
    - grafana-restore/central-grafana-bc7d96cdd(created)
    - grafana-restore/central-grafana-cb88bd49c(created)
    - grafana-restore/grafana-debug-556845ff7b(created)
    - grafana-restore/grafana-debug-6fb594cb5f(created)
    - grafana-restore/grafana-debug-8f66bfbf6(created)
  discovery.k8s.io/v1/EndpointSlice:
    - grafana-restore/central-grafana-hkgd5(created)
  networking.k8s.io/v1/Ingress:
    - grafana-restore/central-grafana(created)
  rbac.authorization.k8s.io/v1/Role:
    - grafana-restore/central-grafana(created)
  rbac.authorization.k8s.io/v1/RoleBinding:
    - grafana-restore/central-grafana(created)
  v1/ConfigMap:
    - grafana-restore/central-grafana(created)
    - grafana-restore/elasticsearch-es-transport-ca-internal(failed)
    - grafana-restore/kube-root-ca.crt(failed)
  v1/Endpoints:
    - grafana-restore/central-grafana(created)
  v1/PersistentVolume:
    - pvc-e3f6578f-08b2-4e79-85f0-76bbf8985b55(skipped)
  v1/PersistentVolumeClaim:
    - grafana-restore/central-grafana(created)
  v1/Pod:
    - grafana-restore/central-grafana-cb88bd49c-fc5br(created)
  v1/Secret:
    - grafana-restore/fpinfra-net-cf-cert(created)
    - grafana-restore/grafana(created)
  v1/Service:
    - grafana-restore/central-grafana(created)
  v1/ServiceAccount:
    - grafana-restore/central-grafana(created)
    - grafana-restore/default(skipped)
  velero.io/v2alpha1/DataUpload:
    - velero/grafana-test-nw7zj(skipped)

Image of working restore pod, with correct data in PV
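
To check the result from a script rather than by eye, the Phase line can be pulled out of the describe output. This is a minimal sketch that assumes the output layout shown above; `describe_phase` is a helper name made up for illustration and is not part of the Velero CLI.

```shell
# describe_phase: read `velero ... describe` output on stdin and print
# the value of the first top-level "Phase:" line (for example "Completed").
describe_phase() {
  awk '$1 == "Phase:" { print $2; exit }'
}

# Usage, with the same kubeconfig and object names as in the post:
#   velero --kubeconfig k8s_configs/production.yaml backup describe grafana-test | describe_phase
#   velero --kubeconfig k8s_configs/production.yaml restore describe restore-grafana-test | describe_phase
```

The restore output also contains an indented Phase line under Restore Item Operations; the helper stops at the first match, which is the top-level one.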

Velero installed from helm: https://vmware-tanzu.github.io/helm-charts
Version: velero:11.1.0
Values

---
image:
  repository: velero/velero
  tag: v1.17.0

# Whether to deploy the restic daemonset.
deployNodeAgent: true

initContainers:
   - name: velero-plugin-for-aws
     image: velero/velero-plugin-for-aws:latest
     imagePullPolicy: IfNotPresent
     volumeMounts:
       - mountPath: /target
         name: plugins

configuration:
  defaultItemOperationTimeout: 2h
  features: EnableCSI
  defaultSnapshotMoveData: true

  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero
      config:
        region: us-east-1
        s3ForcePathStyle: true
        s3Url: https://s3.location

  # Destination VSL points to LINSTOR snapshot class
  volumeSnapshotLocation:
    - name: linstor
      provider: velero.io/csi
      config:
        snapshotClass: linstor-vsc

credentials:
  useSecret: true
  existingSecret: velero-user


metrics:
  enabled: true

  serviceMonitor:
    enabled: true

  prometheusRule:
    enabled: true
    # Additional labels to add to deployed PrometheusRule
    additionalLabels: {}
    # PrometheusRule namespace. Defaults to Velero namespace.
    # namespace: ""
    # Rules to be deployed
    spec:
      - alert: VeleroBackupPartialFailures
        annotations:
          message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} partially failed backups.
        expr: |-
          velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
        for: 15m
        labels:
          severity: warning
      - alert: VeleroBackupFailures
        annotations:
          message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups.
        expr: |-
          velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
        for: 15m
        labels:
          severity: warning

Also create the following.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: linstor-vsc
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: linstor.csi.linbit.com
deletionPolicy: Delete

We are using Piraeus operator to use xostor in k8s
https://github.com/piraeusdatastore/piraeus-operator.git
Version: v2.9.1
Values:

---
operator: 
  resources:
    requests:
      cpu: 250m
      memory: 500Mi
    limits:
      memory: 1Gi
installCRDs: true
imageConfigOverride:
- base: quay.io/piraeusdatastore
  components:
    linstor-satellite:
      image: piraeus-server
      tag: v1.29.0
tls:
  certManagerIssuerRef:
    name: step-issuer
    kind: StepClusterIssuer
    group: certmanager.step.sm

Then we just connect to the XOSTOR cluster as an external LINSTOR controller.
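
For anyone wondering what that connection looks like, here is a rough sketch that prints a LinstorCluster manifest pointing at an external controller. The field names (`spec.externalController.url`) and the resource name `linstorcluster` follow my recollection of the Piraeus operator v2 documentation, so treat them as assumptions and check them against the operator version you run. `render_linstor_cluster` is a helper name invented here.

```shell
# render_linstor_cluster: print a LinstorCluster manifest that points the
# Piraeus operator at an existing (external) LINSTOR controller.
# $1 is the controller URL.
render_linstor_cluster() {
  cat <<EOF
apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  name: linstorcluster
spec:
  externalController:
    url: $1
EOF
}

# Usage (IP stands for the XOSTOR/LINSTOR controller address):
#   render_linstor_cluster http://IP:3370 | kubectl apply -f -
```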

Jonathon

@stormi

The problem was the yum cache. If I ran yum update right after yum update xcp-ng-release-linstor, it would still fail. To get it working right away, I did the following:

yum update xcp-ng-release-linstor
yum clean all
yum update
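
The same three steps can be wrapped so they always run in this order and stop at the first failure. This is only a sketch; `run`, `update_with_fresh_cache` and the `DRY_RUN` variable are names introduced here for illustration.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# update_with_fresh_cache: the three steps from the post, in order.
# Each step runs only if the previous one succeeded.
update_with_fresh_cache() {
  run yum update xcp-ng-release-linstor &&
    run yum clean all &&
    run yum update
}

# Preview without touching the host:
#   DRY_RUN=1 update_with_fresh_cache
```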

Jonathon

OK, I figured it out! I made an init container that reads a manually created node label from the node the pod is running on. The label value is the bare-metal host for that k8s node. The init container then takes that value, generates a wrapper script, and calls linstor-csi with the correct values. After these changes, all the linstor-csi containers run with no errors.

The current problem comes from deploying and using a storage class. I started with a basic one that failed, and realized I did not know the correct storage_pool_name, so I went to http://IP:3370/v1/nodes/NODE/storage-pools and http://IP:3370/v1/nodes/NODE to get the information.

Still troubleshooting, but wanted to provide info.
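
To list just the pool names from that endpoint, something like the following can help. It is a sketch that assumes each object in the JSON response carries a `storage_pool_name` string field; `pool_names` is a helper name made up here, and the text matching is deliberately crude so it works without extra tools.

```shell
# pool_names: read the JSON returned by /v1/nodes/NODE/storage-pools on
# stdin and print one storage pool name per line.
# Assumes each pool object has a "storage_pool_name" string field.
pool_names() {
  grep -o '"storage_pool_name"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/.*"\([^"]*\)"$/\1/'
}

# Usage (IP and NODE are placeholders, as in the URLs above):
#   curl -s http://IP:3370/v1/nodes/NODE/storage-pools | pool_names
```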

Jonathon

@Mathieu-L Yes, the hosts had been restarted after the updates were installed which included 9.2.16

Jonathon

@Mathieu-L https://xcp-ng.org/forum/post/106873 I had increased it to 16G and it still crashed.

Jonathon

@andrewperry I migrated our rancher management cluster from the original rke to a new rke2 cluster using this plan not too long ago, so you should not have much trouble. Feel free to ask questions.

Jonathon

@nathanael-h Nice

If you have any questions, let me know; I have been using this for all our on-prem clusters for a while now.

Jonathon

I do not have any asks ATM, but I thought I would just share my plan that I use to create k8s clusters that we have been using for a while now.

It has grown over time and may be a bit messy, but I figured it was better than nothing. We use this for rke2 rancher k8s clusters deployed onto our xcp-ng cluster. We use xostor for drives, and the vlan5 network is for the piraeus operator to use for pv. We also use IPVS. We are using a rocky linux 9 vm template.

If these are useful to anyone and they have questions I will do my best to answer.

variable "pool" {
  default = "OVBH-PROD-XENPOOL04"
}

variable "network0" {
  default = "Native vRack"
}
variable "network1" {
  default = "VLAN80"
}
variable "network2" {
  default = "VLAN5"
}

variable "cluster_name" {
  default = "Production K8s Cluster"
}

variable "enrollment_command" {
  default = "curl -fL https://rancher.<redacted>.net/system-agent-install.sh | sudo  sh -s - --server https://rancher.<redacted>.net --label 'cattle.io/os=linux' --token <redacted>"
}


variable "node_type" {
  description = "Node type flag"
  default = {
    "1" = "--etcd --controlplane",
    "2" = "--etcd --controlplane",
    "3" = "--etcd --controlplane",
    "4" = "--worker",
    "5" = "--worker",
    "6" = "--worker",
    "7" = "--worker --taints smtp=true:NoSchedule",
    "8" = "--worker --taints smtp=true:NoSchedule",
    "9" = "--worker --taints smtp=true:NoSchedule"
  }
}
variable "node_networks" {
  description = "Node network flag"
  default = {
    "1" = "--internal-address 10.1.8.100 --address <redacted>",
    "2" = "--internal-address 10.1.8.101 --address <redacted>",
    "3" = "--internal-address 10.1.8.102 --address <redacted>",
    "4" = "--internal-address 10.1.8.103 --address <redacted>",
    "5" = "--internal-address 10.1.8.104 --address <redacted>",
    "6" = "--internal-address 10.1.8.105 --address <redacted>",
    "7" = "--internal-address 10.1.8.106 --address <redacted>",
    "8" = "--internal-address 10.1.8.107 --address <redacted>",
    "9" = "--internal-address 10.1.8.108 --address <redacted>"
  }
}


variable "vm_name" {
  description = "VM name for each node"
  default = {
    "1" = "OVBH-VPROD-K8S01-MASTER01",
    "2" = "OVBH-VPROD-K8S01-MASTER02",
    "3" = "OVBH-VPROD-K8S01-MASTER03",
    "4" = "OVBH-VPROD-K8S01-WORKER01",
    "5" = "OVBH-VPROD-K8S01-WORKER02",
    "6" = "OVBH-VPROD-K8S01-WORKER03",
    "7" = "OVBH-VPROD-K8S01-WORKER04",
    "8" = "OVBH-VPROD-K8S01-WORKER05",
    "9" = "OVBH-VPROD-K8S01-WORKER06"
  }
}

variable "preferred_host" {
  default = {
    "1" = "85838113-e4b8-4520-9f6d-8f3cf554c8f1",
    "2" = "783c27ac-2dcb-4798-9ca8-27f5f30791f6",
    "3" = "c03e1a45-4c4c-46f5-a2a1-d8de2e22a866",
    "4" = "85838113-e4b8-4520-9f6d-8f3cf554c8f1",
    "5" = "783c27ac-2dcb-4798-9ca8-27f5f30791f6",
    "6" = "c03e1a45-4c4c-46f5-a2a1-d8de2e22a866",
    "7" = "85838113-e4b8-4520-9f6d-8f3cf554c8f1",
    "8" = "783c27ac-2dcb-4798-9ca8-27f5f30791f6",
    "9" = "c03e1a45-4c4c-46f5-a2a1-d8de2e22a866"
  }
}

variable "xoa_admin_password" {
}

variable "host_count" {
  description = "All drives go to xostor"
  default = {
    "1" = "479ca676-20a1-4051-7189-a4a9ca47e00d",
    "2" = "479ca676-20a1-4051-7189-a4a9ca47e00d",
    "3" = "479ca676-20a1-4051-7189-a4a9ca47e00d",
    "4" = "479ca676-20a1-4051-7189-a4a9ca47e00d",
    "5" = "479ca676-20a1-4051-7189-a4a9ca47e00d",
    "6" = "479ca676-20a1-4051-7189-a4a9ca47e00d",
    "7" = "479ca676-20a1-4051-7189-a4a9ca47e00d",
    "8" = "479ca676-20a1-4051-7189-a4a9ca47e00d",
    "9" = "479ca676-20a1-4051-7189-a4a9ca47e00d"
  }
}

variable "network1_ip_mapping" {
  description = "Mapping for network1 ips, vlan80"
  default = {
    "1" = "10.1.8.100",
    "2" = "10.1.8.101",
    "3" = "10.1.8.102",
    "4" = "10.1.8.103",
    "5" = "10.1.8.104",
    "6" = "10.1.8.105",
    "7" = "10.1.8.106",
    "8" = "10.1.8.107",
    "9" = "10.1.8.108"
  }
}

variable "network1_gateway" {
  description = "Mapping for public ip gateways, from hosts"
  default     = "10.1.8.1"
}

variable "network1_prefix" {
  description = "Prefix for the network used"
  default     = "22"
}

variable "network2_ip_mapping" {
  description = "Mapping for network2 ips, VLAN5"
  default = {
    "1" = "10.2.5.30",
    "2" = "10.2.5.31",
    "3" = "10.2.5.32",
    "4" = "10.2.5.33",
    "5" = "10.2.5.34",
    "6" = "10.2.5.35",
    "7" = "10.2.5.36",
    "8" = "10.2.5.37",
    "9" = "10.2.5.38"
  }
}


variable "network2_prefix" {
  description = "Prefix for the network used"
  default     = "22"
}

variable "network0_ip_mapping" {
  description = "Mapping for network0 ips, public"
  default = {
<redacted>
  }
}

variable "network0_gateway" {
  description = "Mapping for public ip gateways, from hosts"
  default = {
<redacted>
  }
}

variable "network0_prefix" {
  description = "Prefix for the network used"
  default = {
<redacted>
  }
}

# Instruct terraform to download the provider on `terraform init`
terraform {
  required_providers {
    xenorchestra = {
      source  = "vatesfr/xenorchestra"
      version = "~> 0.29.0"
    }
  }
}

# Configure the XenServer Provider
provider "xenorchestra" {
  # Must be ws or wss
  url      = "ws://10.2.0.5"        # Or set XOA_URL environment variable
  username = "admin@admin.net"      # Or set XOA_USER environment variable
  password = var.xoa_admin_password # Or set XOA_PASSWORD environment variable
}

data "xenorchestra_pool" "pool" {
  name_label = var.pool
}

data "xenorchestra_template" "template" {
  name_label = "Rocky Linux 9 Template"
  pool_id    = data.xenorchestra_pool.pool.id
}

data "xenorchestra_network" "net1" {
  name_label = var.network1
  pool_id    = data.xenorchestra_pool.pool.id
}
data "xenorchestra_network" "net2" {
  name_label = var.network2
  pool_id    = data.xenorchestra_pool.pool.id
}
data "xenorchestra_network" "net0" {
  name_label = var.network0
  pool_id    = data.xenorchestra_pool.pool.id
}

resource "xenorchestra_cloud_config" "node" {
  count    = 9
  name     = "${lower(lookup(var.vm_name, count.index + 1))}_cloud_config"
  template = <<EOF

#cloud-config
ssh_authorized_keys:
  - ssh-rsa <redacted>

write_files:
  - path: /etc/NetworkManager/conf.d/rke2-canal.conf
    permissions: '0755'
    owner: root
    content: |
      [keyfile]
      unmanaged-devices=interface-name:cali*;interface-name:flannel*
  - path: /tmp/selinux_kmod_drbd.log
    permissions: '0640'
    owner: root
    content: |
      type=AVC msg=audit(1661803314.183:778): avc:  denied  { module_load } for  pid=148256 comm="insmod" path="/tmp/ko/drbd.ko" dev="overlay" ino=101839829 scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:object_r:var_lib_t:s0 tclass=system permissive=0
      type=AVC msg=audit(1661803314.185:779): avc:  denied  { module_load } for  pid=148257 comm="insmod" path="/tmp/ko/drbd_transport_tcp.ko" dev="overlay" ino=101839831 scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:object_r:var_lib_t:s0 tclass=system permissive=0
  - path: /etc/sysconfig/modules/ipvs.modules
    permissions: '0755'
    owner: root
    content: |
      #!/bin/bash
      modprobe -- ip_vs
      modprobe -- ip_vs_rr
      modprobe -- ip_vs_wrr
      modprobe -- ip_vs_sh
      modprobe -- nf_conntrack
  - path: /etc/modules-load.d/ipvs.conf
    permissions: '0755'
    owner: root
    content: |
      ip_vs
      ip_vs_rr
      ip_vs_wrr
      ip_vs_sh
      nf_conntrack

#cloud-init
runcmd:
  - sudo hostnamectl set-hostname --static ${lower(lookup(var.vm_name, count.index + 1))}.<redacted>.com
  - sudo hostnamectl set-hostname ${lower(lookup(var.vm_name, count.index + 1))}.<redacted>.com
  - nmcli -t -f NAME con show | xargs -d '\n' -I {} nmcli con delete "{}"
  - nmcli con add type ethernet con-name public ifname enX0
  - nmcli con mod public ipv4.address '${lookup(var.network0_ip_mapping, count.index + 1)}/${lookup(var.network0_prefix, count.index + 1)}'
  - nmcli con mod public ipv4.method manual
  - nmcli con mod public ipv4.ignore-auto-dns yes
  - nmcli con mod public ipv4.gateway '${lookup(var.network0_gateway, count.index + 1)}'
  - nmcli con mod public ipv4.dns "8.8.8.8 8.8.4.4"
  - nmcli con mod public connection.autoconnect true
  - nmcli con up public
  - nmcli con add type ethernet con-name vlan80 ifname enX1
  - nmcli con mod vlan80 ipv4.address '${lookup(var.network1_ip_mapping, count.index + 1)}/${var.network1_prefix}'
  - nmcli con mod vlan80 ipv4.method manual
  - nmcli con mod vlan80 ipv4.ignore-auto-dns yes
  - nmcli con mod vlan80 ipv4.ignore-auto-routes yes
  - nmcli con mod vlan80 ipv4.gateway '${var.network1_gateway}'
  - nmcli con mod vlan80 ipv4.dns "${var.network1_gateway}"
  - nmcli con mod vlan80 connection.autoconnect true
  - nmcli con mod vlan80 ipv4.never-default true
  - nmcli con mod vlan80 ipv6.never-default true
  - nmcli con mod vlan80 ipv4.routes "10.0.0.0/8 ${var.network1_gateway}"
  - nmcli con up vlan80
  - nmcli con add type ethernet con-name vlan5 ifname enX2
  - nmcli con mod vlan5 ipv4.address '${lookup(var.network2_ip_mapping, count.index + 1)}/${var.network2_prefix}'
  - nmcli con mod vlan5 ipv4.method manual
  - nmcli con mod vlan5 ipv4.ignore-auto-dns yes
  - nmcli con mod vlan5 ipv4.ignore-auto-routes yes
  - nmcli con mod vlan5 connection.autoconnect true
  - nmcli con mod vlan5 ipv4.never-default true
  - nmcli con mod vlan5 ipv6.never-default true
  - nmcli con up vlan5
  - systemctl restart NetworkManager
  - dnf upgrade -y
  - dnf install ipset ipvsadm -y
  - bash /etc/sysconfig/modules/ipvs.modules
  - dnf install chrony -y
  - sudo systemctl enable --now chronyd
  - yum install kernel-devel kernel-headers -y
  - yum install elfutils-libelf-devel -y
  - swapoff -a
  - modprobe -- ip_tables
  - systemctl disable --now firewalld.service
  - systemctl disable --now rngd
  - dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
  - dnf install containerd.io tar -y
  - dnf install policycoreutils-python-utils -y
  - cat /tmp/selinux_kmod_drbd.log | sudo audit2allow -M insmoddrbd
  - sudo semodule -i insmoddrbd.pp
  - ${var.enrollment_command} ${lookup(var.node_type, count.index + 1)} ${lookup(var.node_networks, count.index + 1)}

bootcmd:
  - swapoff -a
  - modprobe -- ip_tables
EOF
}

resource "xenorchestra_vm" "master" {
  count            = 3
  cpus             = 4
  memory_max       = 8589934592
  cloud_config     = xenorchestra_cloud_config.node[count.index].template
  name_label       = lookup(var.vm_name, count.index + 1)
  name_description = "${var.cluster_name} master"
  template         = data.xenorchestra_template.template.id
  auto_poweron     = true
  affinity_host    = lookup(var.preferred_host, count.index + 1)

  network {
    network_id = data.xenorchestra_network.net0.id
  }
  network {
    network_id = data.xenorchestra_network.net1.id
  }
  network {
    network_id = data.xenorchestra_network.net2.id
  }
  disk {
    sr_id      = lookup(var.host_count, count.index + 1)
    name_label = "Terraform_disk_imavo"
    size       = 107374182400
  }
}


resource "xenorchestra_vm" "worker" {
  count            = 3
  cpus             = 32
  memory_max       = 68719476736
  cloud_config     = xenorchestra_cloud_config.node[count.index + 3].template
  name_label       = lookup(var.vm_name, count.index + 3 + 1)
  name_description = "${var.cluster_name} worker"
  template         = data.xenorchestra_template.template.id
  auto_poweron     = true
  affinity_host    = lookup(var.preferred_host, count.index + 3 + 1)
  
  network {
    network_id = data.xenorchestra_network.net0.id
  }
  network {
    network_id = data.xenorchestra_network.net1.id
  }
  network {
    network_id = data.xenorchestra_network.net2.id
  }
  disk {
    sr_id      = lookup(var.host_count, count.index + 3 + 1)
    name_label = "Terraform_disk_imavo"
    size       = 322122547200
  }
}

resource "xenorchestra_vm" "smtp" {
  count            = 3
  cpus             = 4
  memory_max       = 8589934592
  cloud_config     = xenorchestra_cloud_config.node[count.index + 6].template
  name_label       = lookup(var.vm_name, count.index + 6 + 1)
  name_description = "${var.cluster_name} smtp worker"
  template         = data.xenorchestra_template.template.id
  auto_poweron     = true
  affinity_host    = lookup(var.preferred_host, count.index + 6 + 1)
  
  network {
    network_id = data.xenorchestra_network.net0.id
  }
  network {
    network_id = data.xenorchestra_network.net1.id
  }
  network {
    network_id = data.xenorchestra_network.net2.id
  }
  disk {
    sr_id      = lookup(var.host_count, count.index + 6 + 1)
    name_label = "Terraform_disk_imavo"
    size       = 53687091200
  }
}
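
The memory_max and disk size values in the plan are given in bytes. A small sketch for deriving them from GiB, in case you want different sizes; `gib_to_bytes` is a helper name introduced here.

```shell
# gib_to_bytes: convert whole GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 8     # memory_max of the master and smtp VMs
gib_to_bytes 64    # memory_max of the worker VMs
gib_to_bytes 100   # master disk size
gib_to_bytes 300   # worker disk size
gib_to_bytes 50    # smtp disk size
```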

Jonathon

@MajorP93 My mistake. No, I have not rebooted the hosts yet. https://xcp-ng.org/forum/post/107164 I am aware that a reboot is needed for it to be picked up; I went ahead and installed from testing so that it is already in place if a host crashes again.

Jonathon

For now I have upgraded all hosts with --enablerepo=xcp-ng-linstor-testing to get kmod-drbd.x86_64 0:9.2.18-2.0.xcpng8.3.
That way it is there to be picked up if a host crashes again.
https://github.com/xcp-ng-rpms/drbd/blob/master/SPECS/kmod-drbd.spec
https://koji.xcp-ng.org/buildinfo?buildID=5747

[12:15 ovbh-pprod-xen04 ~]# yum upgrade --enablerepo=xcp-ng-linstor-testing
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Excluding mirror: updates.xcp-ng.org
 * xcp-ng-base: mirrors.xcp-ng.org
Excluding mirror: updates.xcp-ng.org
 * xcp-ng-updates: mirrors.xcp-ng.org
grafana/signature                                                                                                                                                                                                                      |  810 B  00:00:00     
grafana/signature                                                                                                                                                                                                                      | 3.0 kB  00:00:00 !!! 
xcp-ng-linstor-testing/signature                                                                                                                                                                                                       |  473 B  00:00:00     
xcp-ng-linstor-testing/signature                                                                                                                                                                                                       | 3.0 kB  00:00:00 !!! 
zabbix                                                                                                                                                                                                                                 | 3.0 kB  00:00:00     
zabbix-non-supported                                                                                                                                                                                                                   | 2.9 kB  00:00:00     
(1/3): grafana/primary_db                                                                                                                                                                                                              | 879 kB  00:00:00     
(2/3): xcp-ng-linstor-testing/primary_db                                                                                                                                                                                               | 2.4 kB  00:00:00     
(3/3): zabbix/x86_64/primary_db                                                                                                                                                                                                        | 114 kB  00:00:00     
Resolving Dependencies
--> Running transaction check
---> Package kmod-drbd.x86_64 0:9.2.16-1.0.xcpng8.3 will be updated
---> Package kmod-drbd.x86_64 0:9.2.18-2.0.xcpng8.3 will be an update
---> Package zabbix-agent.x86_64 0:7.0.27-release1.el7 will be updated
---> Package zabbix-agent.x86_64 0:7.0.28-release1.el7 will be an update
--> Finished Dependency Resolution

Dependencies Resolved

==============================================================================================================================================================================================================================================================
 Package                                                    Arch                                                 Version                                                           Repository                                                            Size
==============================================================================================================================================================================================================================================================
Updating:
 kmod-drbd                                                  x86_64                                               9.2.18-2.0.xcpng8.3                                               xcp-ng-linstor-testing                                               2.8 M
 zabbix-agent                                               x86_64                                               7.0.28-release1.el7                                               zabbix                                                               660 k

Transaction Summary
==============================================================================================================================================================================================================================================================
Upgrade  2 Packages

Total download size: 3.4 M
Is this ok [y/d/N]: y
Downloading packages:
Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
(1/2): zabbix-agent-7.0.28-release1.el7.x86_64.rpm                                                                                                                                                                                     | 660 kB  00:00:00     
(2/2): kmod-drbd-9.2.18-2.0.xcpng8.3.x86_64.rpm                                                                                                                                                                                        | 2.8 MB  00:00:01     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                                                                                                                                         2.8 MB/s | 3.4 MB  00:00:01     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Updating   : kmod-drbd-9.2.18-2.0.xcpng8.3.x86_64                                                                                                                                                                                                       1/4 
  Updating   : zabbix-agent-7.0.28-release1.el7.x86_64                                                                                                                                                                                                    2/4 
  Cleanup    : kmod-drbd-9.2.16-1.0.xcpng8.3.x86_64                                                                                                                                                                                                       3/4 
  Cleanup    : zabbix-agent-7.0.27-release1.el7.x86_64                                                                                                                                                                                                    4/4 
  Verifying  : zabbix-agent-7.0.28-release1.el7.x86_64                                                                                                                                                                                                    1/4 
  Verifying  : kmod-drbd-9.2.18-2.0.xcpng8.3.x86_64                                                                                                                                                                                                       2/4 
  Verifying  : kmod-drbd-9.2.16-1.0.xcpng8.3.x86_64                                                                                                                                                                                                       3/4 
  Verifying  : zabbix-agent-7.0.27-release1.el7.x86_64                                                                                                                                                                                                    4/4 

Updated:
  kmod-drbd.x86_64 0:9.2.18-2.0.xcpng8.3                                                                                       zabbix-agent.x86_64 0:7.0.28-release1.el7                                                                                      

Complete!

Jonathon

@poddingue I'm fully up to date, but DRBD is still only at version 9.2.16

[14:57 ovbh-pprod-xen02 ~]# drbdadm --version
DRBDADM_BUILDTAG=GIT-hash:\ 71c8bcff6ea77a022b272a7eba649a774251bac4\ build\ by\ @buildsystem\,\ 2025-11-03\ 10:21:36
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090210
DRBD_KERNEL_VERSION=9.2.16
DRBDADM_VERSION_CODE=0x092100
DRBDADM_VERSION=9.33.0
[14:57 ovbh-pprod-xen02 ~]# drbd-reactor --version
drbd-reactor 1.9.0
[14:57 ovbh-pprod-xen02 ~]#
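
As a cross-check, the hexadecimal version codes in that output agree with the human-readable versions. A minimal sketch, assuming one byte each for major, minor, and patch (inferred from the values above, not taken from DRBD documentation); the function names are mine:

```python
def decode_version_code(code: int) -> str:
    """Split a packed version code into major.minor.patch (one byte each, assumed)."""
    major = (code >> 16) & 0xFF
    minor = (code >> 8) & 0xFF
    patch = code & 0xFF
    return f"{major}.{minor}.{patch}"


def encode_version(major: int, minor: int, patch: int) -> int:
    """Inverse of decode_version_code, handy for comparing against a target release."""
    return (major << 16) | (minor << 8) | patch


print(decode_version_code(0x090210))  # DRBD_KERNEL_VERSION_CODE -> 9.2.16
print(decode_version_code(0x092100))  # DRBDADM_VERSION_CODE     -> 9.33.0

# The loaded kernel module is older than the 9.2.19 release discussed in this thread.
print(0x090210 < encode_version(9, 2, 19))  # True
```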

There are no DRBD updates that I can install; the only pending package is zabbix-agent

[14:08 ovbh-pprod-xen02 ~]# yum check-update
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Excluding mirror: updates.xcp-ng.org
 * xcp-ng-base: mirrors.xcp-ng.org
Excluding mirror: updates.xcp-ng.org
 * xcp-ng-updates: mirrors.xcp-ng.org
zabbix                                                                                                                                                                                                                                 | 3.0 kB  00:00:00     
zabbix-non-supported                                                                                                                                                                                                                   | 2.9 kB  00:00:00     
zabbix/x86_64/primary_db                                                                                                                                                                                                               | 114 kB  00:00:00     

zabbix-agent.x86_64                                                                                                         7.0.28-release1.el7                                                                                                         zabbix

Jonathon

@MajorP93 No, we do not have pro support at the moment.

Hmmm. It really seems like a build with 9.2.19 could fix all of this.
How can I request a build, @poddingue?

https://github.com/LINBIT/drbd/blob/drbd-9.2.19/ChangeLog#L61
9.2.17

Fix a kernel crash triggered by a crafted/invalid netlink message

https://github.com/LINBIT/drbd/blob/drbd-9.2.19/ChangeLog#L24
9.2.19

Fix an AB-BA deadlock between online resize and activity-log transactions

Fix several races during connection teardown that could crash or
hang (ack_sender requeue, pending ping work, in-progress lb-tcp
connect)

Jonathon

The logs from xen04 look the same too, which makes sense.

20:01:53.805: The Linstor DeviceManager on xen04 triggered an asynchronous alignment for a newly created volume (pvc-fac0a18b-2c23-4fe8-b005-b4f492cec92f), padding it from 26220040 KiB to 26222592 KiB.

20:01:54.034: Simultaneously, an orchestrator (like Velero) began tearing the volume down, shifting the DRBD state to conn( Connected -> TearDown ) and pdsk( Diskless -> DUnknown ).

20:01:58.484: The DRBD worker thread evaluated the shrinking bitmap while the alignment was trying to alter it. This caused the general protection fault: 0000 [#1] SMP NOPTI at instruction RIP: e030:drbd_bm_count_bits+0x223/0x300 [drbd], crashing the kernel.

Jonathon

I was just moving the Velero backup schedules around when Velero decided to run some of them again, even though they were not due. It crashed xen04 this time. I am deleting the possibly offending backup schedule until this bug can be identified and fixed.

Uploaded new bug bundle to nextcloud too.

Jonathon

My first idea was to have Velero use a different storage class, something like:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-ephemeral-retain
parameters:
  linstor.csi.linbit.com/allowRemoteVolumeAccess: "true"
  linstor.csi.linbit.com/disklessOnRemaining: "true"
  linstor.csi.linbit.com/disklessStoragePool: DfltDisklessStorPool
  linstor.csi.linbit.com/placementCount: "3"
provisioner: linstor.csi.linbit.com
reclaimPolicy: Retain  # the setting in question
volumeBindingMode: WaitForFirstConsumer

Then have a cronjob collect the Released PVs every hour. The problem is that this cannot work, because Velero uses the storage class of the backed-up PVC. So I would have to change every LINSTOR storage class to Retain, which would effectively disable automatic garbage collection for the cluster. That might be fine, but I would love to avoid it.
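
For reference, the selection step of such a cronjob could look roughly like this. It is only a sketch: it assumes the input is the parsed output of `kubectl get pv -o json`, the function name and sample data are hypothetical, and what the job then does with the matches is left out on purpose.

```python
def released_pvs(pv_list: dict, storage_class: str) -> list[str]:
    """Return the names of PVs in phase Released that belong to storage_class."""
    names = []
    for pv in pv_list.get("items", []):
        phase = pv.get("status", {}).get("phase")
        sc = pv.get("spec", {}).get("storageClassName")
        if phase == "Released" and sc == storage_class:
            names.append(pv["metadata"]["name"])
    return names


# Hypothetical sample shaped like `kubectl get pv -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "pv-a"},
         "spec": {"storageClassName": "linstor-ephemeral-retain"},
         "status": {"phase": "Released"}},
        {"metadata": {"name": "pv-b"},
         "spec": {"storageClassName": "linstor-ephemeral-retain"},
         "status": {"phase": "Bound"}},
        {"metadata": {"name": "pv-c"},
         "spec": {"storageClassName": "some-other-class"},
         "status": {"phase": "Released"}},
    ]
}

print(released_pvs(sample, "linstor-ephemeral-retain"))  # ['pv-a']
```

Filtering on the storage class name keeps the job from touching Released PVs that belong to anything else in the cluster.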

My second idea is to eliminate the need for the resize altogether, as another workaround to hopefully get away from this race condition/bug.
I am not seeing any way to modify the storage class to round up or pad the PV, which would be an attempt to eliminate the need for drbd_bm_resize.
I am currently looking into a Kyverno mutating webhook to pad these volumes on the fly, to see if that would help at all.
All PVs are already whole numbers of Gi.
However, Velero sizes the temporary clone PVC from the exact byte count of the used/allocated snapshot metadata, not from the Gi value in the original PVC's spec.
I just confirmed this with a test.

Jonathon

Looks to be similar to last time

| Time | Node | Component / Service | Logged Event & Impact |
|---|---|---|---|
| 23:17:13 | xen01 | LINSTOR/Controller | Resource Created: linstor-csi provisions a short-lived volume: pvc-395da199-d455-4004-ac0d-f6c6016ebf4e. |
| 23:17:16 | xen01 | kernel (drbd) | Sync Begins: xen01 connects to xen05 as a SyncSource and begins synchronizing the volume to it. |
| 23:17:49 | xen01 | LINSTOR/Controller | Teardown Requested: The CSI plugin changes its mind or finishes its task, requesting sequential deletion (DelRsc) of the volume replicas across the cluster. |
| 23:17:51 | xen01 | LINSTOR/Controller | Source Disk Removed: Linstor removes the volume from xen01 itself (Toggle Disk ... removing disk). |
| 23:17:52 | xen01 | kernel (drbd) | Source Goes Diskless: xen01 detaches the underlying storage block (UpToDate -> Detaching -> Diskless). It alerts peers it can no longer satisfy read requests, calling drbd_bm_resize with capacity == 0. |
| 23:17:53 | xen05 | LINSTOR/Satellite | While the cluster connection is actively tearing down, xen05's local background DeviceManager wakes up and detects the volume. It attempts an online re-alignment to round up the block layer size to match the 4MiB extent rule (26220040 KiB to 26222592 KiB). |
| 23:17:54 | xen05 | kernel (drbd) | Fatal Crash: xen05's DRBD driver tries to resize and count its synchronization bitmap block pointers. Because its peer source (xen01) dropped to zero capacity mid-flight, DRBD hits memory corruption. It triggers a General Protection Fault at drbd_bm_count_bits+0x223/0x300 and panics the kernel. |
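
The padding in the 23:17:53 entry is consistent with rounding up to the next 4 MiB boundary. A minimal sketch of that arithmetic, assuming the rule is a plain round-up to a multiple of 4096 KiB (the helper name is hypothetical and this is not LINSTOR's code):

```python
EXTENT_KIB = 4 * 1024  # 4 MiB expressed in KiB


def round_up_to_extent(size_kib: int, extent_kib: int = EXTENT_KIB) -> int:
    """Round size_kib up to the next multiple of extent_kib (ceiling division)."""
    return -(-size_kib // extent_kib) * extent_kib


before = 26220040  # size from the log, in KiB
after = round_up_to_extent(before)

print(after)           # 26222592, matching the log
print(after - before)  # 2552 KiB of padding
```

So any volume whose size is not already a multiple of 4096 KiB would go through this resize, which fits with the clone PVCs being sized from an exact byte count.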

Jonathon

@Jonathon