XOSTOR hyperconvergence preview

TheiLLeniumStudios

@ronan-a said in XOSTOR hyperconvergence preview:

Did you just execute a xe sr-forget command on the SR? In this case the volumes are not removed. xe sr-destroy must be used to remove the volumes. So you can execute xe sr-introduce and then xe sr-destroy to clean your hosts.

I did it using the XO interface. Now it doesn't show up, and when I tried your suggestion of running xe sr-introduce, it just created an "Unknown" SR and didn't link it to the previous one. Running xe sr-create doesn't help either, since it errors out with LINSTOR SR creation error [opterr=LINSTOR SR must be unique in a pool]

Can you elaborate on the steps for reintroducing a lost SR that is backed by LINSTOR? I ran this command:

 xe sr-introduce type=linstor name-label=XOSTOR uuid=41ba4c11-8c13-30b3-fcbb-7668a39825a6

The UUID is the original UUID of the XOSTOR SR, which I found in my command history. Running xe sr-destroy after the above introduce fails with "The SR has no attached PBDs". My disks look like this at the moment:

NAME                                                                                              MAJ:MIN  RM   SIZE RO TYPE MOUNTPOINT
drbd1016                                                                                          147:1016  0    10G  0 disk
drbd1014                                                                                          147:1014  0    10G  0 disk
sdb                                                                                                 8:16    0 238.5G  0 disk
|-linstor_group-thin_device_tmeta                                                                 253:1     0   120M  0 lvm
| `-linstor_group-thin_device-tpool                                                               253:3     0 238.2G  0 lvm
|   |-linstor_group-xcp--persistent--redo--log_00000                                              253:10    0   260M  0 lvm
|   | `-drbd1002                                                                                  147:1002  0 259.7M  0 disk
|   |-linstor_group-xcp--persistent--database_00000                                               253:8     0     1G  0 lvm
|   | `-drbd1000                                                                                  147:1000  0     1G  0 disk /var/lib/linstor
|   |-linstor_group-thin_device                                                                   253:4     0 238.2G  0 lvm
|   |-linstor_group-xcp--volume--13a94a7a--d433--4426--8232--812e3c6dc52e_00000                   253:11    0    10G  0 lvm
|   | `-drbd1004                                                                                  147:1004  0    10G  0 disk
|   |-linstor_group-xcp--persistent--ha--statefile_00000                                          253:9     0     8M  0 lvm
|   | `-drbd1001                                                                                  147:1001  0     8M  0 disk
|   |-linstor_group-xcp--volume--70bf80a2--a008--469a--a7db--0ea92fcfc392_00000                   253:5     0    20M  0 lvm
|   | `-drbd1009                                                                                  147:1009  0    20M  0 disk
|   `-linstor_group-xcp--volume--4b70d69b--9cca--4aa3--842f--09366ac76901_00000                   253:12    0    10G  0 lvm
|     `-drbd1006                                                                                  147:1006  0    10G  0 disk
`-linstor_group-thin_device_tdata                                                                 253:2     0 238.2G  0 lvm
  `-linstor_group-thin_device-tpool                                                               253:3     0 238.2G  0 lvm
    |-linstor_group-xcp--persistent--redo--log_00000                                              253:10    0   260M  0 lvm
    | `-drbd1002                                                                                  147:1002  0 259.7M  0 disk
    |-linstor_group-xcp--persistent--database_00000                                               253:8     0     1G  0 lvm
    | `-drbd1000                                                                                  147:1000  0     1G  0 disk /var/lib/linstor
    |-linstor_group-thin_device                                                                   253:4     0 238.2G  0 lvm
    |-linstor_group-xcp--volume--13a94a7a--d433--4426--8232--812e3c6dc52e_00000                   253:11    0    10G  0 lvm
    | `-drbd1004                                                                                  147:1004  0    10G  0 disk
    |-linstor_group-xcp--persistent--ha--statefile_00000                                          253:9     0     8M  0 lvm
    | `-drbd1001                                                                                  147:1001  0     8M  0 disk
    |-linstor_group-xcp--volume--70bf80a2--a008--469a--a7db--0ea92fcfc392_00000                   253:5     0    20M  0 lvm
    | `-drbd1009                                                                                  147:1009  0    20M  0 disk
    `-linstor_group-xcp--volume--4b70d69b--9cca--4aa3--842f--09366ac76901_00000                   253:12    0    10G  0 lvm
      `-drbd1006                                                                                  147:1006  0    10G  0 disk
drbd1012                                                                                          147:1012  0    10G  0 disk
tda                                                                                               254:0     0    10G  0 disk
drbd1015                                                                                          147:1015  0    10G  0 disk
drbd1005                                                                                          147:1005  0    20M  0 disk
sda                                                                                                 8:0     0 223.6G  0 disk
|-sda4                                                                                              8:4     0   512M  0 part /boot/efi
|-sda2                                                                                              8:2     0    18G  0 part
|-sda5                                                                                              8:5     0     4G  0 part /var/log
|-sda3                                                                                              8:3     0 182.1G  0 part
| `-XSLocalEXT--712c1f83--d11f--ae07--d2b8--14a823761e6e-712c1f83--d11f--ae07--d2b8--14a823761e6e 253:0     0 182.1G  0 lvm  /run/sr-mount/712c1f83-d11f-ae07-d2b8-14a823761e6e
|-sda1                                                                                              8:1     0    18G  0 part /
`-sda6                                                                                              8:6     0     1G  0 part [SWAP]
tdb                                                                                               254:1     0    50G  0 disk

@ronan-a said in XOSTOR hyperconvergence preview:

What do you mean? What's the performance problem? When the VM starts ? During execution?

So, with XOSTOR created and the pool made HA with that SR, whenever I create a new VM in that SR and don't choose an affinity host, it takes at least 10-15 minutes to run a migrate task, which shouldn't be necessary because XOSTOR is shared, right? Or is my assumption not correct? And once the job is done, it doesn't even start all the VMs, and some even disappeared from XO. I was creating the VMs using terraform, which spun up 6 of them at a time, and since the migrate task started for all of them, I saw the error TOO_MANY_STORAGE_MIGRATES. Not really sure what's going on.

I first thought the cause was my template VDI not being on XOSTOR, but I reuploaded the cloud image to XOSTOR and still got the same behavior. And I'm not even using spinning disks: 2 of the hosts have NVMe drives and 1 of the hosts involved in XOSTOR has an mSATA one.

ronan-a

@TheiLLeniumStudios A better explanation: to reintroduce a LINSTOR SR, you can use these commands with your own parameters:

# Generate a new SR UUID.

[10:18 r620-s1 ~]# uuidgen
345adcd2-aa2b-44ad-9c25-788cf870db72

[10:18 r620-s1 ~]# xe sr-introduce uuid=345adcd2-aa2b-44ad-9c25-788cf870db72 type=linstor name-label="XOSTOR" content-type=user
345adcd2-aa2b-44ad-9c25-788cf870db72

# Get host UUIDs.
[10:18 r620-s1 ~]# xe host-list
uuid ( RO)                : 888254e8-da05-4f86-ad37-979b8d6bad04
          name-label ( RW): R620-S2
    name-description ( RW): Default install


uuid ( RO)                : c96ec4dd-28ac-4df4-b73c-4371bd202728
          name-label ( RW): R620-S1
    name-description ( RW): Default install


uuid ( RO)                : ddcd3461-7052-4f5e-932c-e1ed75c192d6
          name-label ( RW): R620-S3
    name-description ( RW): Default install

# Create the PBDs using the same old config.

[10:19 r620-s1 ~]# xe pbd-create sr-uuid=345adcd2-aa2b-44ad-9c25-788cf870db72 host-uuid=c96ec4dd-28ac-4df4-b73c-4371bd202728 device-config:hosts=r620-s1,r620-s2,r620-s3 device-config:group-name=linstor_group/thin_device device-config:redundancy=2 device-config:provisioning=thin
1c5c030a-1823-d53a-d8df-6c50af6beb2b

[10:19 r620-s1 ~]# xe pbd-create sr-uuid=345adcd2-aa2b-44ad-9c25-788cf870db72 host-uuid=888254e8-da05-4f86-ad37-979b8d6bad04 device-config:hosts=r620-s1,r620-s2,r620-s3 device-config:group-name=linstor_group/thin_device device-config:redundancy=2 device-config:provisioning=thin
4c5df60a-f96d-19c2-44b0-f5951388d502

[10:20 r620-s1 ~]# xe pbd-create sr-uuid=345adcd2-aa2b-44ad-9c25-788cf870db72 host-uuid=ddcd3461-7052-4f5e-932c-e1ed75c192d6 device-config:hosts=r620-s1,r620-s2,r620-s3 device-config:group-name=linstor_group/thin_device device-config:redundancy=2 device-config:provisioning=thin
584d033c-7bad-ebc8-30dd-1888ea2bea29
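
Since the three pbd-create calls differ only in host-uuid, they can be generated rather than typed by hand. The following is a minimal Python sketch and not part of the original procedure: the helper name pbd_create_commands is hypothetical, and it only builds the command strings (it does not run xe), reusing the UUIDs and device-config values shown above.

```python
def pbd_create_commands(sr_uuid, host_uuids, device_config):
    """Return one `xe pbd-create` command line per host (hypothetical helper).

    Nothing is executed here; the strings are meant to be reviewed and
    then pasted into a shell on a host of the pool.
    """
    # Every PBD carries the same device-config as the original SR.
    config_args = " ".join(
        "device-config:{}={}".format(key, value)
        for key, value in device_config.items()
    )
    return [
        "xe pbd-create sr-uuid={} host-uuid={} {}".format(
            sr_uuid, host_uuid, config_args
        )
        for host_uuid in host_uuids
    ]


# Values copied from the example session above.
SR_UUID = "345adcd2-aa2b-44ad-9c25-788cf870db72"
HOST_UUIDS = [
    "c96ec4dd-28ac-4df4-b73c-4371bd202728",  # R620-S1
    "888254e8-da05-4f86-ad37-979b8d6bad04",  # R620-S2
    "ddcd3461-7052-4f5e-932c-e1ed75c192d6",  # R620-S3
]
DEVICE_CONFIG = {
    "hosts": "r620-s1,r620-s2,r620-s3",
    "group-name": "linstor_group/thin_device",
    "redundancy": "2",
    "provisioning": "thin",
}

for command in pbd_create_commands(SR_UUID, HOST_UUIDS, DEVICE_CONFIG):
    print(command)
```

Replace the UUIDs and the device-config values with the ones from your own pool before using the generated lines.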

If you don't know your group name, you can use vgs/lvs. In my case I use thin provisioning, so I have an associated thin volume:

[10:21 r620-s1 ~]# vgs
  VG            #PV #LV #SN Attr   VSize   VFree
  linstor_group   1   5   0 wz--n- 931.51g    0
[10:21 r620-s1 ~]# lvs
  LV                                                    VG            Attr       LSize    Pool        Origin Data%  Meta%  Move Log Cpy%Sync Convert
  thin_device                                           linstor_group twi-aotz-- <931.28g                    0.29   10.55
  ...

I did it using the XO interface. Now it doesn't show up, and when I tried your suggestion of running xe sr-introduce, it just created an "Unknown" SR and didn't link it to the previous one. Running xe sr-create doesn't help either, since it errors out with LINSTOR SR creation error [opterr=LINSTOR SR must be unique in a pool]

If you have this error, the LINSTOR PBDs still exist. Are you sure you forgot the previous SR?

So, with XOSTOR created and the pool made HA with that SR, whenever I create a new VM in that SR and don't choose an affinity host, it takes at least 10-15 minutes to run a migrate task, which shouldn't be necessary because XOSTOR is shared, right?

Regarding the migration: your assumption is correct if the SR is created with shared=true. If you migrate the VM between two hosts that both use the same SR, the migration should be short.
After repairing your SR, you can do a VM migration and send me the logs of the machines if you want, and I can take a look.

ronan-a

@AudleyElwine If you forgot your SR, you can follow the instructions I gave in the previous message. Otherwise, check if the PBDs are correctly plugged to the SR, it's probably not the case.

AudleyElwine

@ronan-a I did not forget the SR so yeah it is the PBDs.
I tried to plug them back with

xe pbd-plug uuid=...

taking the uuid from the

xe pbd-list sr-uuid=xostor-uuid

I was able to plug the PBDs on three hosts, but the fourth and last host reports the following:

Error code: SR_BACKEND_FAILURE_1200
Error parameters: , Cannot update volume uuid 36a23780-2025-4f3f-bade-03c410e63368 to 45537c14-0125-4f6c-a1ad-476552888087: this last one is not empty,

What do you think I should do to get the fourth host's PBD plugged so that I can destroy the SR correctly?

TheiLLeniumStudios

@ronan-a I just tried to reintroduce the SR and got no errors while running xe pbd-create, but it still shows up as an SR with a size of -1. I think I might have corrupted the metadata, as lvs, vgs and pvs all report errors:

[11:09 xcp-ng-node-1 ~]# lvs
  /dev/drbd1014: open failed: No data available
  LV                                                    VG                                              Attr       LSize    Pool        Origin Data%  Meta%  Move Log Cpy%Sync Convert
  712c1f83-d11f-ae07-d2b8-14a823761e6e                  XSLocalEXT-712c1f83-d11f-ae07-d2b8-14a823761e6e -wi-ao---- <182.06g                                                           
  thin_device                                           linstor_group                                   twi-aotz-- <238.24g                    1.64   11.27                           
  xcp-persistent-database_00000                         linstor_group                                   Vwi-aotz--    1.00g thin_device        0.84                                   
  xcp-persistent-ha-statefile_00000                     linstor_group                                   Vwi-aotz--    8.00m thin_device        6.25                                   
  xcp-persistent-redo-log_00000                         linstor_group                                   Vwi-aotz--  260.00m thin_device        0.53                                   
  xcp-volume-13a94a7a-d433-4426-8232-812e3c6dc52e_00000 linstor_group                                   Vwi-aotz--   10.03g thin_device        0.14                                   
  xcp-volume-4b70d69b-9cca-4aa3-842f-09366ac76901_00000 linstor_group                                   Vwi-aotz--   10.03g thin_device        38.67                                  
  xcp-volume-70bf80a2-a008-469a-a7db-0ea92fcfc392_00000 linstor_group                                   Vwi-aotz--   20.00m thin_device        71.88                                  
[11:09 xcp-ng-node-1 ~]# vgs
  /dev/drbd1014: open failed: No data available
  VG                                              #PV #LV #SN Attr   VSize    VFree
  XSLocalEXT-712c1f83-d11f-ae07-d2b8-14a823761e6e   1   1   0 wz--n- <182.06g    0 
  linstor_group                                     1   7   0 wz--n-  238.47g    0 
[11:09 xcp-ng-node-1 ~]# pvs
  /dev/drbd1014: open failed: No data available
  PV         VG                                              Fmt  Attr PSize    PFree
  /dev/sda3  XSLocalEXT-712c1f83-d11f-ae07-d2b8-14a823761e6e lvm2 a--  <182.06g    0 
  /dev/sdb   linstor_group                                   lvm2 a--   238.47g    0 
[11:09 xcp-ng-node-1 ~]# lsblk
NAME                                                     MAJ:MIN  RM   SIZE RO TYPE MOUNTPOINT
drbd1016                                                 147:1016  0    10G  0 disk 
drbd1014                                                 147:1014  0    10G  0 disk 
sdb                                                        8:16    0 238.5G  0 disk 
|-linstor_group-thin_device_tmeta                        253:1     0   120M  0 lvm  
| `-linstor_group-thin_device-tpool                      253:3     0 238.2G  0 lvm  
|   |-linstor_group-xcp--persistent--redo--log_00000     253:10    0   260M  0 lvm  
|   | `-drbd1002                                         147:1002  0 259.7M  0 disk 
|   |-linstor_group-xcp--persistent--database_00000      253:8     0     1G  0 lvm  
|   | `-drbd1000                                         147:1000  0     1G  0 disk /var/lib/linstor
|   |-linstor_group-thin_device                          253:4     0 238.2G  0 lvm  
|   |-linstor_group-xcp--volume--13a94a7a--d433--4426--8232--812e3c6dc52e_00000
                                                         253:11    0    10G  0 lvm  
|   | `-drbd1004                                         147:1004  0    10G  0 disk 
|   |-linstor_group-xcp--persistent--ha--statefile_00000 253:9     0     8M  0 lvm  
|   | `-drbd1001                                         147:1001  0     8M  0 disk 
|   |-linstor_group-xcp--volume--70bf80a2--a008--469a--a7db--0ea92fcfc392_00000
                                                         253:5     0    20M  0 lvm  
|   | `-drbd1009                                         147:1009  0    20M  0 disk 
|   `-linstor_group-xcp--volume--4b70d69b--9cca--4aa3--842f--09366ac76901_00000
                                                         253:12    0    10G  0 lvm  
|     `-drbd1006                                         147:1006  0    10G  0 disk 
`-linstor_group-thin_device_tdata                        253:2     0 238.2G  0 lvm  
  `-linstor_group-thin_device-tpool                      253:3     0 238.2G  0 lvm  
    |-linstor_group-xcp--persistent--redo--log_00000     253:10    0   260M  0 lvm  
    | `-drbd1002                                         147:1002  0 259.7M  0 disk 
    |-linstor_group-xcp--persistent--database_00000      253:8     0     1G  0 lvm  
    | `-drbd1000                                         147:1000  0     1G  0 disk /var/lib/linstor
    |-linstor_group-thin_device                          253:4     0 238.2G  0 lvm  
    |-linstor_group-xcp--volume--13a94a7a--d433--4426--8232--812e3c6dc52e_00000
                                                         253:11    0    10G  0 lvm  
    | `-drbd1004                                         147:1004  0    10G  0 disk 
    |-linstor_group-xcp--persistent--ha--statefile_00000 253:9     0     8M  0 lvm  
    | `-drbd1001                                         147:1001  0     8M  0 disk 
    |-linstor_group-xcp--volume--70bf80a2--a008--469a--a7db--0ea92fcfc392_00000
                                                         253:5     0    20M  0 lvm  
    | `-drbd1009                                         147:1009  0    20M  0 disk 
    `-linstor_group-xcp--volume--4b70d69b--9cca--4aa3--842f--09366ac76901_00000
                                                         253:12    0    10G  0 lvm  
      `-drbd1006                                         147:1006  0    10G  0 disk 
drbd1012                                                 147:1012  0    10G  0 disk 
tda                                                      254:0     0    10G  0 disk 
drbd1015                                                 147:1015  0    10G  0 disk 
drbd1005                                                 147:1005  0    20M  0 disk 
sda                                                        8:0     0 223.6G  0 disk 
|-sda4                                                     8:4     0   512M  0 part /boot/efi
|-sda2                                                     8:2     0    18G  0 part 
|-sda5                                                     8:5     0     4G  0 part /var/log
|-sda3                                                     8:3     0 182.1G  0 part 
| `-XSLocalEXT--712c1f83--d11f--ae07--d2b8--14a823761e6e-712c1f83--d11f--ae07--d2b8--14a823761e6e
                                                         253:0     0 182.1G  0 lvm  /run/sr-mount/712c1f83-d11f-ae07-d2b8-14a82376
|-sda1                                                     8:1     0    18G  0 part /
`-sda6                                                     8:6     0     1G  0 part [SWAP]
tdb                                                      254:1     0    50G  0 disk 
[11:09 xcp-ng-node-1 ~]#

Is it possible to clean up the partition table and recreate it some other way without having to reinstall xcp-ng on the machines? wipefs -a says that the device is in use, so I cannot wipe the partitions.

ronan-a

@AudleyElwine said in XOSTOR hyperconvergence preview:

Ho! Sounds like a bug fixed in the latest beta... In this case, ensure there is no VM running, and download this script:

wget https://gist.githubusercontent.com/Wescoeur/3b5c399b15c4d700b4906f12b51e2591/raw/452acd9ebcd52c62020e796302c681590b37cd3f/gistfile1.txt -O linstor-kv-tool && chmod +x linstor-kv-tool

Find where the linstor-controller is running by executing this command on each host:

[11:13 r620-s1 ~]# mountpoint /var/lib/linstor
/var/lib/linstor is a mountpoint
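
The same check can be scripted if you have many hosts to go through. This is only a sketch in Python: the function name is hypothetical, and it relies on the same idea as the command above, namely that the host running the controller is the one where /var/lib/linstor is mounted.

```python
import os


def is_linstor_database_mounted(path="/var/lib/linstor"):
    """Return True if `path` is a mount point on this host.

    Hypothetical helper: it mirrors `mountpoint /var/lib/linstor`.
    """
    return os.path.ismount(path)


if is_linstor_database_mounted():
    print("/var/lib/linstor is a mountpoint: the controller runs here")
else:
    print("/var/lib/linstor is not a mountpoint on this host")
```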

If it's a mountpoint, you found it. Now, you must execute the script using the local IP of this host, for example:

./linstor-kv-tool --dump-volumes -u 172.16.210.16 -g xcp-sr-linstor_group_thin_device

The group to use is built from <VG_name>_<LV_thin_name> (as in the example above, where it follows the xcp-sr- prefix), or from just <VG_name> if you don't use thin provisioning.
Note: there was a bug in the previous beta, you must double the xcp-sr- prefix. (Example: xcp-sr-xcp-sr-linstor_group_thin_device)
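
To make the naming rule concrete, here is a small Python sketch. The function name and its parameters are hypothetical. The thin-provisioning results match the group names used in this thread; the result for the non-thin case assumes the xcp-sr- prefix applies there too, which the thread does not show explicitly.

```python
def linstor_group_name(vg_name, thin_lv_name=None, double_prefix=False):
    """Build the value passed to `-g` (hypothetical helper).

    vg_name:       LVM volume group, e.g. "linstor_group"
    thin_lv_name:  thin pool LV, e.g. "thin_device"; None without thin provisioning
    double_prefix: True for SRs created with the previous beta (doubled prefix)
    """
    if thin_lv_name is None:
        base = vg_name
    else:
        base = "{}_{}".format(vg_name, thin_lv_name)
    prefix = "xcp-sr-xcp-sr-" if double_prefix else "xcp-sr-"
    return prefix + base


print(linstor_group_name("linstor_group", "thin_device"))
print(linstor_group_name("linstor_group", "thin_device", double_prefix=True))
```

If a dump with one form of the name comes back empty, trying the other form is a cheap check.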

So if this script outputs many entries, you can run it with --remove-all-volumes instead of --dump-volumes. This command should remove the properties in the LINSTOR KV-store. After that, you can dump again to verify.

Now, you can execute a scan on the SR. After that, it's necessary to remove all resource definitions using the linstor binary.

Get the list using:

linstor --controllers=<CONTROLLER_IP> resource-definition list
╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                                    ┊ Port ┊ ResourceGroup                    ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ xcp-persistent-database                         ┊ 7000 ┊ xcp-sr-linstor_group_thin_device ┊ ok    ┊
┊ xcp-volume-0db304a1-89a2-45df-a39d-7c5c39a87c5f ┊ 7006 ┊ xcp-sr-linstor_group_thin_device ┊ ok    ┊
┊ xcp-volume-6289f306-ab2b-4388-a5a2-a20ba18698f8 ┊ 7005 ┊ xcp-sr-linstor_group_thin_device ┊ ok    ┊
┊ xcp-volume-73b9a396-c67f-48b3-8774-f60f1c2af598 ┊ 7001 ┊ xcp-sr-linstor_group_thin_device ┊ ok    ┊
┊ xcp-volume-a46393ef-428d-4af8-9c0e-30b0108bd21a ┊ 7003 ┊ xcp-sr-linstor_group_thin_device ┊ ok    ┊
┊ xcp-volume-b83db8cf-ea3b-47aa-ad77-89b5cd9a1853 ┊ 7002 ┊ xcp-sr-linstor_group_thin_device ┊ ok    ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯

Then execute linstor resource-definition delete <VOLUME> on each volume. But don't do that on the xcp-persistent-database, only on xcp-volume-XXX!
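
When the list is long, the delete commands can be generated so that the database is never touched by accident. A minimal sketch in Python, assuming you copy the resource names out of the table yourself; the helper name is hypothetical and it only prints commands, it does not call linstor.

```python
def deletion_commands(resource_names, controller_ip):
    """Return `resource-definition delete` commands for xcp-volume-* only.

    Hypothetical helper: anything not starting with "xcp-volume-"
    (in particular xcp-persistent-database) is skipped.
    """
    return [
        "linstor --controllers={} resource-definition delete {}".format(
            controller_ip, name
        )
        for name in resource_names
        if name.startswith("xcp-volume-")
    ]


# Names taken from the example table above; the IP is the one used earlier.
names = [
    "xcp-persistent-database",
    "xcp-volume-0db304a1-89a2-45df-a39d-7c5c39a87c5f",
    "xcp-volume-6289f306-ab2b-4388-a5a2-a20ba18698f8",
]
for command in deletion_commands(names, "172.16.210.16"):
    print(command)
```

Review the printed lines before running them: a deleted resource definition cannot be brought back.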

Normally after all these steps, you can destroy the SR properly! I think I will write an automated version for later, like linstor-emergency-destroy.

ronan-a

@TheiLLeniumStudios Can you plug the PBDs? If there is no issue here, you can follow the same steps as AudleyElwine.

AudleyElwine

@ronan-a Thank you for the detailed steps.

I get the following output when I dump my volumes using the xcp-sr-linstor_group_thin_device group:

{
  "xcp/sr/journal/clone/0fb10e9f-b9ef-4b59-8b31-9330f0785514": "86b1b2af-8f1d-4155-9961-d06bbacbb7aa_0e121812-fcae-4d70-960f-ac440b3927e3",
  "xcp/sr/journal/clone/14131ee4-2956-47b7-8728-c9790764f71a": "dfb43813-91eb-46b8-9d56-22c8dbb485fc_917177d5-d03b-495c-b2db-fd62d3d25b86",
  "xcp/sr/journal/clone/45537c14-0125-4f6c-a1ad-476552888087": "36a23780-2025-4f3f-bade-03c410e63368_3e419764-9c8c-4539-9a42-be96f92e5c2a",
  "xcp/sr/journal/clone/54ec7009-2424-4299-a9ad-fb015600b88c": "af89f0fc-7d5a-4236-b249-8d9408f5fb6d_f32f2e8f-a43f-43f5-824b-f673a5cbd988",
  "xcp/sr/journal/clone/558220bc-a900-4408-a62e-a71a4bb4fd7b": "d9294359-c395-4bed-ac3a-bf4027c92bd9_0e18bf3d-78f0-4843-9e8f-ee11c6ebbf5a",
  "xcp/sr/journal/clone/c41e0d47-5c1a-45c3-9404-01f3b5735c0d": "e191eb57-2478-4e3b-be9d-e8eaba8f9efe_41eae673-a280-439b-a4c6-f3afe2390fde",
  "xcp/sr/journal/relink/50170fa2-2ca9-4218-8217-5c99ac31f10b": "1"
}

but --remove-all-volumes does not delete them because they don't start with xcp/volume/.

Also, when I used xcp-sr-xcp-sr-linstor_group_thin_device, a lot of volumes appeared, similar to the following:

{
  "volume/00897d74-53c9-41b4-8f5f-73132e4a9af7/metadata": "{\"read_only\": true, \"snapshot_time\": \"\", \"vdi_type\": \"vhd\", \"snapshot_of\": null, \"name_label\": \"base copy\", \"name_description\": \"\", \"type\": \"user\", \"metadata_of_pool\": \"\", \"is_a_snapshot\": false}",
  "volume/00897d74-53c9-41b4-8f5f-73132e4a9af7/not-exists": "0",
  "volume/00897d74-53c9-41b4-8f5f-73132e4a9af7/volume-name": "xcp-volume-2892500d-d80a-4978-aa87-ab2b39ace9e9",
  "volume/00b0dbb5-2dfa-4fd5-baf4-81065afa2431/metadata": "{\"read_only\": true, \"snapshot_time\": \"\", \"vdi_type\": \"vhd\", \"snapshot_of\": null, \"name_label\": \"base copy\", \"name_description\": \"\", \"type\": \"user\", \"metadata_of_pool\": \"\", \"is_a_snapshot\": false}",
  "volume/00b0dbb5-2dfa-4fd5-baf4-81065afa2431/not-exists": "0",
...
...
...
  "volume/fcbcd0dc-8d90-441e-8d03-e435ac417b96/not-exists": "0",
  "volume/fcbcd0dc-8d90-441e-8d03-e435ac417b96/volume-name": "xcp-volume-f3748b88-1b25-4f18-8f63-4017b09f2ac6",
  "volume/fce3b2e0-1025-4c94-9473-e71562ca11bd/metadata": "{\"read_only\": true, \"snapshot_time\": \"\", \"vdi_type\": \"vhd\", \"snapshot_of\": null, \"name_label\": \"base copy\", \"name_description\": \"\", \"type\": \"user\", \"metadata_of_pool\": \"\", \"is_a_snapshot\": false}",
  "volume/fce3b2e0-1025-4c94-9473-e71562ca11bd/not-exists": "0",
  "volume/fce3b2e0-1025-4c94-9473-e71562ca11bd/volume-name": "xcp-volume-08f1fb0b-d6a3-47eb-893b-6c8b08417726",
  "volume/fe6bc8fd-4211-4b4a-8ee5-ba55a7641053/metadata": "{\"read_only\": true, \"snapshot_time\": \"\", \"vdi_type\": \"vhd\", \"snapshot_of\": null, \"name_label\": \"base copy\", \"name_description\": \"\", \"type\": \"user\", \"metadata_of_pool\": \"\", \"is_a_snapshot\": false}",
  "volume/fe6bc8fd-4211-4b4a-8ee5-ba55a7641053/not-exists": "0",
  "volume/fe6bc8fd-4211-4b4a-8ee5-ba55a7641053/volume-name": "xcp-volume-7a46e0f4-0f61-4a37-b235-1d2bd9eaf033",
  "volume/fe8dc6e6-a2c6-449a-8858-255a37cc8f98/metadata": "{\"read_only\": true, \"snapshot_time\": \"\", \"vdi_type\": \"vhd\", \"snapshot_of\": \"\", \"name_label\": \"base copy\", \"name_description\": \"\", \"type\": \"user\", \"metadata_of_pool\": \"\", \"is_a_snapshot\": false}",
  "volume/fe8dc6e6-a2c6-449a-8858-255a37cc8f98/not-exists": "0",
  "volume/fe8dc6e6-a2c6-449a-8858-255a37cc8f98/volume-name": "xcp-volume-0290c420-9f14-43ae-9af5-fe333b60c7dc",
  "volume/feadfc8d-5aeb-429c-8335-4530aa24cc86/metadata": "{\"read_only\": true, \"snapshot_time\": \"\", \"vdi_type\": \"vhd\", \"snapshot_of\": null, \"name_label\": \"base copy\", \"name_description\": \"\", \"type\": \"user\", \"metadata_of_pool\": \"\", \"is_a_snapshot\": false}",
  "volume/feadfc8d-5aeb-429c-8335-4530aa24cc86/not-exists": "0",
  "volume/feadfc8d-5aeb-429c-8335-4530aa24cc86/volume-name": "xcp-volume-6a37cf38-d6e4-4af3-90d7-84bec3938b20",
  "xcp/sr/metadata": "{\"name_description\": \"\", \"name_label\": \"XOSTOR\"}"
}

and --remove-all-volumes does not work on them either.
I also tried the following, with and without the doubled xcp-sr- prefix, specifying the namespace as /xcp/volume to match the startswith check in the removal code, and it produced an empty JSON.

./linstor-kv-tool --dump-volumes -u 192.168.0.106 -g xcp-sr-linstor_group_thin_device -n /xcp/volume

What do you think I should do?

ronan-a

@AudleyElwine said in XOSTOR hyperconvergence preview:

but the --remove-all-volumes does not delete them because they dont start with xcp/volume/.

Right, that's another problem that is already fixed, but I forgot to put an updated version on my gist, sorry. You can modify the script to use volume/ instead of xcp/volume.
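
To show what the prefix change affects, here is a sketch in Python that works on a plain dict shaped like the dump output, not on the real KV-store. The function name is hypothetical; it selects keys by prefix and splits each one into a namespace and a final sub-key at the last slash.

```python
def split_volume_keys(kv_dump, prefix="volume/"):
    """Return (namespace, sub_key) pairs for every key under `prefix`.

    Hypothetical helper operating on a dict; no LINSTOR call is made.
    """
    pairs = []
    for key in kv_dump:
        if key.startswith(prefix):
            cut = key.rindex("/")  # last slash separates namespace and sub-key
            pairs.append((key[:cut], key[cut + 1:]))
    return pairs


# Keys shaped like the dump shown earlier (values shortened).
sample = {
    "volume/00897d74-53c9-41b4-8f5f-73132e4a9af7/metadata": "{...}",
    "volume/00897d74-53c9-41b4-8f5f-73132e4a9af7/not-exists": "0",
    "xcp/sr/metadata": "{...}",
}

for namespace, sub_key in split_volume_keys(sample):
    print(namespace, sub_key)

# With the old prefix nothing matches, which is why nothing was removed.
print(split_volume_keys(sample, prefix="xcp/volume/"))
```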

AudleyElwine

@ronan-a Thank you for your fast support.

I did these changes

diff -u linstor-kv-tool linstor-kv-tool-modified
--- linstor-kv-tool	2022-11-17 18:57:00.941259380 +0800
+++ linstor-kv-tool-modified	2022-11-17 19:04:15.957504667 +0800
@@ -33,7 +33,7 @@
     kv = linstor.KV(
         group_name,
         uri=controller_uri,
-        namespace='/xcp/volume/{}'.format(vdi_name)
+        namespace='/volume/{}'.format(vdi_name)
     )

     for key, value in list(kv.items()):
@@ -46,11 +46,11 @@
         uri=controller_uri,
         namespace='/'
     )
-
     for key, value in list(kv.items()):
-        if key.startswith('xcp/volume/'):
+        if key.startswith('volume/'):
             size = key.rindex('/')
             kv.namespace = key[:size]
+            print("key is {}".format(repr(key[size + 1:])))
             del kv[key[size + 1:]]

and I got the following error.

./linstor-kv-tool-modified --remove-all-volumes -u 192.168.0.106 -g xcp-sr-xcp-sr-linstor_group_thin_device
key is u'metadata'
Traceback (most recent call last):
  File "./linstor-kv-tool-modified", line 78, in <module>
    main()
  File "./linstor-kv-tool-modified", line 74, in main
    remove_all_volumes(args.uri, args.group_name)
  File "./linstor-kv-tool-modified", line 54, in remove_all_volumes
    del kv[key[size + 1:]]
  File "/usr/lib/python2.7/site-packages/linstor/kv.py", line 151, in __delitem__
    self._del_linstor_kv(k)
  File "/usr/lib/python2.7/site-packages/linstor/kv.py", line 89, in _del_linstor_kv
    raise linstor.LinstorError('Could not delete kv({}): {}'.format(k, rs[0]))
linstor.errors.LinstorError: Error: Could not delete kv(/volume/aec2104e-e501-4d7d-b0fb-95a80e843e0a/metadata): ERRO:Exception thrown.

and I can confirm the volume exist when I dump all of them

"volume/aec2104e-e501-4d7d-b0fb-95a80e843e0a/metadata": "{\"read_only\": true, \"snapshot_time\": \"\", \"vdi_type\": \"vhd\", \"snapshot_of\": \"\", \"name_label\": \"base copy\", \"name_description\": \"\", \"type\": \"user\", \"metadata_of_pool\": \"\", \"is_a_snapshot\": false}",
  "volume/aec2104e-e501-4d7d-b0fb-95a80e843e0a/not-exists": "0",
  "volume/aec2104e-e501-4d7d-b0fb-95a80e843e0a/volume-name": "xcp-volume-b1748285-7cda-429f-b230-50dfba161e9c",

May I ask what you recommend I do? And thank you for your continued support.

ronan-a

@AudleyElwine said in XOSTOR hyperconvergence preview:
Really strange... Maybe there is a lock or another issue with LINSTOR. In the worst case you can retry after a reboot of all hosts. If it's still stuck I can take a look using a support tunnel; I'm not sure I understand why you have this error.

AudleyElwine

@ronan-a I started updating xcp-ng on my four nodes (eva, phoebe, mike, ozly) so that they would both update and restart.
The nodes were updated with the rolling method. Three of them updated fine, but the fourth (mike, which is not the one that refuses to plug its PBD; that one is eva) had its task stuck at 0.000 progress for 3 hours. I restarted the toolstack on mike, but that didn't change anything, so I restarted the toolstack on the master node (eva). Then, when I tried to update mike manually from XOA, it gave me this error:

-1(global name 'commmand' is not defined, , Traceback (most recent call last):
  File "/etc/xapi.d/plugins/xcpngutils/__init__.py", line 101, in wrapper
    return func(*args, **kwds)
  File "/etc/xapi.d/plugins/updater.py", line 96, in decorator
    return func(*args, **kwargs)
  File "/etc/xapi.d/plugins/updater.py", line 157, in update
    raise error
NameError: global name 'commmand' is not defined
)

The good news is that the linstor controller has moved to a different node (phoebe) from the old one (mike), and I was able to delete all volumes shown by linstor --controllers=... resource-definition list except for the database. Yet the PBD on eva still could not be plugged. XOA also still shows me a lot of disks, and when I scan the SR I get the error SR_HAS_NO_PBDS.

So now the mike server can't update, and the eva server can't plug its PBD, while all the other servers are connected. Note that eva was the server on which I started my linstor installation.

Do you have any thoughts on what I can do to fix this without reinstalling xcp-ng on mike?

AudleyElwine

Figured out the issue when I tried to update from the CLI instead: the /var/log partition was full because /var/log/linstor-controller held something like 3.5G+ of data (90% of the /var/log volume). Maybe it accumulated because of the past errors. I deleted these logs and mike updated normally.
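
For anyone hitting the same symptom, a quick way to spot it is to compare the size of the log directory with how full its partition is. This is a rough sketch, assuming a Python 3 interpreter is available (shutil.disk_usage does not exist in the Python 2.7 seen in the traceback above); the function names are hypothetical.

```python
import os
import shutil


def directory_size_bytes(root):
    """Sum the sizes of regular files under `root` (0 if it does not exist)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.isfile(path) and not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return total


def partition_used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if os.path.isdir("/var/log"):
    size_mib = directory_size_bytes("/var/log/linstor-controller") / (1024 * 1024)
    print("linstor-controller logs: {:.1f} MiB".format(size_mib))
    print("/var/log used: {:.0%}".format(partition_used_fraction("/var/log")))
```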

Now, regarding plugging the PBD on eva (the one host that is not connecting to the SR), I get the following error.

Error code: SR_BACKEND_FAILURE_202
Error parameters: , General backend error [opterr=Base copy 36a23780-2025-4f3f-bade-03c410e63368 not present, but no original 45537c14-0125-4f6c-a1ad-476552888087 found],

This is what linstor resource-definition list is showing:

[03:59 eva ~]# linstor --controllers=192.168.0.108 resource-definition list -p
+---------------------------------------------------------------------------+
| ResourceName            | Port | ResourceGroup                    | State |
|===========================================================================|
| xcp-persistent-database | 7000 | xcp-sr-linstor_group_thin_device | ok    |
+---------------------------------------------------------------------------+

And here is the KV store for linstor from that script

[04:01 phoebe ~]# mountpoint /var/lib/linstor
/var/lib/linstor is a mountpoint
[04:01 phoebe ~]# ./linstor-kv-tool-modified --dump-volumes -u 192.168.0.108 -g xcp-sr-xcp-sr-linstor_group_thin_device
{
  "xcp/sr/metadata": "{\"name_description\": \"\", \"name_label\": \"XOSTOR\"}"
}
[04:01 phoebe ~]# ./linstor-kv-tool-modified --dump-volumes -u 192.168.0.108 -g xcp-sr-linstor_group_thin_device
{
  "xcp/sr/journal/clone/0fb10e9f-b9ef-4b59-8b31-9330f0785514": "86b1b2af-8f1d-4155-9961-d06bbacbb7aa_0e121812-fcae-4d70-960f-ac440b3927e3",
  "xcp/sr/journal/clone/14131ee4-2956-47b7-8728-c9790764f71a": "dfb43813-91eb-46b8-9d56-22c8dbb485fc_917177d5-d03b-495c-b2db-fd62d3d25b86",
  "xcp/sr/journal/clone/45537c14-0125-4f6c-a1ad-476552888087": "36a23780-2025-4f3f-bade-03c410e63368_3e419764-9c8c-4539-9a42-be96f92e5c2a",
  "xcp/sr/journal/clone/54ec7009-2424-4299-a9ad-fb015600b88c": "af89f0fc-7d5a-4236-b249-8d9408f5fb6d_f32f2e8f-a43f-43f5-824b-f673a5cbd988",
  "xcp/sr/journal/clone/558220bc-a900-4408-a62e-a71a4bb4fd7b": "d9294359-c395-4bed-ac3a-bf4027c92bd9_0e18bf3d-78f0-4843-9e8f-ee11c6ebbf5a",
  "xcp/sr/journal/clone/c41e0d47-5c1a-45c3-9404-01f3b5735c0d": "e191eb57-2478-4e3b-be9d-e8eaba8f9efe_41eae673-a280-439b-a4c6-f3afe2390fde",
  "xcp/sr/journal/relink/50170fa2-2ca9-4218-8217-5c99ac31f10b": "1"
}

I destroyed the PBD and then recreated it, just to get it to connect so I could destroy the SR, but the same error happened when I tried to plug the new PBD, which has the same config as the other PBDs.

AudleyElwine

Those two UUIDs are in the output of ./linstor-kv-tool-modified --dump-volumes -u 192.168.0.108 -g xcp-sr-linstor_group_thin_device:

./linstor-kv-tool-modified --dump-volumes -u 192.168.0.108 -g xcp-sr-linstor_group_thin_device
{
  "xcp/sr/journal/clone/0fb10e9f-b9ef-4b59-8b31-9330f0785514": "86b1b2af-8f1d-4155-9961-d06bbacbb7aa_0e121812-fcae-4d70-960f-ac440b3927e3",
  "xcp/sr/journal/clone/14131ee4-2956-47b7-8728-c9790764f71a": "dfb43813-91eb-46b8-9d56-22c8dbb485fc_917177d5-d03b-495c-b2db-fd62d3d25b86",
  "xcp/sr/journal/clone/45537c14-0125-4f6c-a1ad-476552888087": "36a23780-2025-4f3f-bade-03c410e63368_3e419764-9c8c-4539-9a42-be96f92e5c2a",
  "xcp/sr/journal/clone/54ec7009-2424-4299-a9ad-fb015600b88c": "af89f0fc-7d5a-4236-b249-8d9408f5fb6d_f32f2e8f-a43f-43f5-824b-f673a5cbd988",
  "xcp/sr/journal/clone/558220bc-a900-4408-a62e-a71a4bb4fd7b": "d9294359-c395-4bed-ac3a-bf4027c92bd9_0e18bf3d-78f0-4843-9e8f-ee11c6ebbf5a",
  "xcp/sr/journal/clone/c41e0d47-5c1a-45c3-9404-01f3b5735c0d": "e191eb57-2478-4e3b-be9d-e8eaba8f9efe_41eae673-a280-439b-a4c6-f3afe2390fde",
  "xcp/sr/journal/relink/50170fa2-2ca9-4218-8217-5c99ac31f10b": "1"
}

So I basically deleted all of the keys here. Maybe I should not have done that, but when I did, eva plugged in correctly to the SR and I was finally able to destroy the SR from XOA. So yeah, happy ending. I will try the next beta version. Thank you @ronan-a for your work.
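A less drastic first step than deleting every key would be to identify which journal entry the error refers to. The sketch below is only based on what is visible in this thread: the error message names a "Base copy" UUID and an "original" UUID, and the dump contains a xcp/sr/journal/clone/<uuid> key whose value is two UUIDs joined by an underscore. The meaning given to the two halves of the value is an assumption drawn from that match, not from the driver's source, and the function names are hypothetical.

```python
import json
import re

# Pattern of the opterr text quoted in the SR_BACKEND_FAILURE_202 error above.
ERROR_RE = re.compile(
    r"Base copy (?P<base>[0-9a-f-]{36}) not present, "
    r"but no original (?P<orig>[0-9a-f-]{36}) found"
)
CLONE_PREFIX = "xcp/sr/journal/clone/"


def parse_clone_journal(dump_text):
    """Map the UUID in each clone key to the two UUIDs stored in its value."""
    entries = {}
    for key, value in json.loads(dump_text).items():
        if key.startswith(CLONE_PREFIX):
            first, _, second = value.partition("_")
            entries[key[len(CLONE_PREFIX):]] = (first, second)
    return entries


def find_blocking_entry(error_text, dump_text):
    """Return the journal key matching the error, or None if there is none."""
    match = ERROR_RE.search(error_text)
    if match is None:
        return None
    entry = parse_clone_journal(dump_text).get(match.group("orig"))
    # Assumption: the first half of the value is the base copy UUID.
    if entry is not None and entry[0] == match.group("base"):
        return CLONE_PREFIX + match.group("orig")
    return None
```

Fed the error text and the JSON dump from this thread, find_blocking_entry returns the 45537c14-... clone key. Whether removing only that key would have been enough is not something this thread confirms.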

TheiLLeniumStudios

@ronan-a I tried following the guide that you posted to remove the linstor volumes manually but the resource-definition list command already showed a bunch of resources in a "DELETING" state.

[22:24 xcp-ng-node-1 ~]# linstor --controllers=192.168.10.211 resource-definition list
╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                                    ┊ Port ┊ ResourceGroup                    ┊ State    ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ xcp-persistent-database                         ┊ 7000 ┊ xcp-sr-linstor_group_thin_device ┊ ok       ┊
┊ xcp-persistent-ha-statefile                     ┊ 7001 ┊ xcp-sr-linstor_group_thin_device ┊ ok       ┊
┊ xcp-persistent-redo-log                         ┊ 7002 ┊ xcp-sr-linstor_group_thin_device ┊ ok       ┊
┊ xcp-volume-13a94a7a-d433-4426-8232-812e3c6dc52e ┊ 7004 ┊ xcp-sr-linstor_group_thin_device ┊ DELETING ┊
┊ xcp-volume-4b70d69b-9cca-4aa3-842f-09366ac76901 ┊ 7006 ┊ xcp-sr-linstor_group_thin_device ┊ ok       ┊
┊ xcp-volume-50aa2e9f-caf0-4b0d-82f3-35893987e53b ┊ 7010 ┊ xcp-sr-linstor_group_thin_device ┊ DELETING ┊
┊ xcp-volume-55c5c3fb-6782-46d6-8a81-f4a5f7cca691 ┊ 7012 ┊ xcp-sr-linstor_group_thin_device ┊ ok       ┊
┊ xcp-volume-5ebca692-6a61-47ec-8cac-e4fa0b6cc38a ┊ 7016 ┊ xcp-sr-linstor_group_thin_device ┊ ok       ┊
┊ xcp-volume-668bcb64-1150-43ac-baaa-db7b92331506 ┊ 7014 ┊ xcp-sr-linstor_group_thin_device ┊ ok       ┊
┊ xcp-volume-6f5235da-8f01-4057-a172-5e68bcb3f423 ┊ 7007 ┊ xcp-sr-linstor_group_thin_device ┊ DELETING ┊
┊ xcp-volume-70bf80a2-a008-469a-a7db-0ea92fcfc392 ┊ 7009 ┊ xcp-sr-linstor_group_thin_device ┊ ok       ┊
┊ xcp-volume-92d4d363-ef03-4d3c-9d47-bef5cb1ca181 ┊ 7015 ┊ xcp-sr-linstor_group_thin_device ┊ ok       ┊
┊ xcp-volume-9a413b51-2625-407a-b05c-62bff025b947 ┊ 7005 ┊ xcp-sr-linstor_group_thin_device ┊ ok       ┊
┊ xcp-volume-a02d160d-34fc-4fd6-957d-c7f3f9206ae2 ┊ 7008 ┊ xcp-sr-linstor_group_thin_device ┊ DELETING ┊
┊ xcp-volume-ed04ffda-b379-4be7-8935-4f534f969a3f ┊ 7003 ┊ xcp-sr-linstor_group_thin_device ┊ DELETING ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯

Executing resource-definition delete has no impact on them. I just get the following output:

[22:24 xcp-ng-node-1 ~]# linstor resource-definition delete xcp-volume-13a94a7a-d433-4426-8232-812e3c6dc52e
SUCCESS:
Description:
    Resource definition 'xcp-volume-13a94a7a-d433-4426-8232-812e3c6dc52e' marked for deletion.
Details:
    Resource definition 'xcp-volume-13a94a7a-d433-4426-8232-812e3c6dc52e' UUID is: 52aceda9-b19b-461a-a119-f62931ba1af9
WARNING:
Description:
    No active connection to satellite 'xcp-ng-node-3'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
SUCCESS:
    Resource 'xcp-volume-13a94a7a-d433-4426-8232-812e3c6dc52e' on 'xcp-ng-node-1' deleted
SUCCESS:
    Resource 'xcp-volume-13a94a7a-d433-4426-8232-812e3c6dc52e' on 'xcp-ng-node-2' deleted

I can confirm that node-1 can reach node-3, which it is complaining about for some reason. I can also see node-3 in XO and can run VMs on it.
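To keep track of which resources stay stuck between attempts, the table printed by resource-definition list can be filtered by state. This is a rough sketch that parses the text output as it appears in this thread (both the box-drawing layout and the ASCII layout produced with -p). The function name is hypothetical, and it assumes the four columns shown above.

```python
def resources_in_state(table_text, state="DELETING"):
    """Return the resource names whose State column equals `state`."""
    names = []
    for line in table_text.splitlines():
        # Treat the box-drawing separator like the ASCII one used with -p.
        cells = [c.strip() for c in line.replace("┊", "|").split("|")]
        cells = [c for c in cells if c]
        # Border and header lines do not yield four data cells with a state.
        if len(cells) == 4 and cells[0] != "ResourceName" and cells[3] == state:
            names.append(cells[0])
    return names
```

Pasting the table above into resources_in_state(...) would list the five xcp-volume-... entries marked DELETING, which makes it easy to see whether the list shrinks once the satellite on node-3 reconnects.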

ronan-a

@TheiLLeniumStudios In this case, if DRBD is completely stuck, you can reboot your hosts. There is probably a lock on the volumes, or processes holding a lock on them.

ronan-a

@AudleyElwine Thank you for your feedback; I will update the script to handle the journal cases.

TheiLLeniumStudios

@ronan-a Two of the nodes broke after restarting. I just kept getting a blinking cursor at the top left of the screen for hours. I'm going to have to reprovision all the nodes again, sadly.

AudleyElwine

Hey @ronan-a ,

What should I do to lower the chance of something from the past XOSTOR installation affecting my new installation?
lsblk is still showing the linstor volumes, and vgs is also showing linstor_group.
Will a wipefs -af be enough? Or is the "Destroying SR" button in XOA enough?

ronan-a

@AudleyElwine The PVs/VGs are kept after an SR.destroy call, but it's totally safe to reuse them for a new installation. The content of /var/lib/linstor is not removed after a destroy call, but it's normally not used, because the linstor database is shared between hosts using a DRBD volume and mounted in this directory by the running controller. So there are no manual steps for you to execute here.

Of course, if you want to reuse your disks for something else, wipefs is nice for that.