XOSTOR 8.3 controller crash with guest OSes shutting down filesystem
-
Hello,
I am currently testing an XOSTOR volume (XCP-ng 8.3, build 11 Oct 2024, three hosts) and have run into a two-part problem:
- the LINSTOR controller crashed; I am attaching the ErrorReport from /var/log/linstor-controller/, excerpt:
Error message: Failed to start transaction
Error message:
Error message: IO Exception: null [90028-197]
Error message: Reading from nio:/var/lib/linstor/linstordb.mv.db failed; file length 901120 read length 8192 at 0 [1.4.197/1]
Error message: Input/output error
As far as I can tell, a controller was immediately started on one of the remaining hosts, but
- the Linux VMs (all 3 of them) lost access to their disks ("Shutting down filesystem"). They are up-to-date CentOS 7; here is a console screenshot:
-
After rebooting the VMs, everything went back to normal without any further action.
So it seems the biggest issue was the guest OSes giving up at the time of the controller crash.
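For completeness, a guest-side check along these lines should confirm whether XFS really shut down and whether it came back cleanly after the reboot (the device name is just an example):
# inside an affected guest, right after the event (xvda is an example device)
dmesg -T | grep -iE 'xfs|i/o error|blkfront'
# after the reboot, confirm the root filesystem is mounted read-write again
grep ' / ' /proc/mounts
# and that the XFS log was replayed without complaints
journalctl -k -b | grep -i xfs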
ErrorReport-679F8267-00000-000001.log.txt
Can we do something about it?
-
Afterwards, I left two VMs using XOSTOR storage, each one on a different host, and "Shutting down filesystem" happened on only one of them, with the following report generated on the LINSTOR controller:
ErrorReport-67B37339-00000-000000.log.txt
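Both report IDs above can also be dumped in full on whichever node currently runs the controller; if I am not mistaken, the client commands are roughly (IDs taken from the attachments):
# list the error reports known to the controller
linstor error-reports list
# print a single report in full
linstor error-reports show 679F8267-00000-000001
linstor error-reports show 67B37339-00000-000000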
Kind regards,
-
Hi,
XOSTOR isn't yet supported officially on 8.3.
-
@olivierlambert
Hi, yes, thank you, I am aware of that. I read all the docs and forum threads available, didn't find anything on the subject, and just wanted to share the experience. Should I assume it's a known problem? After all, that's what betas are for.
Thanks,
-
-
@Dark199 In practice you should have more info via dmesg or kern.log. I have never seen this error until now; since it impacts VMs, I am afraid it is something quite serious. Are your disks OK? Do you have enough RAM in the Dom-0?
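Something like this on each host should be enough to check (the disk device is only an example, adjust to your hardware):
# on each host (Dom-0)
dmesg -T | grep -iE 'drbd|xfs|i/o error'
tail -n 50 /var/log/kern.log
free -m                  # Dom-0 memory headroom
smartctl -H /dev/sda     # disk health, if smartmontools is installed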
-
@ronan-a
Hello, I am uploading kern.log and drbd-kern.log for both events.
drbd-kern.Feb06.log.txt
kern.Feb06.log.txt
drbd-kern.Feb17.log.txt
kern.Feb17.log.txt
Disks and RAM are 100% OK. But the kernel logs make me wonder: how should XOSTOR react to a short network outage?
The VMs did have a local primary DRBD resource (diskful volume; all the data they needed was available on a local disk):
# linstor resource list
| ResourceName                                    | Node       | Port | Usage  | Conns | State      | CreatedOn           |
| xcp-persistent-database                         | xencc-hp03 | 7000 | Unused | Ok    | UpToDate   | 2025-02-02 15:28:19 |
| xcp-persistent-database                         | xenrt-1    | 7000 | InUse  | Ok    | UpToDate   | 2025-02-02 15:28:18 |
| xcp-persistent-database                         | xenrt-2    | 7000 | Unused | Ok    | Diskless   | 2025-02-02 15:28:17 |
| xcp-volume-623a917e-614f-4176-8e58-505248ee9db4 | xencc-hp03 | 7004 | InUse  | Ok    | UpToDate   | 2025-02-02 15:35:18 |
| xcp-volume-623a917e-614f-4176-8e58-505248ee9db4 | xenrt-1    | 7004 | Unused | Ok    | UpToDate   | 2025-02-02 15:35:17 |
| xcp-volume-623a917e-614f-4176-8e58-505248ee9db4 | xenrt-2    | 7004 | Unused | Ok    | TieBreaker | 2025-02-02 15:35:17 |
| xcp-volume-9dd3dc66-aa58-40f2-aa56-14b8846a4278 | xencc-hp03 | 7007 | Unused | Ok    | UpToDate   | 2025-02-04 16:18:46 |
| xcp-volume-9dd3dc66-aa58-40f2-aa56-14b8846a4278 | xenrt-1    | 7007 | Unused | Ok    | UpToDate   | 2025-02-04 16:18:46 |
| xcp-volume-9dd3dc66-aa58-40f2-aa56-14b8846a4278 | xenrt-2    | 7007 | Unused | Ok    | TieBreaker | 2025-02-04 16:18:46 |
| xcp-volume-e9428d9d-97a7-4a37-a2bb-630f8b5f3f0f | xencc-hp03 | 7005 | Unused | Ok    | UpToDate   | 2025-02-02 15:42:40 |
| xcp-volume-e9428d9d-97a7-4a37-a2bb-630f8b5f3f0f | xenrt-1    | 7005 | InUse  | Ok    | UpToDate   | 2025-02-02 15:42:40 |
| xcp-volume-e9428d9d-97a7-4a37-a2bb-630f8b5f3f0f | xenrt-2    | 7005 | Unused | Ok    | TieBreaker | 2025-02-02 15:42:39 |
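In case it is useful, the DRBD side of a short disconnect can be inspected directly on the hosts, roughly like this (resource name taken from the list above; the grep just pulls out the usual peer-timeout knobs):
# live connection / resync state of one volume
drbdadm status xcp-volume-623a917e-614f-4176-8e58-505248ee9db4
# effective net options that decide when a peer is declared dead
drbdsetup show --show-defaults xcp-volume-623a917e-614f-4176-8e58-505248ee9db4 | grep -E 'ping-int|timeout|ko-count|connect-int'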
-
@ronan-a
[...]
64 bytes from 172.27.18.161: icmp_seq=21668 ttl=64 time=0.805 ms
64 bytes from 172.27.18.161: icmp_seq=21669 ttl=64 time=0.737 ms
64 bytes from 172.27.18.161: icmp_seq=21670 ttl=64 time=0.750 ms
64 bytes from 172.27.18.161: icmp_seq=21671 ttl=64 time=0.780 ms
64 bytes from 172.27.18.161: icmp_seq=21672 ttl=64 time=0.774 ms
64 bytes from 172.27.18.161: icmp_seq=21673 ttl=64 time=0.737 ms
64 bytes from 172.27.18.161: icmp_seq=21674 ttl=64 time=0.773 ms
64 bytes from 172.27.18.161: icmp_seq=21675 ttl=64 time=0.835 ms
64 bytes from 172.27.18.161: icmp_seq=21676 ttl=64 time=0.755 ms
1004711/1004716 packets, 0% loss, min/avg/ewma/max = 0.712/1.033/0.775/195.781 ms
I am attaching simple ping stats for the last 11 days. I don't think we can blame the network.
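For reference, as far as I know the stats line above is what a single long-running ping prints when it receives SIGQUIT, so an interim summary can be taken at any time without stopping it; roughly (the log path is only an example):
# keep a continuous ping to the storage peer running in the background
nohup ping -i 1 172.27.18.161 >> /var/log/ping-xostor.log 2>&1 &
# ask the running ping for an interim "packets / loss / min/avg/ewma/max" line
kill -QUIT $(pgrep -f 'ping -i 1 172.27.18.161')
tail -n 5 /var/log/ping-xostor.log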