XOSTOR 8.3 controller crash with guest OSes shutting down filesystem
-
Hello,
I am currently testing an XOSTOR volume (XCP-ng 8.3, build 11 Oct 2024, three hosts) and have run into a two-part problem:
- the LINSTOR controller crashed; I am attaching the ErrorReport from /var/log/linstor-controller/, excerpt:
Error message: Failed to start transaction
Error message:
Error message: IO Exception: null [90028-197]
Error message: Reading from nio:/var/lib/linstor/linstordb.mv.db failed; file length 901120 read length 8192 at 0 [1.4.197/1]
Error message: Input/output error
As far as I can tell, a controller was immediately started on one of the remaining hosts, but
- the Linux VMs (all 3 of them) lost access to their disks ("Shutting down filesystem"). They are up-to-date CentOS 7; here is a console screenshot:
-
After rebooting the VMs, everything went back to normal without any further action.
So it seems the biggest issue was the guest OSes giving up at the time of the controller crash.
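For completeness, a guest-side check along these lines should confirm whether XFS really shut down and whether it came back cleanly after the reboot (the device name is just an example):
# inside an affected guest, right after the event (xvda is an example device)
dmesg -T | grep -iE 'xfs|i/o error|blkfront'
# after the reboot, confirm the root filesystem is mounted read-write again
grep ' / ' /proc/mounts
# and that the XFS log was replayed without complaints
journalctl -k -b | grep -i xfs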
ErrorReport-679F8267-00000-000001.log.txt
Can we do something about it?
-
Afterwards, I left two VMs using XOSTOR storage, each one on a different host, and "Shutting down filesystem" happened on only one of them, with the following report generated on the LINSTOR controller:
ErrorReport-67B37339-00000-000000.log.txt
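Both report IDs above can also be dumped in full on whichever node currently runs the controller; if I am not mistaken, the client commands are roughly (IDs taken from the attachments):
# list the error reports known to the controller
linstor error-reports list
# print a single report in full
linstor error-reports show 679F8267-00000-000001
linstor error-reports show 67B37339-00000-000000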
Kind regards,
-
Hi,
XOSTOR isn't yet supported officially on 8.3.
-
@olivierlambert
Hi, yes, thank you, I am aware of that. I read all the docs and forum threads available, didn't find anything on the subject, and just wanted to share the experience. Should I assume it's a known problem? After all, that's what betas are for.
Thanks,
-
-
@Dark199 In practice you should have more info via dmesg or kern.log. I have never seen this error until now; since it impacts VMs, I am afraid it is something quite serious. Are your disks OK? Do you have enough RAM in the Dom-0?
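Something like this on each host should be enough to check (the disk device is only an example, adjust to your hardware):
# on each host (Dom-0)
dmesg -T | grep -iE 'drbd|xfs|i/o error'
tail -n 50 /var/log/kern.log
free -m                  # Dom-0 memory headroom
smartctl -H /dev/sda     # disk health, if smartmontools is installed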
-
@ronan-a
Hello, I am uploading kern.log and drbd-kern.log for both events.
drbd-kern.Feb06.log.txt
kern.Feb06.log.txt
drbd-kern.Feb17.log.txt
kern.Feb17.log.txt
Disks and RAM are 100% OK. But the kernel logs make me wonder: how should XOSTOR react to a short network outage?
The VMs did have a local primary DRBD resource (diskful volume; all the data they needed was available on a local disk):
# linstor resource list
| ResourceName                                    | Node       | Port | Usage  | Conns | State      | CreatedOn           |
| xcp-persistent-database                         | xencc-hp03 | 7000 | Unused | Ok    | UpToDate   | 2025-02-02 15:28:19 |
| xcp-persistent-database                         | xenrt-1    | 7000 | InUse  | Ok    | UpToDate   | 2025-02-02 15:28:18 |
| xcp-persistent-database                         | xenrt-2    | 7000 | Unused | Ok    | Diskless   | 2025-02-02 15:28:17 |
| xcp-volume-623a917e-614f-4176-8e58-505248ee9db4 | xencc-hp03 | 7004 | InUse  | Ok    | UpToDate   | 2025-02-02 15:35:18 |
| xcp-volume-623a917e-614f-4176-8e58-505248ee9db4 | xenrt-1    | 7004 | Unused | Ok    | UpToDate   | 2025-02-02 15:35:17 |
| xcp-volume-623a917e-614f-4176-8e58-505248ee9db4 | xenrt-2    | 7004 | Unused | Ok    | TieBreaker | 2025-02-02 15:35:17 |
| xcp-volume-9dd3dc66-aa58-40f2-aa56-14b8846a4278 | xencc-hp03 | 7007 | Unused | Ok    | UpToDate   | 2025-02-04 16:18:46 |
| xcp-volume-9dd3dc66-aa58-40f2-aa56-14b8846a4278 | xenrt-1    | 7007 | Unused | Ok    | UpToDate   | 2025-02-04 16:18:46 |
| xcp-volume-9dd3dc66-aa58-40f2-aa56-14b8846a4278 | xenrt-2    | 7007 | Unused | Ok    | TieBreaker | 2025-02-04 16:18:46 |
| xcp-volume-e9428d9d-97a7-4a37-a2bb-630f8b5f3f0f | xencc-hp03 | 7005 | Unused | Ok    | UpToDate   | 2025-02-02 15:42:40 |
| xcp-volume-e9428d9d-97a7-4a37-a2bb-630f8b5f3f0f | xenrt-1    | 7005 | InUse  | Ok    | UpToDate   | 2025-02-02 15:42:40 |
| xcp-volume-e9428d9d-97a7-4a37-a2bb-630f8b5f3f0f | xenrt-2    | 7005 | Unused | Ok    | TieBreaker | 2025-02-02 15:42:39 |
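In case it is useful, the DRBD side of a short disconnect can be inspected directly on the hosts, roughly like this (resource name taken from the list above; the grep just pulls out the usual peer-timeout knobs):
# live connection / resync state of one volume
drbdadm status xcp-volume-623a917e-614f-4176-8e58-505248ee9db4
# effective net options that decide when a peer is declared dead
drbdsetup show --show-defaults xcp-volume-623a917e-614f-4176-8e58-505248ee9db4 | grep -E 'ping-int|timeout|ko-count|connect-int'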
-
@ronan-a
[...]
64 bytes from 172.27.18.161: icmp_seq=21668 ttl=64 time=0.805 ms
64 bytes from 172.27.18.161: icmp_seq=21669 ttl=64 time=0.737 ms
64 bytes from 172.27.18.161: icmp_seq=21670 ttl=64 time=0.750 ms
64 bytes from 172.27.18.161: icmp_seq=21671 ttl=64 time=0.780 ms
64 bytes from 172.27.18.161: icmp_seq=21672 ttl=64 time=0.774 ms
64 bytes from 172.27.18.161: icmp_seq=21673 ttl=64 time=0.737 ms
64 bytes from 172.27.18.161: icmp_seq=21674 ttl=64 time=0.773 ms
64 bytes from 172.27.18.161: icmp_seq=21675 ttl=64 time=0.835 ms
64 bytes from 172.27.18.161: icmp_seq=21676 ttl=64 time=0.755 ms
1004711/1004716 packets, 0% loss, min/avg/ewma/max = 0.712/1.033/0.775/195.781 ms
I am attaching simple ping stats for the last 11 days. I don't think we can blame the network.
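For reference, as far as I know the stats line above is what a single long-running ping prints when it receives SIGQUIT, so an interim summary can be taken at any time without stopping it; roughly (the log path is only an example):
# keep a continuous ping to the storage peer running in the background
nohup ping -i 1 172.27.18.161 >> /var/log/ping-xostor.log 2>&1 &
# ask the running ping for an interim "packets / loss / min/avg/ewma/max" line
kill -QUIT $(pgrep -f 'ping -i 1 172.27.18.161')
tail -n 5 /var/log/ping-xostor.log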