@olivierlambert
Hello, I can confirm that the patches I made contain significant stability improvements. I again hit kernel crashes related to Ceph, but only on XCP-ng hosts that aren't patched. For example, here is one of the bugs I'm hitting on an unpatched kernel:
[Fri Oct 13 11:10:32 2023] libceph: osd1 up
[Fri Oct 13 11:10:34 2023] libceph: osd1 up
[Fri Oct 13 11:10:39 2023] libceph: osd7 up
[Fri Oct 13 11:10:40 2023] libceph: osd7 up
[Fri Oct 13 11:10:41 2023] WARNING: CPU: 6 PID: 32615 at net/ceph/osd_client.c:554 request_reinit+0x128/0x150 [libceph]
[Fri Oct 13 11:10:41 2023] Modules linked in: btrfs xor zstd_compress lzo_compress raid6_pq zstd_decompress xxhash rbd tun ebtable_filter ebtables ceph libceph rpcsec_gss_krb5 nfsv4 nfs fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc bonding bridge 8021q garp mrp stp llc dm_multipath ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sg ipmi_si ipmi_devintf ipmi_msghandler video backlight acpi_power_meter nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables rndis_host cdc_ether usbnet mii hid_generic usbhid hid raid1 md_mod sd_mod ahci libahci xhci_pci igb(O) libata
[Fri Oct 13 11:10:41 2023] ixgbe(O) xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt
[Fri Oct 13 11:10:41 2023] CPU: 6 PID: 32615 Comm: kworker/6:19 Tainted: G W O 4.19.0+1 #1
[Fri Oct 13 11:10:41 2023] Hardware name: Supermicro Super Server/X12STH-LN4F, BIOS 1.2 06/23/2022
[Fri Oct 13 11:10:41 2023] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[Fri Oct 13 11:10:41 2023] RIP: e030:request_reinit+0x128/0x150 [libceph]
[Fri Oct 13 11:10:41 2023] Code: 5d 41 5e 41 5f c3 48 89 f9 48 c7 c2 b1 77 83 c0 48 c7 c6 96 ad 83 c0 48 c7 c7 98 5b 85 c0 31 c0 e8 ed a8 b9 c0 e9 37 ff ff ff <0f> 0b e9 41 ff ff ff 0f 0b e9 60 ff ff ff 0f 0b 0f 1f 84 00 00 00
[Fri Oct 13 11:10:41 2023] RSP: e02b:ffffc90045b67b88 EFLAGS: 00010202
[Fri Oct 13 11:10:41 2023] RAX: 0000000000000002 RBX: ffff8881c6704f00 RCX: ffff8881f27a10e0
[Fri Oct 13 11:10:41 2023] RDX: ffffffff00000002 RSI: ffff8881c7e97448 RDI: ffff8881c7d5b780
[Fri Oct 13 11:10:41 2023] RBP: ffff8881c6704700 R08: ffff8881c7e97450 R09: ffff8881c7e97450
[Fri Oct 13 11:10:41 2023] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881c7d5b780
[Fri Oct 13 11:10:41 2023] R13: fffffffffffffffe R14: 0000000000000000 R15: 0000000000000001
[Fri Oct 13 11:10:41 2023] FS: 0000000000000000(0000) GS:ffff8881f2780000(0000) knlGS:0000000000000000
[Fri Oct 13 11:10:41 2023] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Oct 13 11:10:41 2023] CR2: 00007f7ef7bf2000 CR3: 0000000136dbc000 CR4: 0000000000040660
[Fri Oct 13 11:10:41 2023] Call Trace:
[Fri Oct 13 11:10:41 2023] send_linger+0x55/0x200 [libceph]
[Fri Oct 13 11:10:41 2023] ceph_osdc_handle_map+0x4e7/0x6b0 [libceph]
[Fri Oct 13 11:10:41 2023] dispatch+0x2ff/0xbc0 [libceph]
[Fri Oct 13 11:10:41 2023] ? read_partial_message+0x265/0x810 [libceph]
[Fri Oct 13 11:10:41 2023] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[Fri Oct 13 11:10:41 2023] ceph_con_workfn+0xa51/0x24f0 [libceph]
[Fri Oct 13 11:10:41 2023] ? xen_hypercall_xen_version+0xa/0x20
[Fri Oct 13 11:10:41 2023] ? xen_hypercall_xen_version+0xa/0x20
[Fri Oct 13 11:10:41 2023] ? __switch_to_asm+0x34/0x70
[Fri Oct 13 11:10:41 2023] ? xen_force_evtchn_callback+0x9/0x10
[Fri Oct 13 11:10:41 2023] ? check_events+0x12/0x20
[Fri Oct 13 11:10:41 2023] process_one_work+0x165/0x370
[Fri Oct 13 11:10:41 2023] worker_thread+0x49/0x3e0
[Fri Oct 13 11:10:41 2023] kthread+0xf8/0x130
[Fri Oct 13 11:10:41 2023] ? rescuer_thread+0x310/0x310
[Fri Oct 13 11:10:41 2023] ? kthread_bind+0x10/0x10
[Fri Oct 13 11:10:41 2023] ret_from_fork+0x1f/0x40
[Fri Oct 13 11:10:41 2023] ---[ end trace 1ac50e4ca0f4e449 ]---
[Fri Oct 13 11:10:50 2023] libceph: osd4 up
[Fri Oct 13 11:10:50 2023] libceph: osd4 up
[Fri Oct 13 11:10:51 2023] WARNING: CPU: 11 PID: 3500 at net/ceph/osd_client.c:554 request_reinit+0x128/0x150 [libceph]
[Fri Oct 13 11:10:51 2023] Modules linked in: btrfs xor zstd_compress lzo_compress raid6_pq zstd_decompress xxhash rbd tun ebtable_filter ebtables ceph libceph rpcsec_gss_krb5 nfsv4 nfs fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc bonding bridge 8021q garp mrp stp llc dm_multipath ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sg ipmi_si ipmi_devintf ipmi_msghandler video backlight acpi_power_meter nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables rndis_host cdc_ether usbnet mii hid_generic usbhid hid raid1 md_mod sd_mod ahci libahci xhci_pci igb(O) libata
[Fri Oct 13 11:10:51 2023] ixgbe(O) xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt
[Fri Oct 13 11:10:51 2023] CPU: 11 PID: 3500 Comm: kworker/11:16 Tainted: G W O 4.19.0+1 #1
[Fri Oct 13 11:10:51 2023] Hardware name: Supermicro Super Server/X12STH-LN4F, BIOS 1.2 06/23/2022
[Fri Oct 13 11:10:51 2023] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[Fri Oct 13 11:10:51 2023] RIP: e030:request_reinit+0x128/0x150 [libceph]
[Fri Oct 13 11:10:51 2023] Code: 5d 41 5e 41 5f c3 48 89 f9 48 c7 c2 b1 77 83 c0 48 c7 c6 96 ad 83 c0 48 c7 c7 98 5b 85 c0 31 c0 e8 ed a8 b9 c0 e9 37 ff ff ff <0f> 0b e9 41 ff ff ff 0f 0b e9 60 ff ff ff 0f 0b 0f 1f 84 00 00 00
[Fri Oct 13 11:10:51 2023] RSP: e02b:ffffc900461d7b88 EFLAGS: 00010202
[Fri Oct 13 11:10:51 2023] RAX: 0000000000000002 RBX: ffff8881c62b6d00 RCX: 0000000000000000
[Fri Oct 13 11:10:51 2023] RDX: ffff8881c59c0740 RSI: ffff888137a50200 RDI: ffff8881c59c04a0
[Fri Oct 13 11:10:51 2023] RBP: ffff8881c62b6b00 R08: ffff8881f17c2e00 R09: ffff8881f162ba00
[Fri Oct 13 11:10:51 2023] R10: 0000000000000000 R11: 000000000000cb1b R12: ffff8881c59c04a0
[Fri Oct 13 11:10:51 2023] R13: fffffffffffffffe R14: 0000000000000000 R15: 0000000000000001
[Fri Oct 13 11:10:51 2023] FS: 0000000000000000(0000) GS:ffff8881f28c0000(0000) knlGS:0000000000000000
[Fri Oct 13 11:10:51 2023] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Oct 13 11:10:51 2023] CR2: 00007fa6188896c8 CR3: 00000001b7014000 CR4: 0000000000040660
[Fri Oct 13 11:10:51 2023] Call Trace:
[Fri Oct 13 11:10:51 2023] send_linger+0x55/0x200 [libceph]
[Fri Oct 13 11:10:51 2023] ceph_osdc_handle_map+0x4e7/0x6b0 [libceph]
[Fri Oct 13 11:10:51 2023] dispatch+0x2ff/0xbc0 [libceph]
[Fri Oct 13 11:10:51 2023] ? read_partial_message+0x265/0x810 [libceph]
[Fri Oct 13 11:10:51 2023] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[Fri Oct 13 11:10:51 2023] ceph_con_workfn+0xa51/0x24f0 [libceph]
[Fri Oct 13 11:10:51 2023] ? check_preempt_curr+0x84/0x90
[Fri Oct 13 11:10:51 2023] ? ttwu_do_wakeup+0x19/0x140
[Fri Oct 13 11:10:51 2023] process_one_work+0x165/0x370
[Fri Oct 13 11:10:51 2023] worker_thread+0x49/0x3e0
[Fri Oct 13 11:10:51 2023] kthread+0xf8/0x130
[Fri Oct 13 11:10:51 2023] ? rescuer_thread+0x310/0x310
[Fri Oct 13 11:10:51 2023] ? kthread_bind+0x10/0x10
[Fri Oct 13 11:10:51 2023] ret_from_fork+0x1f/0x40
[Fri Oct 13 11:10:51 2023] ---[ end trace 1ac50e4ca0f4e44a ]---
[Fri Oct 13 11:11:00 2023] rbd: rbd1: no lock owners detected
[Fri Oct 13 11:11:07 2023] rbd: rbd1: no lock owners detected
Patched hosts don't show any of these warnings; their logs just say:
[Fri Oct 13 11:11:47 2023] libceph: osd1 up
[Fri Oct 13 11:11:54 2023] libceph: osd7 up
[Fri Oct 13 11:12:05 2023] libceph: osd4 up
And this is only the least severe crash: on unpatched hypervisors I sometimes face crashes that make Ceph completely inaccessible, requiring a host reboot.
I strongly recommend incorporating these patches for anyone using Ceph with XCP-ng.
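For anyone who wants to check whether their own hosts are hitting this, a quick sketch of a check against the kernel ring buffer (assuming `dmesg` is available in dom0; the pattern is taken from the warning line in the trace above):

```shell
# Count libceph request_reinit warnings in the kernel ring buffer.
# Pattern matches lines like:
#   WARNING: CPU: 6 PID: 32615 at net/ceph/osd_client.c:554 request_reinit+0x128/0x150 [libceph]
dmesg | grep -cE 'WARNING:.*request_reinit.*\[libceph\]' \
  || echo "no request_reinit warnings found"
```

A non-zero count on a host is a sign it is running the unpatched kernel and hitting this bug.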