what is kernel-alt?
-
Hello,
Because of a number of bugs and issues in the current dom0 kernel, I am currently backporting Ceph code from kernel 4.19.295 to https://github.com/xcp-ng-rpms/kernel. It works great, but while I was upgrading the kernel on my test cluster, I noticed there is a package named kernel-alt which contains 4.19.265.
What is that kernel? Is it stable? Does it contain all the usual patches? Which repo does it come from?
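As a side note, a quick way to see what a package like kernel-alt is and where it comes from is to query the package manager on dom0. This is a sketch: the live queries depend on the host, so they are shown as comments, followed by a small version-aware comparison of the two kernel versions mentioned above.

```shell
# On a dom0 host (output is environment-dependent, so shown as comments):
#   uname -r                  # kernel currently running
#   rpm -q kernel kernel-alt  # installed base and alternate kernel packages
#   yum info kernel-alt       # the "From repo" field names the source repository

# Comparing the two upstream versions mentioned above with a version-aware sort:
printf '4.19.265\n4.19.295\n' | sort -V | tail -n 1
# prints 4.19.295
```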
-
Hi,
This is documented: https://docs.xcp-ng.org/installation/hardware/#alternate-kernel
-
Great. Anyway, here is the patch I made: https://github.com/xcp-ng-rpms/kernel/pull/9
I am still testing it, but it compiles and works just fine on the XCP-ng lab I have. It's going to take a while until I can confirm whether it fixes the issues I was encountering, though, as they were rather rare.
-
Adding @stormi
I think it would be more reasonable to start patching on 8.3, because this seems to be a lot of modification for an LTS release. Anyway, we'll take a look and get back to you; please do the same in terms of Ceph stability in your own tests! Thanks
-
Updating the driver on 8.3 would indeed be an option, if we can establish that patching it alone, without any other changes to the kernel, is enough and doesn't bring regressions over the current driver.
Another option would be packaging the newer driver as a separate `ceph-module-alt` RPM (or `ceph-modules-alt` if there are several kernel drivers).
-
Hello, I can confirm that the patches I made bring significant stability improvements. I faced kernel crashes related to Ceph again, but only on XCP-ng hosts that aren't patched. For example, this is one of the bugs I am hitting on an unpatched kernel:
[Fri Oct 13 11:10:32 2023] libceph: osd1 up
[Fri Oct 13 11:10:34 2023] libceph: osd1 up
[Fri Oct 13 11:10:39 2023] libceph: osd7 up
[Fri Oct 13 11:10:40 2023] libceph: osd7 up
[Fri Oct 13 11:10:41 2023] WARNING: CPU: 6 PID: 32615 at net/ceph/osd_client.c:554 request_reinit+0x128/0x150 [libceph]
[Fri Oct 13 11:10:41 2023] Modules linked in: btrfs xor zstd_compress lzo_compress raid6_pq zstd_decompress xxhash rbd tun ebtable_filter ebtables ceph libceph rpcsec_gss_krb5 nfsv4 nfs fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc bonding bridge 8021q garp mrp stp llc dm_multipath ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sg ipmi_si ipmi_devintf ipmi_msghandler video backlight acpi_power_meter nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables rndis_host cdc_ether usbnet mii hid_generic usbhid hid raid1 md_mod sd_mod ahci libahci xhci_pci igb(O) libata
[Fri Oct 13 11:10:41 2023] ixgbe(O) xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt
[Fri Oct 13 11:10:41 2023] CPU: 6 PID: 32615 Comm: kworker/6:19 Tainted: G W O 4.19.0+1 #1
[Fri Oct 13 11:10:41 2023] Hardware name: Supermicro Super Server/X12STH-LN4F, BIOS 1.2 06/23/2022
[Fri Oct 13 11:10:41 2023] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[Fri Oct 13 11:10:41 2023] RIP: e030:request_reinit+0x128/0x150 [libceph]
[Fri Oct 13 11:10:41 2023] Code: 5d 41 5e 41 5f c3 48 89 f9 48 c7 c2 b1 77 83 c0 48 c7 c6 96 ad 83 c0 48 c7 c7 98 5b 85 c0 31 c0 e8 ed a8 b9 c0 e9 37 ff ff ff <0f> 0b e9 41 ff ff ff 0f 0b e9 60 ff ff ff 0f 0b 0f 1f 84 00 00 00
[Fri Oct 13 11:10:41 2023] RSP: e02b:ffffc90045b67b88 EFLAGS: 00010202
[Fri Oct 13 11:10:41 2023] RAX: 0000000000000002 RBX: ffff8881c6704f00 RCX: ffff8881f27a10e0
[Fri Oct 13 11:10:41 2023] RDX: ffffffff00000002 RSI: ffff8881c7e97448 RDI: ffff8881c7d5b780
[Fri Oct 13 11:10:41 2023] RBP: ffff8881c6704700 R08: ffff8881c7e97450 R09: ffff8881c7e97450
[Fri Oct 13 11:10:41 2023] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881c7d5b780
[Fri Oct 13 11:10:41 2023] R13: fffffffffffffffe R14: 0000000000000000 R15: 0000000000000001
[Fri Oct 13 11:10:41 2023] FS: 0000000000000000(0000) GS:ffff8881f2780000(0000) knlGS:0000000000000000
[Fri Oct 13 11:10:41 2023] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Oct 13 11:10:41 2023] CR2: 00007f7ef7bf2000 CR3: 0000000136dbc000 CR4: 0000000000040660
[Fri Oct 13 11:10:41 2023] Call Trace:
[Fri Oct 13 11:10:41 2023] send_linger+0x55/0x200 [libceph]
[Fri Oct 13 11:10:41 2023] ceph_osdc_handle_map+0x4e7/0x6b0 [libceph]
[Fri Oct 13 11:10:41 2023] dispatch+0x2ff/0xbc0 [libceph]
[Fri Oct 13 11:10:41 2023] ? read_partial_message+0x265/0x810 [libceph]
[Fri Oct 13 11:10:41 2023] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[Fri Oct 13 11:10:41 2023] ceph_con_workfn+0xa51/0x24f0 [libceph]
[Fri Oct 13 11:10:41 2023] ? xen_hypercall_xen_version+0xa/0x20
[Fri Oct 13 11:10:41 2023] ? xen_hypercall_xen_version+0xa/0x20
[Fri Oct 13 11:10:41 2023] ? __switch_to_asm+0x34/0x70
[Fri Oct 13 11:10:41 2023] ? xen_force_evtchn_callback+0x9/0x10
[Fri Oct 13 11:10:41 2023] ? check_events+0x12/0x20
[Fri Oct 13 11:10:41 2023] process_one_work+0x165/0x370
[Fri Oct 13 11:10:41 2023] worker_thread+0x49/0x3e0
[Fri Oct 13 11:10:41 2023] kthread+0xf8/0x130
[Fri Oct 13 11:10:41 2023] ? rescuer_thread+0x310/0x310
[Fri Oct 13 11:10:41 2023] ? kthread_bind+0x10/0x10
[Fri Oct 13 11:10:41 2023] ret_from_fork+0x1f/0x40
[Fri Oct 13 11:10:41 2023] ---[ end trace 1ac50e4ca0f4e449 ]---
[Fri Oct 13 11:10:50 2023] libceph: osd4 up
[Fri Oct 13 11:10:50 2023] libceph: osd4 up
[Fri Oct 13 11:10:51 2023] WARNING: CPU: 11 PID: 3500 at net/ceph/osd_client.c:554 request_reinit+0x128/0x150 [libceph]
[Fri Oct 13 11:10:51 2023] Modules linked in: btrfs xor zstd_compress lzo_compress raid6_pq zstd_decompress xxhash rbd tun ebtable_filter ebtables ceph libceph rpcsec_gss_krb5 nfsv4 nfs fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc bonding bridge 8021q garp mrp stp llc dm_multipath ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sg ipmi_si ipmi_devintf ipmi_msghandler video backlight acpi_power_meter nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables rndis_host cdc_ether usbnet mii hid_generic usbhid hid raid1 md_mod sd_mod ahci libahci xhci_pci igb(O) libata
[Fri Oct 13 11:10:51 2023] ixgbe(O) xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt
[Fri Oct 13 11:10:51 2023] CPU: 11 PID: 3500 Comm: kworker/11:16 Tainted: G W O 4.19.0+1 #1
[Fri Oct 13 11:10:51 2023] Hardware name: Supermicro Super Server/X12STH-LN4F, BIOS 1.2 06/23/2022
[Fri Oct 13 11:10:51 2023] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[Fri Oct 13 11:10:51 2023] RIP: e030:request_reinit+0x128/0x150 [libceph]
[Fri Oct 13 11:10:51 2023] Code: 5d 41 5e 41 5f c3 48 89 f9 48 c7 c2 b1 77 83 c0 48 c7 c6 96 ad 83 c0 48 c7 c7 98 5b 85 c0 31 c0 e8 ed a8 b9 c0 e9 37 ff ff ff <0f> 0b e9 41 ff ff ff 0f 0b e9 60 ff ff ff 0f 0b 0f 1f 84 00 00 00
[Fri Oct 13 11:10:51 2023] RSP: e02b:ffffc900461d7b88 EFLAGS: 00010202
[Fri Oct 13 11:10:51 2023] RAX: 0000000000000002 RBX: ffff8881c62b6d00 RCX: 0000000000000000
[Fri Oct 13 11:10:51 2023] RDX: ffff8881c59c0740 RSI: ffff888137a50200 RDI: ffff8881c59c04a0
[Fri Oct 13 11:10:51 2023] RBP: ffff8881c62b6b00 R08: ffff8881f17c2e00 R09: ffff8881f162ba00
[Fri Oct 13 11:10:51 2023] R10: 0000000000000000 R11: 000000000000cb1b R12: ffff8881c59c04a0
[Fri Oct 13 11:10:51 2023] R13: fffffffffffffffe R14: 0000000000000000 R15: 0000000000000001
[Fri Oct 13 11:10:51 2023] FS: 0000000000000000(0000) GS:ffff8881f28c0000(0000) knlGS:0000000000000000
[Fri Oct 13 11:10:51 2023] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Oct 13 11:10:51 2023] CR2: 00007fa6188896c8 CR3: 00000001b7014000 CR4: 0000000000040660
[Fri Oct 13 11:10:51 2023] Call Trace:
[Fri Oct 13 11:10:51 2023] send_linger+0x55/0x200 [libceph]
[Fri Oct 13 11:10:51 2023] ceph_osdc_handle_map+0x4e7/0x6b0 [libceph]
[Fri Oct 13 11:10:51 2023] dispatch+0x2ff/0xbc0 [libceph]
[Fri Oct 13 11:10:51 2023] ? read_partial_message+0x265/0x810 [libceph]
[Fri Oct 13 11:10:51 2023] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[Fri Oct 13 11:10:51 2023] ceph_con_workfn+0xa51/0x24f0 [libceph]
[Fri Oct 13 11:10:51 2023] ? check_preempt_curr+0x84/0x90
[Fri Oct 13 11:10:51 2023] ? ttwu_do_wakeup+0x19/0x140
[Fri Oct 13 11:10:51 2023] process_one_work+0x165/0x370
[Fri Oct 13 11:10:51 2023] worker_thread+0x49/0x3e0
[Fri Oct 13 11:10:51 2023] kthread+0xf8/0x130
[Fri Oct 13 11:10:51 2023] ? rescuer_thread+0x310/0x310
[Fri Oct 13 11:10:51 2023] ? kthread_bind+0x10/0x10
[Fri Oct 13 11:10:51 2023] ret_from_fork+0x1f/0x40
[Fri Oct 13 11:10:51 2023] ---[ end trace 1ac50e4ca0f4e44a ]---
[Fri Oct 13 11:11:00 2023] rbd: rbd1: no lock owners detected
[Fri Oct 13 11:11:07 2023] rbd: rbd1: no lock owners detected
Patched hosts don't see any of these; their logs just say:
[Fri Oct 13 11:11:47 2023] libceph: osd1 up
[Fri Oct 13 11:11:54 2023] libceph: osd7 up
[Fri Oct 13 11:12:05 2023] libceph: osd4 up
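When comparing patched and unpatched hosts, counting occurrences of the warning is a simple check. A sketch, with sample lines inlined from the traces above so it runs stand-alone; on a live host you would pipe `dmesg` into the same grep:

```shell
# On a real host:  dmesg | grep -c 'request_reinit'
# Sample lines inlined from the unpatched host's log shown above:
printf '%s\n' \
  '[Fri Oct 13 11:10:41 2023] WARNING: CPU: 6 PID: 32615 at net/ceph/osd_client.c:554 request_reinit+0x128/0x150 [libceph]' \
  '[Fri Oct 13 11:10:50 2023] libceph: osd4 up' \
  '[Fri Oct 13 11:10:51 2023] WARNING: CPU: 11 PID: 3500 at net/ceph/osd_client.c:554 request_reinit+0x128/0x150 [libceph]' \
  | grep -c 'request_reinit'
# prints 2
```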
And this is only the less severe kind of crash; I sometimes face crashes that make Ceph completely inaccessible on unpatched hypervisors, requiring a host reboot.
I strongly recommend incorporating these patches for anyone who is using Ceph with XCP-ng.
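For anyone applying the patches, a sketch of how to verify which ceph.ko a host would actually load (the module path below is a hypothetical example, and the live calls are commented out since their output depends on the installed kernel):

```shell
# On a live host (environment-dependent, so shown as comments):
#   modinfo -F filename ceph                # path of the ceph.ko that would load
#   rpm -qf "$(modinfo -F filename ceph)"   # which RPM owns that file
# Extracting the module path from modinfo-style output (sample path is hypothetical):
printf 'filename: /lib/modules/4.19.0+1/kernel/fs/ceph/ceph.ko\nlicense: GPL\n' \
  | awk '/^filename:/ {print $2}'
# prints /lib/modules/4.19.0+1/kernel/fs/ceph/ceph.ko
```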
-
@stormi so it sounds reasonable to integrate it into 8.3 then?
-
@olivierlambert I think so. ceph.ko isn't a core module for XCP-ng anyway, so I don't think there's a high risk in patching it, especially with patches coming from upstream kernel.org.
Regarding the initial question, XCP-ng 8.3 will also have a newer kernel-alt. However, I don't recommend it for production, because it is far less tested in the context of XCP-ng.