@olivierlambert Just produced another reboot. I'm closing in on the way to replicate this issue.
Best posts made by darabontors
-
RE: Very scary host reboot issue
-
RE: Very scary host reboot issue
[ 334371.865769] ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 334371.865787] INFO: PGD 2250ed067 P4D 2250ed067 PUD 228c9f067 PMD 0
[ 334371.865803] WARN: Oops: 0000 [#1] SMP NOPTI
[ 334371.865810] WARN: CPU: 9 PID: 57 Comm: ksoftirqd/9 Tainted: G O 4.19.0+1 #1
[ 334371.865818] WARN: Hardware name: Dell Inc. PowerEdge R720/0C4Y3R, BIOS 2.9.0 12/06/2019
[ 334371.865832] WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
[ 334371.865839] WARN: Code: 90 cc 00 00 00 48 03 90 d0 00 00 00 48 63 44 24 40 48 83 c0 03 48 c1 e0 04 48 01 d0 48 89 18 c7 40 08 00 00 00 00 44 89 78 0c <48> 8b 43 08 a8 01 0f 85 3f 04 00 00 48 8b 44 24 30 48 83 78 20 ff
[ 334371.865858] WARN: RSP: e02b:ffffc9004026b6f8 EFLAGS: 00010282
[ 334371.865864] WARN: RAX: ffff888099621ae0 RBX: 0000000000000000 RCX: 00000000000000c0
[ 334371.865873] WARN: RDX: ffff888099621ac0 RSI: ffff888099621ac0 RDI: ffffea00031da880
[ 334371.865881] WARN: RBP: 0000000000000000 R08: ffff888099621a00 R09: ffff8881f0d43e98
[ 334371.865890] WARN: R10: ffffc9004026b8b0 R11: 0000000000000000 R12: ffff888096e61c00
[ 334371.865898] WARN: R13: 0000000000000000 R14: ffff88822b867a80 R15: 0000000000000000
[ 334371.865918] WARN: FS: 0000000000000000(0000) GS:ffff88822d440000(0000) knlGS:0000000000000000
[ 334371.865927] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 334371.865935] WARN: CR2: 0000000000000008 CR3: 00000002281aa000 CR4: 0000000000040660
[ 334371.865949] WARN: Call Trace:
[ 334371.865958] WARN: skb_clone+0x71/0xa0
[ 334371.865968] WARN: do_execute_actions+0x4ec/0x1750 [openvswitch]
[ 334371.865978] WARN: ? ovs_dp_process_packet+0x7d/0x110 [openvswitch]
[ 334371.865988] WARN: ? ovs_vport_receive+0x6e/0xd0 [openvswitch]
[ 334371.865997] WARN: ? arch_local_irq_restore+0x5/0x10
[ 334371.866005] WARN: ? get_page_from_freelist+0xa4f/0xf00
[ 334371.866012] WARN: ? arch_local_irq_restore+0x5/0x10
[ 334371.866020] WARN: ? get_page_from_freelist+0xa4f/0xf00
[ 334371.866031] WARN: ovs_execute_actions+0x47/0x120 [openvswitch]
[ 334371.866040] WARN: ovs_dp_process_packet+0x7d/0x110 [openvswitch]
[ 334371.866050] WARN: ? key_extract+0xa53/0xd60 [openvswitch]
[ 334371.866058] WARN: ovs_vport_receive+0x6e/0xd0 [openvswitch]
[ 334371.866066] WARN: ? __alloc_skb+0x4e/0x270
[ 334371.866075] WARN: ? notify_remote_via_irq+0x4a/0x70
[ 334371.866085] WARN: ? __raw_callee_save_xen_vcpu_stolen+0x11/0x20
[ 334371.866091] WARN: ? __alloc_skb+0x76/0x270
[ 334371.866100] WARN: ? arch_local_irq_restore+0x5/0x10
[ 334371.866108] WARN: ? __slab_alloc.constprop.81+0x42/0x4e
[ 334371.866114] WARN: ? __alloc_skb+0x4e/0x270
[ 334371.866120] WARN: ? __kmalloc_track_caller+0x58/0x200
[ 334371.866127] WARN: ? __slab_alloc.constprop.81+0x42/0x4e
[ 334371.866136] WARN: ? __kmalloc_reserve.isra.48+0x29/0x70
[ 334371.866146] WARN: netdev_frame_hook+0x105/0x180 [openvswitch]
[ 334371.866154] WARN: __netif_receive_skb_core+0x211/0xb30
[ 334371.866163] WARN: __netif_receive_skb_one_core+0x36/0x70
[ 334371.866170] WARN: netif_receive_skb_internal+0x34/0xe0
[ 334371.866179] WARN: xenvif_tx_action+0x55c/0x990
[ 334371.866187] WARN: xenvif_poll+0x27/0x70
[ 334371.866193] WARN: net_rx_action+0x2a5/0x3e0
[ 334371.866200] WARN: __do_softirq+0xd1/0x28c
[ 334371.866208] WARN: run_ksoftirqd+0x26/0x40
[ 334371.866215] WARN: smpboot_thread_fn+0x10e/0x160
[ 334371.866223] WARN: kthread+0xf8/0x130
[ 334371.866229] WARN: ? sort_range+0x20/0x20
[ 334371.866235] WARN: ? kthread_bind+0x10/0x10
[ 334371.866242] WARN: ret_from_fork+0x35/0x40
[ 334371.866250] WARN: Modules linked in: tun bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc dm_multipath ipt_REJECT nf_reject_ipv4 xt_tcpu$
[ 334371.866374] WARN: scsi_mod efivarfs ipv6 crc_ccitt
[ 334371.866384] WARN: CR2: 0000000000000008
[ 334371.866396] WARN: ---[ end trace 8b74661a79be8268 ]---
[ 334371.868712] WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
[ 334371.868721] WARN: Code: 90 cc 00 00 00 48 03 90 d0 00 00 00 48 63 44 24 40 48 83 c0 03 48 c1 e0 04 48 01 d0 48 89 18 c7 40 08 00 00 00 00 44 89 78 0c <48> 8b 43 08 a8 01 0f 85 3f 04 00 00 48 8b 44 24 30 48 83 78 20 ff
[ 334371.868740] WARN: RSP: e02b:ffffc9004026b6f8 EFLAGS: 00010282
[ 334371.868748] WARN: RAX: ffff888099621ae0 RBX: 0000000000000000 RCX: 00000000000000c0
[ 334371.868759] WARN: RDX: ffff888099621ac0 RSI: ffff888099621ac0 RDI: ffffea00031da880
[ 334371.868769] WARN: RBP: 0000000000000000 R08: ffff888099621a00 R09: ffff8881f0d43e98
[ 334371.868778] WARN: R10: ffffc9004026b8b0 R11: 0000000000000000 R12: ffff888096e61c00
[ 334371.868788] WARN: R13: 0000000000000000 R14: ffff88822b867a80 R15: 0000000000000000
[ 334371.868805] WARN: FS: 0000000000000000(0000) GS:ffff88822d440000(0000) knlGS:0000000000000000
[ 334371.868815] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 334371.868823] WARN: CR2: 0000000000000008 CR3: 00000002281aa000 CR4: 0000000000040660
[ 334371.868837] EMERG: Kernel panic - not syncing: Fatal exception in interrupt
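To compare one crash against the next, it can help to reduce each oops to the faulting function plus the reliable call-trace frames. The following is only a rough sketch: it assumes the log is available with one entry per line (as the kernel emits it), and `summarize_oops` and `SAMPLE` are hypothetical names made up for this illustration, not part of any XCP-ng tool.

```python
import re

# A call-trace entry such as
#   "[ 334371.865968] WARN: do_execute_actions+0x4ec/0x1750 [openvswitch]"
# A leading "?" marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\]\s+\w+:\s+(?P<unreliable>\?\s+)?(?P<symbol>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?\s*$"
)
RIP_RE = re.compile(r"RIP:\s+\w+:(?P<symbol>[\w.]+)\+0x")


def summarize_oops(log_text):
    """Return (faulting_symbol, [(symbol, module_or_None), ...])."""
    rip = None
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if rip is None:
            match = RIP_RE.search(line)
            if match:
                rip = match.group("symbol")
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            match = FRAME_RE.search(line)
            if match is None:
                break  # e.g. "Modules linked in:" ends the trace
            if not match.group("unreliable"):
                frames.append((match.group("symbol"), match.group("module")))
    return rip, frames


# A few lines taken from the trace in this post.
SAMPLE = """\
[ 334371.865832] WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
[ 334371.865949] WARN: Call Trace:
[ 334371.865958] WARN: skb_clone+0x71/0xa0
[ 334371.865968] WARN: do_execute_actions+0x4ec/0x1750 [openvswitch]
[ 334371.865978] WARN: ? ovs_dp_process_packet+0x7d/0x110 [openvswitch]
[ 334371.866250] WARN: Modules linked in: tun
"""

rip, frames = summarize_oops(SAMPLE)
print(rip)
for symbol, module in frames:
    print(" ", symbol, f"[{module}]" if module else "")
```

If two reboots give the same faulting symbol and the same reliable frames, they are most likely the same bug, which makes the summary easier to paste into a report than the full register dump.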
-
RE: Xen Orchestra cannot connect to XCP-ng Host
I found the problem.
I am using OPNsense and forgot to disable TX checksum offloading. It is very interesting that checksum offloading caused catastrophic network disruptions on a Realtek NIC but no noticeable performance hit on Intel NICs. This was an old host with a Realtek card; all my recent hosts have only Intel NICs, which is why I forgot about the whole offloading thing.
Thanks for the tips.
Best wishes to the whole community!
Latest posts made by darabontors
-
RE: Xen Orchestra cannot connect to XCP-ng Host
@Danp
There is a WireGuard site-to-site VPN set up. If ping works from inside the VM hosting Xen Orchestra, how can Xen Orchestra have no access?
I am almost sure it is a certificate issue of some kind. I would like to generate a new certificate or somehow make Xen Orchestra ignore the certificate. I think XCP-ng Center ignores it by default, which is why it works from XCP-ng Center.
What do you think?
-
RE: Xen Orchestra cannot connect to XCP-ng Host
@Danp Thanks for responding.
I don't use an HTTP proxy. I can ping the host from Xen Orchestra and Xen Orchestra from the host.
I did get this error message in the logs:
server.enable
{
"id": "XXXXXXXXXXXXX"
}
{
"originalUrl": "https://X.X.X.X/jsonrpc",
"url": "https://X.X.X.X/jsonrpc",
"call": {
"method": "session.login_with_password",
"params": "* obfuscated *"
},
"message": "408 Request Timeout",
"name": "Error",
"stack": "Error: 408 Request Timeout
at Object.assertSuccess (/opt/xen-orchestra/node_modules/http-request-plus/index.js:162:19)
at httpRequestPlus (/opt/xen-orchestra/node_modules/http-request-plus/index.js:217:22)
at file:///opt/xen-orchestra/packages/xen-api/transports/json-rpc.mjs:13:17"
}
I can connect via XCP-ng Center to the host, no problem. It's just Xen Orchestra that can't connect.
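A successful ping only shows that ICMP gets through; the failing call above goes over HTTPS. A minimal sketch for checking the TCP connection and the TLS handshake separately from the Xen Orchestra VM follows. `probe_https` is a hypothetical helper written for this illustration, and certificate verification is deliberately switched off so that a self-signed certificate cannot be what makes the probe fail.

```python
import socket
import ssl


def probe_https(host, port=443, timeout=5.0):
    """Return 'tcp-failed', 'tls-failed' or 'ok' for host:port."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out before the TCP handshake finished.
        return "tcp-failed"

    # Accept any certificate: this probe is about transport, not trust.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with context.wrap_socket(sock, server_hostname=host):
            return "ok"
    except (ssl.SSLError, OSError):
        # TCP connected, but the TLS handshake failed or timed out.
        return "tls-failed"
    finally:
        sock.close()


# Example call (replace the placeholder with the host's management IP):
#   print(probe_https("X.X.X.X"))
```

If this returns `ok` while Xen Orchestra still times out, the certificate is probably not the culprit and the problem is more likely to be in how larger packets are handled on the path.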
-
Xen Orchestra cannot connect to XCP-ng Host
Dear community,
I have a strange connection problem. I have the following situation:
I need to install XCP-ng with a DHCP-assigned IP address so that I can connect it to my Xen Orchestra. I can connect to the host with this DHCP IP address. After I finish setting up XCP-ng from Xen Orchestra, I need to give the host a new management IP: a static IP on a VLAN network.
After the IP change, I could connect to the host with this new IP. After moving the host to a different location, suddenly there is an unspecified connection error while connecting to the host. This problem is only between Xen Orchestra and the host. I can connect with XCP-ng Center to the host, no problem. All networking works as it should.
I should mention that when I changed the IP of the host, I also changed the root password.
I suspect it is a certificate issue. It is the self-signed certificate that XCP-ng generated during installation.
The host is not exposed to the public internet. I use a VPN to connect it to Xen Orchestra.
I'm using Xen Orchestra from the sources.
Please help me fix this issue. This is a remote host and I already reinstalled XCP-ng, but the issue came back.
-
RE: Very scary host reboot issue
- Absolutely no idea how to do this in Windows. I looked for any MTU setting but couldn't find any.
- This is not a viable workaround for me, but it might be useful for pinning the issue to the Xen PV driver. Maybe I'll do some more testing on spare hardware.
- I read this, but I don't know how to test it. I didn't have any manual MTUs set so I don't know what values were before the update.
What most definitely fixed the issue for me was using PCIe passthrough for the WAN interface. I used a 10 GbE NIC. It uses the ix driver (ix0) so IDK if this is related. Somehow PPPoE + WG + Windows Client on the virtual interface (Xen PV driver) in OPNsense produces this issue.
At the moment I am happy with this mitigation.
I'm spread a little thin on free time at the moment. Anyone care to test this further?
-
RE: Very scary host reboot issue
@Andrew That makes sense. I think I'll do just this. In the meantime I'll try to replicate the phenomenon on test hardware. I really need a permanent fix for this.
-
RE: Very scary host reboot issue
@olivierlambert I'm thinking of a quick workaround. What if I use PCI passthrough for the LAN and WAN interfaces, physically connect the LAN port to another port of the server that is not passed through, and use that port to interface with my other VMs via OVS? Does it make any sense? Does it seem viable to mitigate this issue?
-
RE: Very scary host reboot issue
@olivierlambert said in Very scary host reboot issue:
FreeBSD PV driver inside OPNsense or Pfsense.
Who is maintaining the FreeBSD PV drivers?
-
RE: Very scary host reboot issue
I found the MTU parameter. This time it was 1420 on both OPNsense WG interface and in Windows (client side). I was happy for about 5 minutes as I wasn't able to reproduce the crash, but then it happened again. My "favorite" way to trigger it is by pausing the file transfer, waiting for a couple of minutes and then resuming it. The transfer's MB/s jumps up like crazy in Windows, then freezes until it gets in sync with the real progress of the transfer. After two tries of pausing and resuming, the crash happened.
@olivierlambert I have used this setup on my own infrastructure and my clients' for at least 4 years. I never experienced this issue until as recently as September this year. You guys saw this issue ~6 months ago. Isn't there a way to backtrack any recent updates to Open vSwitch? I know it might be some updates on the FreeBSD side that made this Open vSwitch bug surface only recently... I know there was little to no development on the WireGuard side of things this year.
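For reference, here is how the 1420 figure relates to the underlying link. This is a sketch of the arithmetic only, assuming a PPPoE WAN on standard 1500-byte Ethernet (the Ethernet size is an assumption, it is not stated in these posts); the constant and function names are made up for the illustration. It does not explain the crash, since the reboot still occurred with the MTU at 1420.

```python
# Per-packet overhead WireGuard adds to each tunnelled packet.
WG_HEADER = 32          # 4 type/reserved + 4 receiver index + 8 counter + 16 auth tag
UDP_HEADER = 8
IP_HEADER = {4: 20, 6: 40}   # outer IP header, by IP version
PPPOE_OVERHEAD = 8      # 6-byte PPPoE header + 2-byte PPP protocol ID


def max_wireguard_mtu(link_mtu, outer_ip_version=4):
    """Largest tunnel MTU that fits in link_mtu without fragmenting the outer packet."""
    return link_mtu - IP_HEADER[outer_ip_version] - UDP_HEADER - WG_HEADER


ETHERNET_MTU = 1500                          # assumed, not taken from the posts
pppoe_mtu = ETHERNET_MTU - PPPOE_OVERHEAD    # 1492

print(max_wireguard_mtu(ETHERNET_MTU, 6))    # 1420, the usual WireGuard default
print(max_wireguard_mtu(pppoe_mtu, 4))       # 1432
print(max_wireguard_mtu(pppoe_mtu, 6))       # 1412
```

Under these assumptions, 1420 fits when the tunnel's outer transport is IPv4 over PPPoE, but it is 8 bytes too large if the outer transport is IPv6.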