Crash when using IPsec VTI interfaces on 4.15 and 4.18.

Bug #1802480 reported by Vincent Bernat
This bug affects 1 person
Affects          Status     Importance  Assigned to  Milestone
linux (Ubuntu)   Confirmed  Medium      Unassigned

Bug Description

Hey!

After upgrading a few VPNs to 4.15.0-38.41 (either Xenial or Bionic), we get random crashes. This also happens with the 4.18 kernel in bionic-proposed. These crashes didn't happen with 4.4 from Xenial. Here is a stack trace:

[ 31.154360] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[ 31.162233] PGD 0 P4D 0
[ 31.164786] Oops: 0000 [#1] SMP PTI
[ 31.168291] CPU: 5 PID: 42 Comm: ksoftirqd/5 Not tainted 4.18.0-11-generic #12~18.04.1-Ubuntu
[ 31.176854] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F, BIOS 1.0b 11/21/2016
[ 31.184980] RIP: 0010:vti_rcv_cb+0xb9/0x1a0 [ip_vti]
[ 31.189962] Code: 8b 44 24 70 0f c8 89 87 b4 00 00 00 48 8b 86 20 05 00 00 8b 80 f8 14 00 00 85 c0 75 05 48 85 d2 74 0e 48 8b 43 58 48 83 e0 fe <f6> 40 38 04 74 7d 44 89 b3 b4 00 00 00 49 8b 44 24 20 48 39 86 20
[ 31.208916] RSP: 0018:ffffbc61832e3920 EFLAGS: 00010246
[ 31.214160] RAX: 0000000000000000 RBX: ffff9a3504964a00 RCX: 0000000000000002
[ 31.221328] RDX: ffff9a351add4080 RSI: ffff9a351aa08000 RDI: ffff9a3504964a00
[ 31.228485] RBP: ffffbc61832e3940 R08: 0000000000000004 R09: ffffffffc0aa612b
[ 31.235643] R10: 0008f09b99881884 R11: 1884bd4e2d6b1fac R12: ffff9a3507b31900
[ 31.242803] R13: ffff9a3507b31000 R14: 0000000000000000 R15: ffff9a3504964a00
[ 31.249964] FS: 0000000000000000(0000) GS:ffff9a35bfd40000(0000) knlGS:0000000000000000
[ 31.258077] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 31.263848] CR2: 0000000000000038 CR3: 000000041a40a003 CR4: 00000000003606e0
[ 31.271004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 31.278163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 31.285320] Call Trace:
[ 31.287789] xfrm4_rcv_cb+0x4a/0x70
[ 31.291297] xfrm_input+0x58f/0x8f0
[ 31.294807] vti_input+0xaa/0x110 [ip_vti]
[ 31.298926] vti_rcv+0x33/0x3c [ip_vti]
[ 31.302783] xfrm4_esp_rcv+0x39/0x50
[ 31.306375] ip_local_deliver_finish+0x62/0x200
[ 31.310923] ip_local_deliver+0xdf/0xf0
[ 31.314775] ? ip_rcv_finish+0x420/0x420
[ 31.318718] ip_rcv_finish+0x126/0x420
[ 31.322486] ip_rcv+0x28f/0x360
[ 31.325655] ? inet_del_offload+0x40/0x40
[ 31.329686] __netif_receive_skb_core+0x48c/0xb70
[ 31.334413] ? kmem_cache_alloc+0xb4/0x1d0
[ 31.338532] ? __build_skb+0x2b/0xf0
[ 31.342128] __netif_receive_skb+0x18/0x60
[ 31.346244] ? __netif_receive_skb+0x18/0x60
[ 31.350536] netif_receive_skb_internal+0x45/0xe0
[ 31.355263] napi_gro_receive+0xc5/0xf0
[ 31.359141] mlx5e_handle_rx_cqe+0x1b2/0x5d0 [mlx5_core]
[ 31.364476] ? skb_release_all+0x24/0x30
[ 31.368430] mlx5e_poll_rx_cq+0xd3/0x990 [mlx5_core]
[ 31.373432] mlx5e_napi_poll+0x9b/0xc60 [mlx5_core]
[ 31.378333] ? __switch_to_asm+0x34/0x70
[ 31.382270] ? __switch_to_asm+0x40/0x70
[ 31.386214] ? __switch_to_asm+0x34/0x70
[ 31.391056] ? __switch_to_asm+0x40/0x70
[ 31.395905] ? __switch_to_asm+0x34/0x70
[ 31.400743] net_rx_action+0x140/0x3a0
[ 31.405379] ? __switch_to+0xad/0x500
[ 31.409887] __do_softirq+0xe4/0x2bb
[ 31.414448] run_ksoftirqd+0x2b/0x40
[ 31.418862] smpboot_thread_fn+0xfc/0x170
[ 31.423700] kthread+0x121/0x140
[ 31.427701] ? sort_range+0x30/0x30
[ 31.432040] ? kthread_create_worker_on_cpu+0x70/0x70
[ 31.437816] ret_from_fork+0x35/0x40
[ 31.442219] Modules linked in: esp6 authenc echainiv xfrm6_mode_tunnel xfrm4_mode_tunnel xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo ip_vti ip_tunnel ip6_vti ip6_tunnel tunnel6 8021q garp mrp stp llc bonding ipt_REJECT nf_reject_ipv4 nfnetlink_log nfnetlink xt_NFLOG xt_hl xt_limit xt_nat xt_TCPMSS xt_HL xt_comment xt_tcpudp xt_multiport xt_conntrack iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_connmark xt_mark iptable_mangle xt_CT nf_conntrack xt_addrtype iptable_raw bpfilter ipmi_ssif gpio_ich intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate intel_rapl_perf input_leds joydev mei_me intel_pch_thermal ioatdma mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_pad mac_hid sch_fq_codel
[ 31.519488] ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx5_ib ib_uverbs ib_core raid1 hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ast pcbc ttm drm_kms_helper aesni_intel syscopyarea aes_x86_64 sysfillrect mxm_wmi crypto_simd sysimgblt cryptd glue_helper fb_sys_fops mlx5_core ixgbe igb mpt3sas drm ahci tls libahci i2c_algo_bit mlxfw raid_class dca devlink mdio scsi_transport_sas wmi
[ 31.578877] CR2: 0000000000000038
[ 31.583249] ---[ end trace c4bada38847a0075 ]---
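For readers who want to check the faulting instruction by hand, here is a minimal Python sketch that decodes the bytes at the `<f6>` marker on the "Code:" line. It only understands the single encoding seen here (opcode F6 /0 with an 8-bit displacement); the helper name `decode_faulting_test` is made up for this illustration. The two instructions just before the marker (`48 8b 43 58`, `48 83 e0 fe`) appear to load a pointer from RBX+0x58 and clear its low bit, which would be consistent with a NULL dst pointer being tested for a flag, but that reading is an interpretation and is not confirmed anywhere in this report.

```python
# Bytes copied from the tail of the "Code:" line; <..> marks the faulting instruction.
CODE = "48 8b 43 58 48 83 e0 fe <f6> 40 38 04 74 7d"
RAX = 0x0000000000000000  # RAX value from the register dump
CR2 = 0x0000000000000038  # faulting address from the register dump

# Register names selected by the 3-bit r/m field when no REX prefix is present.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_faulting_test(code):
    """Decode 'test byte ptr [reg+disp8], imm8' (opcode F6 /0) at the <..> marker.

    Hypothetical helper: it handles only this one encoding and raises
    ValueError for anything else.
    """
    tokens = code.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    opcode, modrm, disp, imm = (
        int(tok.strip("<>"), 16) for tok in tokens[start:start + 4]
    )
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0xF6 or reg != 0 or mod != 0b01 or rm == 0b100:
        raise ValueError("unsupported encoding")
    if disp >= 0x80:  # disp8 is a signed byte
        disp -= 0x100
    return REG64[rm], disp, imm


base, offset, mask = decode_faulting_test(CODE)
print(f"test byte ptr [{base}+{offset:#x}], {mask:#x}")
print("fault address matches:", RAX + offset == CR2)
```

With RAX equal to zero, the decoded displacement of 0x38 is exactly the address reported in CR2, so the crash is a read of a field at offset 0x38 through a NULL pointer.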

Upgrading to mainline 4.18.17 seems to solve the issue. It's difficult to bisect as it doesn't happen often. 4.18.17 contains c473a489d4098969ffafda913e1ad71da31b1104 (xfrm: Fix NULL pointer dereference when skb_dst_force clears the dst_entry), but it doesn't match the stack trace (the stack trace is on the input path, while the patch touches the output and forward paths). There is also fdb06c787b34fd397f28f515105627307d615025 (xfrm: reset transport header back to network header after all input transforms have been applied), which is also in 4.17 and may better match the problem, but I am unsure what it means to have several transformations (we use VTI interfaces, but other than that, we don't do anything fancy).

Hardware is Mellanox ConnectX-4 Lx (no ESP offload).

May I suggest upgrading 4.18 to 4.18.17 and backporting these two patches to Bionic 4.15?

Thanks.

Vincent Bernat (vbernat) wrote :

The title of commit fdb06c787b34fd397f28f515105627307d615025 is "xfrm: reset transport header back to network header after all input transforms have been applied".

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1802480

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: cosmic
Vincent Bernat (vbernat) wrote :

Sorry, currently, all hosts are running a mainline kernel (4.18.17).

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Vincent Bernat (vbernat) wrote :

Never mind, 4.18.17 is not enough to fix the crash. Currently testing with 4.19.1.

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.20 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20-rc2

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
tags: added: kernel-da-key
Vincent Bernat (vbernat) wrote :

This still happens with 4.20-rc2. It is fixed by the patch referenced in this thread:

https://marc.info/?l=linux-netdev&m=154239557300724&w=2

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Jean-Philippe Menil (jpmenil) wrote :

The fix is available in commit:
commit 0152eee6fc3b84298bb6a79961961734e8afa5b8
Author: Steffen Klassert <email address hidden>
Date: Thu Nov 22 07:26:24 2018 +0100

    xfrm: Fix NULL pointer dereference in xfrm_input when skb_dst_force clears the dst_entry.

It would be nice to backport this one.
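
To make the failure mode named in the commit title concrete, here is a small userspace model in Python. It is only a sketch, not kernel code: `Dst`, `Skb`, `dst_force` and `receive` are hypothetical stand-ins, the flag value is illustrative, and the guard (drop the packet when forcing the dst leaves it empty) is an assumption about what the fix amounts to rather than a copy of the patch.

```python
DST_NOPOLICY = 0x0004  # illustrative flag value, assumed for this model


class Dst:
    """Stand-in for a routing entry with a reference count."""

    def __init__(self, flags=0, refcnt=1):
        self.flags = flags
        self.refcnt = refcnt


class Skb:
    """Stand-in for a packet that borrows (does not own) its dst."""

    def __init__(self, dst):
        self.dst = dst
        self.noref = True


def dst_force(skb):
    """Take a real reference on a borrowed dst, or clear it if the dst is already dead."""
    if skb.noref and skb.dst is not None:
        if skb.dst.refcnt > 0:
            skb.dst.refcnt += 1
        else:
            skb.dst = None
    skb.noref = False


def receive(skb, guard):
    """Model of the input path: force the dst, then read its flags."""
    dst_force(skb)
    if guard and skb.dst is None:
        return "drop"
    # Without the guard, this line is the analogue of the NULL dereference.
    return "bypass policy" if skb.dst.flags & DST_NOPOLICY else "check policy"
```

In this model, `receive(Skb(Dst(refcnt=0)), guard=True)` returns `"drop"`, while the same call with `guard=False` raises `AttributeError`, which plays the role of the oops above.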

Kai-Heng Feng (kaihengfeng) wrote :

The commit is already in the upstream stable tree, so it will be included in a future Ubuntu kernel release.

Brad Figg (brad-figg)
tags: added: cscc