Frequent Panic in ip6_expire_frag_queue->icmpv6_send on 4.4.0-184-generic

Bug #1883498 reported by Heikki Hannikainen
This bug affects 4 people
Affects          Status         Importance   Assigned to                 Milestone
linux (Ubuntu)   Incomplete     Undecided    Unassigned
Xenial           Fix Released   High         Kleber Sacilotto de Souza

Bug Description

I upgraded a number of servers last week. Some of them got 4.4.0-179-generic, and the ones upgraded a bit later in the week got 4.4.0-184-generic, which had just been released. The ones with 4.4.0-184-generic started getting stuck. With linux-crashdump installed I obtained the dmesgs and crash dumps. The backtrace appears somewhat similar to #202669, but that one only happened on bare hardware for us, whereas this one is on KVM virtual instances. #202669 panicked in icmpv6_route_lookup, and this one already dies in icmpv6_send.

Between 2020-06-11 and 2020-06-15, on a set of 12 VMs running 4.4.0-184-generic, there were 85 crashes like this, on servers with noticeable IPv6 traffic. All 12 of the VMs with 4.4.0-184-generic crashed at least once. (More than 12 VMs are experiencing this; these are just the ones I had linux-crashdump on.)

[57063.487084] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[57063.487184] IP: [<ffffffff818288ab>] icmp6_send+0x1fb/0x970
[57063.487218] PGD 0
[57063.487231] Oops: 0000 [#1] SMP
[57063.488665] Call Trace:
[57063.488679] <IRQ>
[57063.488705] [<ffffffff81756ee8>] ? __netif_receive_skb+0x18/0x60
[57063.488739] [<ffffffff810c3758>] ? task_tick_fair+0x4c8/0x8e0
[57063.488771] [<ffffffff81868280>] ? _raw_spin_unlock_bh+0x20/0x50
[57063.488802] [<ffffffff81841ed1>] icmpv6_send+0x21/0x30
[57063.488829] [<ffffffff8182fe95>] ip6_expire_frag_queue+0x115/0x1b0
[57063.488862] [<ffffffffc0366260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[57063.488897] [<ffffffffc036627f>] nf_ct_frag6_expire+0x1f/0x30 [nf_defrag_ipv6]
[57063.488937] [<ffffffff810f57c7>] call_timer_fn+0x37/0x140
[57063.488965] [<ffffffffc0366260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[57063.489002] [<ffffffff810f70d4>] run_timer_softirq+0x234/0x330
...
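
For anyone triaging a fleet, the signature above can be matched mechanically. The following is a minimal sketch, not a tool used in this report: the function names `summarize_oops` and `matches_this_bug` are hypothetical, and the regular expressions assume the 4.4-era oops layout shown in this bug (an `IP:` line followed by a `Call Trace:` whose unreliable frames are prefixed with `?`).

```python
import re

# Faulting-instruction line, e.g.
# "[57063.487184] IP: [<ffffffff818288ab>] icmp6_send+0x1fb/0x970"
IP_RE = re.compile(r"\bIP: \[<[0-9a-f]+>\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Call-trace frame; group 1 is set when the frame is marked "?" (unreliable).
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (\? )?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Excerpt of the trace from this report, used as sample input.
SAMPLE = """\
[57063.487084] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[57063.487184] IP: [<ffffffff818288ab>] icmp6_send+0x1fb/0x970
[57063.488665] Call Trace:
[57063.488679] <IRQ>
[57063.488771] [<ffffffff81868280>] ? _raw_spin_unlock_bh+0x20/0x50
[57063.488802] [<ffffffff81841ed1>] icmpv6_send+0x21/0x30
[57063.488829] [<ffffffff8182fe95>] ip6_expire_frag_queue+0x115/0x1b0
"""


def summarize_oops(text):
    """Return (faulting_function, reliable_call_trace_frames) for one oops."""
    faulting = None
    frames = []
    in_trace = False
    for line in text.splitlines():
        if faulting is None:
            m = IP_RE.search(line)
            if m:
                faulting = m.group(1)
                continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            m = FRAME_RE.search(line)
            # Skip frames flagged with "?", which may be stale stack entries.
            if m and not m.group(1):
                frames.append(m.group(2))
    return faulting, frames


def matches_this_bug(text):
    """Heuristic: fault in icmp6_send reached via ip6_expire_frag_queue."""
    faulting, frames = summarize_oops(text)
    return faulting == "icmp6_send" and "ip6_expire_frag_queue" in frames


if __name__ == "__main__":
    print(summarize_oops(SAMPLE))
    print(matches_this_bug(SAMPLE))
```

Feeding each saved dmesg to `matches_this_bug` gives a rough count of how many crashes share this signature; it is a heuristic and will not recognise oops formats from other kernel versions.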

Balint Harmath (bharmath) wrote :

Happened more than once on separate VMs.

Changed in linux (Ubuntu):
status: New → Confirmed
Heikki Hannikainen (hessu) wrote :

Paste error in original report; the related-but-not-quite-the-same bug was here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1824687

Dennis (dvaerum) wrote :

We are having the same bug (I believe) after upgrading from kernel linux-image-4.4.0-178-generic to linux-image-4.4.0-184-generic.

We have around 100 VMs that are affected. For now, we have rolled back to the previous kernel. I am not sure why, but not all VMs are affected. From what I have found, it looks like unbound (DNS server) is triggering the kernel oops in our client's environment.

I can help test a new kernel if that would be useful. I also have a kernel dump from linux-crashdump. I am not currently sure whether I am allowed to share it, but I will try to find out if needed.

### Our kernel crash
[ 128.503474] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 128.503608] IP: [<ffffffff818288ab>] icmp6_send+0x1fb/0x970
[ 128.503673] PGD 80000004275f2067 PUD 427495067 PMD 0
[ 128.503736] Oops: 0000 [#1] SMP
[ 128.503800] Modules linked in: vmw_vsock_vmci_transport vsock zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) vmw_balloon input_leds joydev serio_raw shpchp vmw_vmci i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd vmwgfx ttm drm_kms_helper psmouse syscopyarea sysfillrect vmxnet3 sysimgblt vmw_pvscsi fb_sys_fops pata_acpi drm ahci libahci fjes
[ 128.504798] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 4.4.0-184-generic #214-Ubuntu
[ 128.504990] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
[ 128.505401] task: ffffffff81e13500 ti: ffffffff81e00000 task.ti: ffffffff81e00000
[ 128.505637] RIP: 0010:[<ffffffff818288ab>] [<ffffffff818288ab>] icmp6_send+0x1fb/0x970
[ 128.505892] RSP: 0018:ffff88042d603d00 EFLAGS: 00010246
[ 128.506143] RAX: 0000000000000000 RBX: ffff880423804a00 RCX: 0000000000000020
[ 128.506409] RDX: 0000000000000001 RSI: 0000000000000200 RDI: ffff880427ce1856
[ 128.506686] RBP: ffff88042d603e20 R08: 0000000000000000 R09: ffff880427ce1866
[ 128.506962] R10: 0000000000000080 R11: 0000000000000000 R12: ffff880427ce184e
[ 128.507246] R13: ffffffff81efb6c0 R14: 0000000000000001 R15: 0000000000000003
[ 128.507539] FS: 0000000000000000(0000) GS:ffff88042d600000(0000) knlGS:0000000000000000
[ 128.507842] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 128.508176] CR2: 0000000000000018 CR3: 0000000427782000 CR4: 0000000000360670
[ 128.508530] Stack:
[ 128.508859] 0000000000000001 0000000000000000 0000000000000000 4a7e338b0c959fd7
[ 128.509212] ffff88042b139a38 ffff88042b139a80 000000002b139a20 ffff880427ce1856
[ 128.509577] ffff880400000001 ffffffff00000000 ffff880427ce1866 0000000000000000
[ 128.509945] Call Trace:
[ 128.510314] <IRQ>
[ 128.510324] [<ffffffff81868280>] ? _raw_spin_unlock_bh+0x20/0x50
[ 128.511089] [<ffffffff81841ed1>] icmpv6_send+0x21/0x30
[ 128.511483] [<ffffffff8182fe95>] ip6_expire_frag_queue+0x115/0x1b0
[ 128.511892] [...


NTS Workspace (nts-workspace) wrote :

We saw the same behaviour as Dennis!
Kernel panic after approx. 4 hours on kernel 4.4.0-184.
Unbound is installed as well, version 1.5.8.
Unfortunately I have no crash dump at hand, but our panic was the "same" as Dennis's, with IPv6 messages.

DivaD (d2u) wrote :

I can confirm that we experienced the same problem on one VM after upgrading from 4.4.0-179-generic to 4.4.0-184-generic last weekend. Since the rollback to the last working kernel, this VM has been running stably for over 25 hours.
Unbound version 1.5.8 is also installed and running on this VM.
We don't have a crash dump, but the traceback looks the same:

[ 1963.770497] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 1963.781264] IP: [<ffffffff818288ab>] icmp6_send+0x1fb/0x970
[ 1963.782881] PGD 0
[ 1963.783503] Oops: 0000 [#1] SMP
[ 1963.784479] Modules linked in: binfmt_misc ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables kvm_intel kvm irqbypass input_leds joydev serio_raw i2c_piix4 mac_hid 8250_fintek autofs4 qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse drm pata_acpi floppy
[ 1963.794748] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.4.0-184-generic #214-Ubuntu
[ 1963.796182] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 1963.797696] task: ffff880236330d00 ti: ffff88023633c000 task.ti: ffff88023633c000
[ 1963.799081] RIP: 0010:[<ffffffff818288ab>] [<ffffffff818288ab>] icmp6_send+0x1fb/0x970
[ 1963.800628] RSP: 0018:ffff88023fd03d00 EFLAGS: 00010246
[ 1963.801630] RAX: 0000000000000000 RBX: ffff8800bbad6700 RCX: 0000000000000020
[ 1963.802948] RDX: 0000000000000001 RSI: 0000000000000200 RDI: ffff880232c86a56
[ 1963.804281] RBP: ffff88023fd03e20 R08: 0000000000000000 R09: ffff880232c86a66
[ 1963.805625] R10: 0000000000000080 R11: 0000000000000000 R12: ffff880232c86a4e
[ 1963.806951] R13: ffffffff81efb6c0 R14: 0000000000000001 R15: 0000000000000003
[ 1963.808399] FS: 0000000000000000(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
[ 1963.809910] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1963.810987] CR2: 0000000000000018 CR3: 0000000234d7a000 CR4: 0000000000000670
[ 1963.812324] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1963.813662] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1963.815001] Stack:
[ 1963.815395] 0000000000000001 0000000000000000 0000000000000000 23dc61b86a54e883
[ 1963.816889] ffff8802371b3a98 ffff8802371b3ae0 00000000371b3a80 ffff880232c86a56
[ 1963.818372] ffff880200000001 ffffffff00000000 ffff880232c86a66 0000000000000000
[ 1963.819847] Call Trace:
[ 1963.820317] <IRQ>
[ 1963.820778] [<ffffffffc0158e40>] ? emulator_pio_in_emulated+0x1a0/0x1a0 [kvm]
[ 1963.822192] [<ffffffff810a87bc>] ? notifier_call_chain+0x4c/0x70
[ 1963.823330] [<ffffffff81868280>] ? _raw_spin_unlock_bh+0x20/0x50
[ 1963.824475] [<ffffffff81841ed1>] icmpv6_send+0x21/0x30
[ 1963.825452] [<ffffffff8182fe95>] ip6_expire_frag_queue+0x115/0x1b0
[ 1963.826622] [<ffffffffc024b260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[ 1963.827951] [<ffffffffc024b27f>] nf_ct_frag6_expire+0x1f/0x30 [nf_defrag_ipv6]
[ 1963.829365] [<ffffffff810f57c7>] call_timer_fn+0x37/0x140
[ 1963.830428] [<ff...


Stefan Bader (smb)
Changed in linux (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Changed in linux (Ubuntu Xenial):
assignee: nobody → Kleber Sacilotto de Souza (kleber-souza)
Sultan Alsawaf (kerneltoast) wrote :

It looks like this is fixed upstream with this change: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit?id=178af2f97dcaea27611f0420ec7b61c1a27d6776

That change is already contained in the Ubuntu-4.4.0-185 kernel, so Ubuntu-4.4.0-185 should be fixed.
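
To find which machines still need the newer kernel, the release string can be checked directly. This is a rough sketch under the assumption, drawn only from the reports in this bug, that ABI 184 of the 4.4.0 series is the only affected build; `parse_release` and `is_affected` are hypothetical helper names, and the pattern assumes Ubuntu-style release strings such as `4.4.0-184-generic`.

```python
import platform
import re

# Assumption based solely on this bug's reports: 4.4.0-184 crashes, while
# 4.4.0-179, 4.4.0-185 and 4.4.0-186 were reported as stable.
AFFECTED_VERSION = "4.4.0"
AFFECTED_ABI = 184


def parse_release(release):
    """Split '4.4.0-184-generic' into ('4.4.0', 184, 'generic').

    Returns None when the string does not follow the Ubuntu
    '<version>-<abi>-<flavour>' layout.
    """
    m = re.fullmatch(r"(\d+\.\d+\.\d+)-(\d+)-(\S+)", release)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), m.group(3)


def is_affected(release):
    """True only for the 4.4.0-184 build reported as crashing here."""
    parsed = parse_release(release)
    if parsed is None:
        return False
    version, abi, _flavour = parsed
    return version == AFFECTED_VERSION and abi == AFFECTED_ABI


if __name__ == "__main__":
    running = platform.release()  # same value as `uname -r`
    state = "affected" if is_affected(running) else "not known to be affected"
    print(f"{running}: {state}")
```

Note that this checks only the running kernel; a host that has the fixed package installed but has not yet rebooted will still report 4.4.0-184.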

Joacim Sørheim (blueh) wrote :

I have a server running this kernel (4.4.0-185) right now, with Unbound installed, and it hasn't panicked yet. This server hung seconds after boot with 4.4.0-184, so I believe it's a good candidate.

Dennis (dvaerum) wrote :

@blueh how did you install the 4.4.0-185 kernel? I don't seem to have it as an option.

Joacim Sørheim (blueh) wrote :

@dvaerum I installed it from the xenial-proposed archive, for testing purposes only of course.

Heikki Hannikainen (hessu) wrote :

I can deploy 4.4.0-185 from xenial-proposed for testing on Monday. The fix looks good; I was already staring at https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit?id=5d41ce29e3b91ef305f88d23f72b3359de329cec which came in 4.4.0-184.

Heikki Hannikainen (hessu) wrote :

I deployed 4.4.0-185 from xenial-proposed to 6 VMs on Monday. No crashes yet, so it seems to me that 4.4.0-185 is good and fixes this issue. On 4.4.0-184 these VMs crashed very frequently.

Heikki Hannikainen (hessu) wrote :

4.4.0-185 from xenial-proposed still seems stable. Is there any chance of releasing it to mainline xenial soon? We have other reasons to run upgrades, and deploying proposed packages is a bit of a hassle for a fleet of hundreds.

tags: added: verification-done-xenial
NTS Workspace (nts-workspace) wrote :

I've now upgraded to the latest kernel on a server with unbound installed.
$ uname -a
Linux nsr4-cbn 4.4.0-185-generic #215-Ubuntu SMP Mon Jun 8 21:53:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/os-release
VERSION="16.04.6 LTS (Xenial Xerus)"

It seems 4.4.0-185 is out?!

So far, no kernel panic.

fortin (fortin81) wrote :

I can confirm that since booting on 4.4.0-185 (~24h ago), we have not experienced any panics on our systems, whereas we had 4 panics in less than 24h on 4.4.0-184.

NTS Workspace (nts-workspace) wrote :

Is there any news on this bug report?

Heikki Hannikainen (hessu) wrote :

4.4.0-186 is already released. Both 4.4.0-185 and 4.4.0-186 contain the fix for this issue and work fine for me, no crashes observed.

Changed in linux (Ubuntu Xenial):
status: Confirmed → Fix Released