Ubuntu
linux package

Activity log for bug #1886668

Date	Who	What changed	Old value	New value	Message
2020-07-07 14:16:31	Steve Beattie	bug			added bug
2020-07-07 14:16:58	Steve Beattie	summary	placeholder	linux 4.15.0-109-generic network DoS regression vs -108
2020-07-07 14:20:35	Steve Beattie	description	placeholder	Reported from a user:
Several of our infrastructure VMs recently started crashing (oops attached), after they upgraded to -109. -108 appears to be stable.
Analysing the crash, it appears to be a wild pointer access in a BPF filter, which makes this (probably) a network-traffic triggered crash.
[ 696.396831] general protection fault: 0000 [#1] SMP PTI
[ 696.396843] Modules linked in: iscsi_target_mod target_core_mod ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge nfsv3 cmac arc4 md4 rpcsec_gss_krb5 nfsv4 nls_utf8 cifs nfs aufs ccm fscache binfmt_misc overlay xfs libcrc32c intel_rapl crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd input_leds joydev intel_rapl_perf serio_raw parport_pc parport mac_hid sch_fq_codel nfsd 8021q auth_rpcgss garp nfs_acl mrp lockd stp llc grace xenfs sunrpc xen_privcmd ip_tables x_tables autofs4 hid_generic usbhid hid psmouse i2c_piix4 pata_acpi floppy
[ 696.396966] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.15.0-109-generic #110-Ubuntu
[ 696.396979] Hardware name: Xen HVM domU, BIOS 4.7.6-1.26 12/03/2018
[ 696.396993] RIP: 0010:__cgroup_bpf_run_filter_skb+0xbb/0x1e0
[ 696.397005] RSP: 0018:ffff893fdcb83a70 EFLAGS: 00010292
[ 696.397015] RAX: 6d69546e6f697469 RBX: 0000000000000000 RCX: 0000000000000014
[ 696.397028] RDX: 0000000000000000 RSI: ffff893fd0360000 RDI: ffff893fb5154800
[ 696.397041] RBP: ffff893fdcb83ad0 R08: 0000000000000001 R09: 0000000000000000
[ 696.397058] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000014
[ 696.397075] R13: ffff893fb5154800 R14: 0000000000000020 R15: ffff893fc6ba4d00
[ 696.397091] FS: 0000000000000000(0000) GS:ffff893fdcb80000(0000) knlGS:0000000000000000
[ 696.397107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 696.397119] CR2: 000000c0001b4000 CR3: 00000006dce0a004 CR4: 00000000003606e0
[ 696.397135] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 696.397152] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 696.397169] Call Trace:
[ 696.397175] <IRQ>
[ 696.397183] sk_filter_trim_cap+0xd0/0x1b0
[ 696.397191] tcp_v4_rcv+0x8b7/0xa80
[ 696.397199] ip_local_deliver_finish+0x66/0x210
[ 696.397208] ip_local_deliver+0x7e/0xe0
[ 696.397215] ? ip_rcv_finish+0x430/0x430
[ 696.397223] ip_rcv_finish+0x129/0x430
[ 696.397230] ip_rcv+0x296/0x360
[ 696.397238] ? inet_del_offload+0x40/0x40
[ 696.397249] __netif_receive_skb_core+0x432/0xb80
[ 696.397261] ? skb_send_sock+0x50/0x50
[ 696.397271] ? tcp4_gro_receive+0x137/0x1a0
[ 696.397280] __netif_receive_skb+0x18/0x60
[ 696.397290] ? __netif_receive_skb+0x18/0x60
[ 696.397300] netif_receive_skb_internal+0x45/0xe0
[ 696.397309] napi_gro_receive+0xc5/0xf0
[ 696.397317] xennet_poll+0x9ca/0xbc0
[ 696.397325] net_rx_action+0x140/0x3a0
[ 696.397334] __do_softirq+0xe4/0x2d4
[ 696.397344] irq_exit+0xc5/0xd0
[ 696.397352] xen_evtchn_do_upcall+0x30/0x50
[ 696.397361] xen_hvm_callback_vector+0x90/0xa0
[ 696.397371] </IRQ>
[ 696.397378] RIP: 0010:native_safe_halt+0x12/0x20
[ 696.397390] RSP: 0018:ffff94c4862cbe80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
[ 696.397405] RAX: ffffffff8efc1800 RBX: 0000000000000006 RCX: 0000000000000000
[ 696.397419] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 696.397435] RBP: ffff94c4862cbe80 R08: 0000000000000002 R09: 0000000000000001
[ 696.397449] R10: 0000000000100000 R11: 0000000000000397 R12: 0000000000000006
[ 696.397462] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 696.397479] ? __sched_text_end+0x1/0x1
[ 696.397489] default_idle+0x20/0x100
[ 696.397499] arch_cpu_idle+0x15/0x20
[ 696.397507] default_idle_call+0x23/0x30
[ 696.397515] do_idle+0x172/0x1f0
[ 696.397522] cpu_startup_entry+0x73/0x80
[ 696.397530] start_secondary+0x1ab/0x200
[ 696.397538] secondary_startup_64+0xa5/0xb0
[ 696.397545] Code: 89 5d b0 49 29 cc 45 01 a7 80 00 00 00 44 89 e1 48 29 c8 48 89 4d a8 49 89 87 d8 00 00 00 89 d2 48 8d 84 d6 38 03 00 00 48 8b 00 <4c> 8b 70 10 4c 8d 68 10 4d 85 f6 0f 84 f6 00 00 00 49 8d 47 30
[ 696.397584] RIP: __cgroup_bpf_run_filter_skb+0xbb/0x1e0 RSP: ffff893fdcb83a70
[ 696.397607] ---[ end trace ec5c84424d511a6f ]---
[ 696.397616] Kernel panic - not syncing: Fatal exception in interrupt
[ 696.397876] Kernel Offset: 0xd600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
We've correlated some of the other crashes, and the ASCII was a bit of a red herring. All the others are a NULL pointer dereference in the same place, so the problem is likely an OoB memory read (possibly use-after-free) of a piece of memory which is usually zero, but not always.
It is actually the control VMs for our test farms which were impacted, one of which was reliably crashing every 5 minutes or so, and others at more sporadic intervals of up to about a day. In all cases, reverting to the -108 kernel has resolved the crashes.
Unfortunately, attempts to repro this off our production environment with a packet trace aren't going quite so well. We're still experimenting.
2020-07-07 14:29:22	Steve Beattie	bug			added subscriber Thadeu Lima de Souza Cascardo
2020-07-07 14:29:33	Steve Beattie	bug			added subscriber Stefan Bader
2020-07-07 14:29:43	Steve Beattie	bug			added subscriber Terry Rudd
2020-07-07 14:29:54	Steve Beattie	bug			added subscriber Andy Whitcroft
2020-07-07 17:52:04	Thadeu Lima de Souza Cascardo	linux (Ubuntu): assignee		Thadeu Lima de Souza Cascardo (cascardo)
2020-07-08 09:21:05	Thadeu Lima de Souza Cascardo	bug watch added		https://bugzilla.kernel.org/show_bug.cgi?id=208003
2020-07-08 15:22:44	Steve Beattie	information type	Private Security	Public Security
2020-07-08 15:30:08	Ubuntu Kernel Bot	linux (Ubuntu): status	New	Incomplete
2020-07-08 15:30:09	Ubuntu Kernel Bot	tags		bionic
2020-07-08 18:01:25	Thadeu Lima de Souza Cascardo	attachment added		0001-UBUNTU-SAUCE-Revert-netprio_cgroup-Fix-unlimited-mem.patch https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1886668/+attachment/5390827/+files/0001-UBUNTU-SAUCE-Revert-netprio_cgroup-Fix-unlimited-mem.patch
2020-07-08 18:01:35	Thadeu Lima de Souza Cascardo	nominated for series		Ubuntu Bionic
2020-07-08 18:01:35	Thadeu Lima de Souza Cascardo	bug task added		linux (Ubuntu Bionic)
2020-07-08 18:01:49	Thadeu Lima de Souza Cascardo	linux (Ubuntu): status	Incomplete	Invalid
2020-07-08 18:01:56	Thadeu Lima de Souza Cascardo	linux (Ubuntu Bionic): status	New	In Progress
2020-07-08 18:01:59	Thadeu Lima de Souza Cascardo	linux (Ubuntu Bionic): assignee		Thadeu Lima de Souza Cascardo (cascardo)
2020-07-08 18:02:03	Thadeu Lima de Souza Cascardo	linux (Ubuntu Bionic): importance	Undecided	Critical
2020-07-08 18:07:31	Thadeu Lima de Souza Cascardo	description	(original report and oops, identical to the new value of the 2020-07-07 14:20:35 entry)	[Impact] On systems using cgroups and sockets extensively, such as Docker, Kubernetes, LXD, and libvirt, a crash might happen when using linux 4.15.0-109-generic. [Fix] Revert the patch that disables sk_alloc cgroup refcounting when tasks are added to a net_prio cgroup. [Test case] Test that environments where the issue is reproduced survive some hours of uptime. See attached test case that reproduces a different but possibly related issue. [Regression potential] The reverted commit fixes a memory leak in similar scenarios, but a leak is better than a crash. Two other bugs have been opened to track a real fix for this issue and the leak. ---------------------------------------------------------- (original report and oops follow the separator, identical to the 2020-07-07 14:20:35 entry)
2020-07-08 20:49:43	Thadeu Lima de Souza Cascardo	description	[Impact] On systems using cgroups and sockets extensively, such as Docker, Kubernetes, LXD, and libvirt, a crash might happen when using linux 4.15.0-109-generic. [Fix] Revert the patch that disables sk_alloc cgroup refcounting when tasks are added to a net_prio cgroup. [Test case] Test that environments where the issue is reproduced survive some hours of uptime. See attached test case that reproduces a different but possibly related issue. [Regression potential] The reverted commit fixes a memory leak in similar scenarios, but a leak is better than a crash. Two other bugs have been opened to track a real fix for this issue and the leak. ---------------------------------------------------------- (original report and oops follow the separator, identical to the 2020-07-07 14:20:35 entry)	[Impact] On systems using cgroups and sockets extensively, such as Docker, Kubernetes, LXD, and libvirt, a crash might happen when using linux 4.15.0-109-generic. [Fix] Revert the patch that disables sk_alloc cgroup refcounting when tasks are added to a net_prio cgroup. [Test case] Test that environments where the issue is reproduced survive some hours of uptime. A different bug was reproduced with work-in-progress code and was not reproduced with the culprit reverted. [Regression potential] The reverted commit fixes a memory leak in similar scenarios, but a leak is better than a crash. Two other bugs have been opened to track a real fix for this issue and the leak. ---------------------------------------------------------- (original report and oops follow the separator, identical to the 2020-07-07 14:20:35 entry)
2020-07-08 23:14:43	Khaled El Mously	linux (Ubuntu Bionic): status	In Progress	Fix Committed
2020-07-09 19:42:18	Thadeu Lima de Souza Cascardo	nominated for series		Ubuntu Groovy
2020-07-09 19:42:18	Thadeu Lima de Souza Cascardo	bug task added		linux (Ubuntu Groovy)
2020-07-09 19:42:18	Thadeu Lima de Souza Cascardo	nominated for series		Ubuntu Focal
2020-07-09 19:42:18	Thadeu Lima de Souza Cascardo	bug task added		linux (Ubuntu Focal)
2020-07-09 19:42:18	Thadeu Lima de Souza Cascardo	nominated for series		Ubuntu Eoan
2020-07-09 19:42:18	Thadeu Lima de Souza Cascardo	bug task added		linux (Ubuntu Eoan)
2020-07-09 19:42:36	Thadeu Lima de Souza Cascardo	linux (Ubuntu Groovy): status	Invalid	In Progress
2020-07-09 19:42:39	Thadeu Lima de Souza Cascardo	linux (Ubuntu Focal): status	New	In Progress
2020-07-09 19:42:42	Thadeu Lima de Souza Cascardo	linux (Ubuntu Eoan): status	New	In Progress
2020-07-09 20:27:24	Ubuntu Foundations Team Bug Bot	tags	bionic	bionic patch
2020-07-09 23:41:10	Khaled El Mously	linux (Ubuntu Eoan): status	In Progress	Fix Committed
2020-07-09 23:41:12	Khaled El Mously	linux (Ubuntu Focal): status	In Progress	Fix Committed
2020-07-10 06:44:06	Sean Groarke	bug			added subscriber Sean Groarke
2020-07-10 20:47:28	Ubuntu Kernel Bot	tags	bionic patch	bionic patch verification-needed-bionic
2020-07-11 19:32:09	Peter Stevenson	bug			added subscriber Peter Stevenson
2020-07-11 21:22:17	Ubuntu Kernel Bot	tags	bionic patch verification-needed-bionic	bionic patch verification-needed-bionic verification-needed-focal
2020-07-12 09:47:12	Ubuntu Kernel Bot	tags	bionic patch verification-needed-bionic verification-needed-focal	bionic patch verification-needed-bionic verification-needed-eoan verification-needed-focal
2020-07-12 12:07:08	Tom Barron	bug			added subscriber Tom Barron
2020-07-12 19:21:21	Dr. Jens Harbott	bug			added subscriber Dr. Jens Harbott
2020-07-13 12:14:27	Kleber Sacilotto de Souza	tags	bionic patch verification-needed-bionic verification-needed-eoan verification-needed-focal	bionic patch verification-done-bionic verification-needed-eoan verification-needed-focal
2020-07-13 15:20:43	Launchpad Janitor	linux (Ubuntu Bionic): status	Fix Committed	Fix Released
2020-07-18 06:41:50	Julian Edwards	bug			added subscriber Julian Edwards
2020-07-20 16:18:05	Launchpad Janitor	linux (Ubuntu Focal): status	Fix Committed	Fix Released
2020-07-20 16:18:05	Launchpad Janitor	cve linked		2019-16089
2020-07-20 16:18:05	Launchpad Janitor	cve linked		2019-19642
2020-07-20 16:18:05	Launchpad Janitor	cve linked		2020-11935
2020-07-23 08:53:03	Janåke Rönnblom	bug			added subscriber Janåke Rönnblom
2020-07-27 15:04:04	Launchpad Janitor	linux (Ubuntu Eoan): status	Fix Committed	Fix Released
2020-07-27 15:04:04	Launchpad Janitor	cve linked		2020-10757
2020-07-28 00:57:39	Launchpad Janitor	linux (Ubuntu Groovy): status	In Progress	Fix Released
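
The report in the 2020-07-07 14:20:35 entry calls the ASCII in the oops a red herring. The following is a minimal Python sketch of how one might check that observation: it reads the faulting RAX value as little-endian bytes and tests whether it is a canonical x86-64 address. The helper names are hypothetical, and the canonical check assumes 48-bit virtual addresses (4-level paging), which the log does not state.

```python
def decode_register_ascii(value: int, width: int = 8) -> str:
    """Render a register value as little-endian bytes, '.' for non-printables."""
    raw = value.to_bytes(width, "little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


def is_canonical(addr: int) -> bool:
    """Assumes 48-bit virtual addresses: bits 63..47 must all be equal."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


# RAX and RSP as printed in the faulting register dump of the oops.
RAX = 0x6D69546E6F697469
RSP = 0xFFFF893FDCB83A70

print(decode_register_ascii(RAX))  # itionTim
print(is_canonical(RAX))           # False
print(is_canonical(RSP))           # True
```

A non-canonical RAX would be consistent with the oops reporting a general protection fault rather than a page fault, though that reading is an inference and not something the log confirms.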

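The Code: line of the oops marks the first byte of the faulting instruction with angle brackets. The sketch below, using a hypothetical helper name, locates that marker and returns the bytes from that point on. It only splits the byte string and does not disassemble anything.

```python
import re
from typing import Tuple

# Code: line copied from the oops in the 2020-07-07 14:20:35 entry.
CODE_LINE = (
    "89 5d b0 49 29 cc 45 01 a7 80 00 00 00 44 89 e1 48 29 c8 48 89 4d a8 "
    "49 89 87 d8 00 00 00 89 d2 48 8d 84 d6 38 03 00 00 48 8b 00 <4c> 8b 70 "
    "10 4c 8d 68 10 4d 85 f6 0f 84 f6 00 00 00 49 8d 47 30"
)


def split_code_line(code_line: str) -> Tuple[int, bytes]:
    """Return (index of the <xx> byte, bytes from that byte onward)."""
    tokens = code_line.split()
    for offset, token in enumerate(tokens):
        match = re.fullmatch(r"<([0-9a-fA-F]{2})>", token)
        if match:
            tail = [match.group(1)] + tokens[offset + 1:]
            return offset, bytes(int(t, 16) for t in tail)
    raise ValueError("no <xx> marker found in Code: line")


offset, tail = split_code_line(CODE_LINE)
print(offset)                                   # bytes before the faulting instruction
print(" ".join(f"{b:02x}" for b in tail[:4]))   # first four bytes at the faulting RIP
```

Decoded by hand, the bytes `4c 8b 70 10` read as `mov 0x10(%rax),%r14`, a load through RAX, which fits the wild RAX value in the register dump. Treat that as a manual reading of the bytes, not as disassembler output.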