Kernel panics on Xenial when using cgroups and strict CFS limits

Bug #1687512 reported by Daniel Axtens on 2017-05-02
This bug affects 2 people
Affects                 Status        Importance  Assigned to
linux (Ubuntu)          Fix Released  High        Daniel Axtens
linux (Ubuntu Xenial)   Fix Released  High        Unassigned

Bug Description

SRU Justification
-----------------

[Impact]
Apache Mesos and Kubernetes workloads on Xenial cause a panic
(NULL pointer dereference) in the completely fair scheduler.

These panics are in pick_next_entity and include pick_next_task_fair
in the call stack.

[Fix]
Cherry-picking both
754bd598be9bbc953bc709a9e8ed7f3188bfb9d7
(http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz)
and
094f469172e00d6ab0a3130b0e01c83b3cf3a98d
(http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz)
fix the crash.
They appear to be intended as a series - they were posted to LKML at
the same time.

[Testcase]
The fix has been validated by the user who reported the bug.

Bug description
---------------

We see a number of kernel panics on servers running Apache Mesos, which uses cgroups with small (0.1-0.2) CPU limits.
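
For reference, a fractional CPU limit of this kind is normally enforced through the CFS bandwidth controller as a run-time quota per period. The sketch below is illustrative only: it assumes the default 100 ms period (cpu.cfs_period_us = 100000), and the helper name cfs_quota_us is hypothetical, not taken from Mesos or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Default CFS bandwidth period in microseconds (100 ms). */
#define CFS_DEFAULT_PERIOD_US 100000u

/*
 * Hypothetical helper: convert a CPU limit expressed in milli-CPUs
 * (100 == 0.1 CPU) into the run time allowed per period, in microseconds.
 * Integer arithmetic avoids floating-point rounding.
 */
uint64_t cfs_quota_us(uint32_t milli_cpus, uint32_t period_us)
{
    assert(period_us != 0); /* a zero-length period is not a valid setting */
    return (uint64_t)milli_cpus * period_us / 1000u;
}
```

Under these assumptions a 0.1 CPU limit allows 10000 us of run time in each 100000 us period, and 0.2 CPU allows 20000 us. A group that exhausts its quota is throttled until the next period begins.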

The panics all appear as NULL pointer dereferences in and around pick_next_entity and pick_next_task_fair, for example:

[24334.493331] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[24334.501611] IP: [<ffffffff810b2f0f>] pick_next_entity+0x7f/0x160
[24334.507868] PGD 3eacfa067 PUD 3eacfb067 PMD 0
[24334.512806] Oops: 0000 [#1] SMP
[24334.516420] Modules linked in: ipvlan xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs tcp_diag inet_diag nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache dm_crypt ppdev input_leds mac_hid i2c_piix4 8250_fintek parport_pc pvpanic parport serio_raw crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse virtio_scsi
[24334.576359] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.4.0-66-generic #87~14.04.1-Ubuntu
[24334.584748] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[24334.594188] task: ffff8803ee671c00 ti: ffff8803ee67c000 task.ti: ffff8803ee67c000
[24334.601799] RIP: 0010:[<ffffffff810b2f0f>] [<ffffffff810b2f0f>] pick_next_entity+0x7f/0x160
[24334.610490] RSP: 0018:ffff8803ee67fdd8 EFLAGS: 00010086
[24334.615924] RAX: ffff8803ebed4c00 RBX: ffff880036529800 RCX: 0000000000000000
[24334.623190] RDX: 000000000225341f RSI: 0000000000000000 RDI: 0000000000000000
[24334.630479] RBP: ffff8803ee67fe00 R08: 0000000000000004 R09: 0000000000000000
[24334.637758] R10: ffff8803e7ed7600 R11: 0000000000000001 R12: 0000000000000000
[24334.645153] R13: 0000000000000000 R14: 00000009067729c4 R15: ffff8803ee672178
[24334.652512] FS: 0000000000000000(0000) GS:ffff8803ffd00000(0000) knlGS:0000000000000000
[24334.660721] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[24334.666587] CR2: 0000000000000050 CR3: 00000003eacf9000 CR4: 00000000001406e0
[24334.673851] Stack:
[24334.675980] ffff8803ffd16e00 ffff8803ffd16e00 ffff8803e855a200 ffff880036529800
[24334.683995] 0000000000000002 ffff8803ee67fe68 ffffffff810b98a6 ffff8803ffd16e70
[24334.692024] 0000000000016e00 ffff8803e7ed7600 ffff8803ee671c00 0000000000000000
[24334.700172] Call Trace:
[24334.702750] [<ffffffff810b98a6>] pick_next_task_fair+0x66/0x4b0
[24334.708886] [<ffffffff818043c4>] __schedule+0x7f4/0x980
[24334.714349] [<ffffffff81804585>] schedule+0x35/0x80
[24334.719445] [<ffffffff8180481e>] schedule_preempt_disabled+0xe/0x10
[24334.725962] [<ffffffff810bf9fa>] cpu_startup_entry+0x18a/0x350
[24334.732012] [<ffffffff8104f3d9>] start_secondary+0x149/0x170
[24334.737895] Code: 8b 70 50 4d 2b 74 24 50 4d 85 f6 7e 59 4c 89 e7 e8 67 ff ff ff 49 39 c6 7f 04 4c 8b 6b 48 48 8b 43 40 48 85 c0 74 1f 4c 8b 70 50 <4d> 2b 74 24 50 4d 85 f6 7e 2c 4c 89 e7 e8 3f ff ff ff 49 39 c6
[24334.765124] RIP [<ffffffff810b2f0f>] pick_next_entity+0x7f/0x160
[24334.771473] RSP <ffff8803ee67fdd8>
[24334.775077] CR2: 0000000000000050
[24334.779121] ---[ end trace 05d941efb97b7bae ]---

and

[155852.028575] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[155852.036931] IP: [<ffffffff810b2f0f>] pick_next_entity+0x7f/0x160
[155852.043491] PGD 3ebae8067 PUD 3ebae9067 PMD 0
[155852.048550] Oops: 0000 [#1] SMP
[155852.052437] Modules linked in: ipvlan veth xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache dm_crypt ppdev input_leds mac_hid i2c_piix4 parport_pc 8250_fintek pvpanic parport serio_raw crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse virtio_scsi
[155852.109847] CPU: 1 PID: 2215 Comm: ruby Not tainted 4.4.0-66-generic #87~14.04.1-Ubuntu
[155852.118233] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[155852.127661] task: ffff8803ed29aa00 ti: ffff8800bbb10000 task.ti: ffff8800bbb10000
[155852.135347] RIP: 0010:[<ffffffff810b2f0f>] [<ffffffff810b2f0f>] pick_next_entity+0x7f/0x160
[155852.144120] RSP: 0018:ffff8800bbb13ce0 EFLAGS: 00010086
[155852.149631] RAX: ffff8801725b5c00 RBX: ffff8800bb777600 RCX: ffff8800bb777400
[155852.156970] RDX: ffff8803ffc96e70 RSI: 0000000000000000 RDI: 0000000000000000
[155852.164384] RBP: ffff8800bbb13d08 R08: ffff8803eb92e800 R09: ffff8803ed29aa00
[155852.171718] R10: 0000000000000001 R11: 00000000000003cb R12: 0000000000000000
[155852.179052] R13: 0000000000000000 R14: 000009ad6846ff10 R15: 0000000000000001
[155852.186387] FS: 00007f387d1c9700(0000) GS:ffff8803ffc80000(0000) knlGS:0000000000000000
[155852.194677] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[155852.200626] CR2: 0000000000000050 CR3: 00000003eb706000 CR4: 00000000001406e0
[155852.207967] Stack:
[155852.210180] ffffffff810369c9 ffff8803ffc96e00 ffff8800bb777600 0000000000000000
[155852.218278] 00000000000012a4 ffff8800bbb13d70 ffffffff810b9b65 ffff8803ffc96e70
[155852.226402] 0000000000016e00 00008dbf20ccb260 ffff8803ed29aa00 0000000000000001
[155852.234506] Call Trace:
[155852.237156] [<ffffffff810369c9>] ? sched_clock+0x9/0x10
[155852.242673] [<ffffffff810b9b65>] pick_next_task_fair+0x325/0x4b0
[155852.248968] [<ffffffff81803cd9>] __schedule+0x109/0x980
[155852.254491] [<ffffffff81804585>] schedule+0x35/0x80
[155852.259667] [<ffffffff8180727c>] schedule_hrtimeout_range_clock+0xac/0x130
[155852.266838] [<ffffffff810e9fb0>] ? hrtimer_init+0x180/0x180
[155852.272712] [<ffffffff81807270>] ? schedule_hrtimeout_range_clock+0xa0/0x130
[155852.280052] [<ffffffff81807313>] schedule_hrtimeout_range+0x13/0x20
[155852.288558] [<ffffffff812479b9>] ep_poll+0x249/0x310
[155852.293817] [<ffffffff810a8c30>] ? wake_up_q+0x80/0x80
[155852.299271] [<ffffffff81248efc>] SyS_epoll_wait+0xbc/0xe0
[155852.304967] [<ffffffff81807df6>] entry_SYSCALL_64_fastpath+0x16/0x75
[155852.311618] Code: 8b 70 50 4d 2b 74 24 50 4d 85 f6 7e 59 4c 89 e7 e8 67 ff ff ff 49 39 c6 7f 04 4c 8b 6b 48 48 8b 43 40 48 85 c0 74 1f 4c 8b 70 50 <4d> 2b 74 24 50 4d 85 f6 7e 2c 4c 89 e7 e8 3f ff ff ff 49 39 c6
[155852.338852] RIP [<ffffffff810b2f0f>] pick_next_entity+0x7f/0x160
[155852.345270] RSP <ffff8800bbb13ce0>
[155852.348958] CR2: 0000000000000050
[155852.353086] ---[ end trace 8ce693b2314611c4 ]---

Similar issues have been reported in the community for kernels based on 4.4: https://github.com/kubernetes/kops/issues/874

These panics occur in the CFS code when a next buddy is set on an entity that is not on a run-queue. pick_next_entity then ends up with curr == left == NULL and calls wakeup_preempt_entity() with a valid next buddy and a NULL left. wakeup_preempt_entity() dereferences left, which causes the panic.
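
The failure path described above can be modelled in user space as follows. This is a simplified sketch, not the kernel source: the struct layouts, the names pick_entity, pick_rq and pick_next_entity_model, and the GRAN_MODEL constant are all assumptions made for illustration, and the model reports the bad state instead of crashing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel structures; not the real layouts. */
struct pick_entity {
    int64_t vruntime;
};

struct pick_rq {
    struct pick_entity *leftmost; /* first entity in the tree, NULL if empty */
    struct pick_entity *next;     /* "next buddy" hint, may be stale */
};

enum pick_status {
    PICK_OK,         /* *out holds the chosen entity */
    PICK_EMPTY,      /* nothing to run and no buddy set */
    PICK_WOULD_OOPS  /* the kernel would dereference a NULL left here */
};

/* Assumed stand-in for the wakeup granularity, in vruntime units. */
#define GRAN_MODEL 1000

enum pick_status pick_next_entity_model(const struct pick_rq *rq,
                                        struct pick_entity *curr,
                                        struct pick_entity **out)
{
    struct pick_entity *left;

    assert(rq != NULL && out != NULL);

    left = rq->leftmost;
    /* Fall back to curr if the tree is empty or curr sorts before left. */
    if (!left || (curr && curr->vruntime < left->vruntime))
        left = curr;

    *out = left;

    if (rq->next) {
        /* The kernel compares next against left without a NULL check. */
        if (!left)
            return PICK_WOULD_OOPS;
        if (rq->next->vruntime - left->vruntime <= GRAN_MODEL)
            *out = rq->next;
    }

    return *out ? PICK_OK : PICK_EMPTY;
}
```

Calling the model with an empty tree, a NULL curr and a non-NULL next buddy returns PICK_WOULD_OOPS, which corresponds to the state seen in the oopses above.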

This was confirmed by placing a WARN_ON_ONCE in set_next_buddy to catch when a sched_entity in the hierarchy was not on_rq, as per https://marc.info/?l=linux-kernel&m=146651668921468&w=2
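
The instrumentation can be pictured with the following user-space sketch. It is not the debugging patch itself (see the linked message for that): the struct and function names are hypothetical, and the real kernel code walks the hierarchy with for_each_sched_entity() and cfs_rq_of() rather than explicit pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct buddy_rq;

struct buddy_entity {
    bool on_rq;                   /* is this entity currently queued? */
    struct buddy_entity *parent;  /* enclosing group entity, NULL at the top */
    struct buddy_rq *rq;          /* run-queue this entity belongs to */
};

struct buddy_rq {
    struct buddy_entity *next;    /* "next buddy" hint */
};

/*
 * Model of set_next_buddy() with the extra check: walk from the task's
 * entity up through its parents, mark each one as next buddy, and count
 * the entities that were not queued. A non-zero result corresponds to
 * the WARN firing.
 */
int set_next_buddy_checked(struct buddy_entity *se)
{
    int not_queued = 0;

    for (; se != NULL; se = se->parent) {
        assert(se->rq != NULL);
        if (!se->on_rq)
            not_queued++;
        se->rq->next = se;
    }
    return not_queued;
}
```

In this model a queued task inside a group entity that is not queued (for example, because the group is throttled) yields a count of 1 while still leaving the group recorded as next buddy on its parent's run-queue.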

The stack-trace for the WARN is quite involved:

Apr 25 14:14:48 (none) kernel: [ 5339.764597] ------------[ cut here ]------------
Apr 25 14:14:48 (none) kernel: [ 5339.764606] WARNING: CPU: 1 PID: 13121 at /build/linux-PwPelj/linux-4.4.0/kernel/sched/fair.c:5170 set_next_buddy+0x55/0x70()
Apr 25 14:14:48 (none) kernel: [ 5339.764608] Modules linked in: xt_nat xt_tcpudp ipvlan ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs nfsd auth_rpcgss nfs_acl nfs dm_crypt lockd grace sunrpc fscache ppdev input_leds serio_raw parport_pc 8250_fintek parport pvpanic mac_hid i2c_piix4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse virtio_scsi
Apr 25 14:14:48 (none) kernel: [ 5339.764644] CPU: 1 PID: 13121 Comm: executor Not tainted 4.4.0-72-generic #93+hf135461v20170420b2-Ubuntu
Apr 25 14:14:48 (none) kernel: [ 5339.764646] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Apr 25 14:14:48 (none) kernel: [ 5339.764647] 0000000000000086 00000000d5fbe9e0 ffff8803ed947608 ffffffff813f83c3
Apr 25 14:14:48 (none) kernel: [ 5339.764650] 0000000000000000 ffffffff81cbae20 ffff8803ed947640 ffffffff81081302
Apr 25 14:14:48 (none) kernel: [ 5339.764652] ffff8800bb5fc800 ffff8803e7c9f000 0000000000000008 ffff8800ba1bd400
Apr 25 14:14:48 (none) kernel: [ 5339.764655] Call Trace:
Apr 25 14:14:48 (none) kernel: [ 5339.764665] [<ffffffff813f83c3>] dump_stack+0x63/0x90
Apr 25 14:14:48 (none) kernel: [ 5339.764669] [<ffffffff81081302>] warn_slowpath_common+0x82/0xc0
Apr 25 14:14:48 (none) kernel: [ 5339.764672] [<ffffffff8108144a>] warn_slowpath_null+0x1a/0x20
Apr 25 14:14:48 (none) kernel: [ 5339.764674] [<ffffffff810b52b5>] set_next_buddy+0x55/0x70
Apr 25 14:14:48 (none) kernel: [ 5339.764676] [<ffffffff810b59a4>] check_preempt_wakeup+0x244/0x250
Apr 25 14:14:48 (none) kernel: [ 5339.764679] [<ffffffff810ab580>] check_preempt_curr+0x80/0x90
Apr 25 14:14:48 (none) kernel: [ 5339.764682] [<ffffffff810b42eb>] attach_task+0x4b/0x60
Apr 25 14:14:48 (none) kernel: [ 5339.764685] [<ffffffff810be067>] load_balance+0x5b7/0x980
Apr 25 14:14:48 (none) kernel: [ 5339.764688] [<ffffffff810be6e1>] pick_next_task_fair+0x2b1/0x4f0
Apr 25 14:14:48 (none) kernel: [ 5339.764692] [<ffffffff81837c5f>] __schedule+0x15f/0xa30
Apr 25 14:14:48 (none) kernel: [ 5339.764694] [<ffffffff81838565>] schedule+0x35/0x80
Apr 25 14:14:48 (none) kernel: [ 5339.764697] [<ffffffff8183ba85>] schedule_hrtimeout_range_clock+0xc5/0x1b0
Apr 25 14:14:48 (none) kernel: [ 5339.764700] [<ffffffff810ef880>] ? __hrtimer_init+0x90/0x90
Apr 25 14:14:48 (none) kernel: [ 5339.764703] [<ffffffff8183ba79>] ? schedule_hrtimeout_range_clock+0xb9/0x1b0
Apr 25 14:14:48 (none) kernel: [ 5339.764705] [<ffffffff8183bb83>] schedule_hrtimeout_range+0x13/0x20
Apr 25 14:14:48 (none) kernel: [ 5339.764709] [<ffffffff81223914>] poll_schedule_timeout+0x44/0x70
Apr 25 14:14:48 (none) kernel: [ 5339.764711] [<ffffffff81224407>] do_select+0x727/0x810
Apr 25 14:14:48 (none) kernel: [ 5339.764715] [<ffffffff811fb932>] ? page_counter_uncharge+0x22/0x40
Apr 25 14:14:48 (none) kernel: [ 5339.764718] [<ffffffff811fdb1c>] ? drain_stock.isra.33+0x6c/0xa0
Apr 25 14:14:48 (none) kernel: [ 5339.764720] [<ffffffff810b5349>] ? update_curr+0x79/0x160
Apr 25 14:14:48 (none) kernel: [ 5339.764722] [<ffffffff810b550c>] ? update_cfs_shares+0xbc/0x100
Apr 25 14:14:48 (none) kernel: [ 5339.764724] [<ffffffff810b742b>] ? dequeue_entity+0x41b/0xa80
Apr 25 14:14:48 (none) kernel: [ 5339.764729] [<ffffffff810719f7>] ? gup_pud_range+0x127/0x220
Apr 25 14:14:48 (none) kernel: [ 5339.764731] [<ffffffff810baa9c>] ? set_next_entity+0x9c/0xb0
Apr 25 14:14:48 (none) kernel: [ 5339.764736] [<ffffffff8102d66c>] ? __switch_to+0x1dc/0x5c0
Apr 25 14:14:48 (none) kernel: [ 5339.764740] [<ffffffff81401304>] ? timerqueue_del+0x24/0x70
Apr 25 14:14:48 (none) kernel: [ 5339.764742] [<ffffffff810efa3c>] ? __remove_hrtimer+0x3c/0x90
Apr 25 14:14:48 (none) kernel: [ 5339.764744] [<ffffffff810efb61>] ? hrtimer_try_to_cancel+0xd1/0x130
Apr 25 14:14:48 (none) kernel: [ 5339.764746] [<ffffffff810efbd9>] ? hrtimer_cancel+0x19/0x20
Apr 25 14:14:48 (none) kernel: [ 5339.764751] [<ffffffff81101166>] ? futex_wait+0x206/0x280
Apr 25 14:14:48 (none) kernel: [ 5339.764753] [<ffffffff810ab5a9>] ? ttwu_do_wakeup+0x19/0xe0
Apr 25 14:14:48 (none) kernel: [ 5339.764756] [<ffffffff812246bf>] core_sys_select+0x1cf/0x2f0
Apr 25 14:14:48 (none) kernel: [ 5339.764758] [<ffffffff810ef880>] ? __hrtimer_init+0x90/0x90
Apr 25 14:14:48 (none) kernel: [ 5339.764762] [<ffffffff81128447>] ? audit_filter_rules+0x217/0xe30
Apr 25 14:14:48 (none) kernel: [ 5339.764764] [<ffffffff81103860>] ? do_futex+0x120/0x540
Apr 25 14:14:48 (none) kernel: [ 5339.764768] [<ffffffff8106428e>] ? kvm_clock_get_cycles+0x1e/0x20
Apr 25 14:14:48 (none) kernel: [ 5339.764772] [<ffffffff810f53aa>] ? ktime_get_ts64+0x4a/0xf0
Apr 25 14:14:48 (none) kernel: [ 5339.764774] [<ffffffff8122489a>] SyS_select+0xba/0x110
Apr 25 14:14:48 (none) kernel: [ 5339.764777] [<ffffffff8183c672>] entry_SYSCALL_64_fastpath+0x16/0x71
Apr 25 14:14:48 (none) kernel: [ 5339.764779] ---[ end trace ace97b626b47e1f9 ]---

Cherry-picking both 754bd598be9bbc953bc709a9e8ed7f3188bfb9d7 (http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz) and 094f469172e00d6ab0a3130b0e01c83b3cf3a98d (http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz) fix the crash. They appear to be intended as a series.


tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Daniel Axtens (daxtens) on 2017-05-03
description: updated
Changed in linux (Ubuntu Xenial):
status: Triaged → Fix Committed
Daniel Axtens (daxtens) wrote :

Hi,

I didn't realise this had hit -proposed; I am in the process of verifying the fix and will let you know ASAP.

Regards,
Daniel

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Jay Vosburgh (jvosburgh) wrote :

Customer has verified that 4.4.0-79-generic resolves the issue in their environment that would previously panic.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-79.100

---------------
linux (4.4.0-79.100) xenial; urgency=low

  * linux: 4.4.0-79.100 -proposed tracker (LP: #1691180)

  * linux-aws/linux-gke incorrectly producing and using linux-*-tools-
    common/linux-*-cloud-tools-common (LP: #1688579)
    - [Config] make linux-tools-common and linux-cloud-tools-common provide linux-
      gke versions
    - [Config] make linux-tools-common and linux-cloud-tools-common provide linux-
      aws versions
    - [Packaging] prevent linux-*-tools-common from being produced from non linux
      packages

  * CVE-2017-0605
    - tracing: Use strlcpy() instead of strcpy() in __trace_find_cmdline()

  * i915-bpo crashes on external hdmi input (LP: #1580272)
    - SAUCE: i915_bpo: Silence the warning about watermark entries not changing

  * Kernel panics on Xenial when using cgroups and strict CFS limits
    (LP: #1687512)
    - sched/fair: Initialize throttle_count for new task-groups lazily
    - sched/fair: Do not announce throttled next buddy in dequeue_task_fair()

  * bonding - mlx5 - speed changed to 0 after changing ring size (LP: #1687877)
    - bonding: allow notifications for bond_set_slave_link_state

  * Xenial update to 4.4.67 stable release (LP: #1689296)
    - timerfd: Protect the might cancel mechanism proper
    - Handle mismatched open calls
    - ASoC: intel: Fix PM and non-atomic crash in bytcr drivers
    - ALSA: ppc/awacs: shut up maybe-uninitialized warning
    - drbd: avoid redefinition of BITS_PER_PAGE
    - mtd: avoid stack overflow in MTD CFI code
    - net: tg3: avoid uninitialized variable warning
    - netlink: Allow direct reclaim for fallback allocation
    - IB/qib: rename BITS_PER_PAGE to RVT_BITS_PER_PAGE
    - IB/ehca: fix maybe-uninitialized warnings
    - ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY
    - ext4 crypto: revalidate dentry after adding or removing the key
    - ext4 crypto: use dget_parent() in ext4_d_revalidate()
    - ext4/fscrypto: avoid RCU lookup in d_revalidate
    - nfsd4: minor NFSv2/v3 write decoding cleanup
    - nfsd: stricter decoding of write-like NFSv2/v3 ops
    - dm ioctl: prevent stack leak in dm ioctl call
    - Linux 4.4.67

  * Precision Rack failed to resume from S4 (LP: #1686061)
    - x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
    - x86/boot: Split out kernel_ident_mapping_init()
    - x86/power/64: Always create temporary identity mapping correctly

  * Xenial update to 4.4.66 stable release (LP: #1688505)
    - f2fs: do more integrity verification for superblock
    - xc2028: unlock on error in xc2028_set_config()
    - ARM: OMAP2+: timer: add probe for clocksources
    - clk: sunxi: Add apb0 gates for H3
    - crypto: testmgr - fix out of bound read in __test_aead()
    - drm/amdgpu: fix array out of bounds
    - ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
    - md:raid1: fix a dead loop when read from a WriteMostly disk
    - MIPS: Fix crash registers on non-crashing CPUs
    - net: cavium: liquidio: Avoid dma_unmap_single on uninitialized ndata
    - net_sched: close another race condition in tcf_mirre...


Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Daniel Axtens (daxtens) on 2017-08-21
Changed in linux (Ubuntu):
status: Triaged → Fix Released