4.4.0-83-generic + Docker + EC2 frequently crashes at cgroup_rmdir GPF

Bug #1702665 reported by sorah on 2017-07-06
This bug affects 5 people
Affects           Importance   Assigned to
docker (Ubuntu)   Undecided    Unassigned
linux (Ubuntu)    High         Unassigned

Bug Description

We run xenial-based Docker container hosts on EC2 with Amazon ECS. Since we recently refreshed our base image, we have started to see frequent kernel panics.

The hosts run the Amazon ECS Agent, which automatically creates and destroys Docker containers in response to requests made to the ECS cluster.

I think this crash is triggered by Docker-related operations, because it occurs in cgroup code (cgroup_rmdir).

We also run several other clusters on different EC2 instance types (c4.large and m4.*) using the same image. The problem reproduces only on t2.small instances.

Our previous image ran 4.4.0-79-generic, and we saw no problems with 79.

[30558.783899] general protection fault: 0000 [#1] SMP
[30558.784056] Modules linked in: veth binfmt_misc xt_nat xt_comment xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc isofs ppdev input_leds serio_raw parport_pc parport autofs4 btrfs xor raid6_pq crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse floppy
[30558.784056] CPU: 0 PID: 1 Comm: systemd Not tainted 4.4.0-83-generic #106-Ubuntu
[30558.784056] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
[30558.784056] task: ffff88007c4e8000 ti: ffff88007c4e4000 task.ti: ffff88007c4e4000
[30558.784056] RIP: 0010:[<ffffffff811193ff>] [<ffffffff811193ff>] cgroup_destroy_locked+0x5f/0xf0
[30558.784056] RSP: 0018:ffff88007c4e7e40 EFLAGS: 00010212
[30558.784056] RAX: ffff8800114481bd RBX: ffff88002827ba50 RCX: ffff88007ab8d150
[30558.784056] RDX: 00111e7e00ffff88 RSI: ffff88002827ba54 RDI: ffffffff8217745c
[30558.784056] RBP: ffff88007c4e7e60 R08: 0000000000000020 R09: ffff88007c4e7e70
[30558.784056] R10: 000000000637760b R11: ffff880011829a80 R12: ffff88007ab8d000
[30558.784056] R13: 0000000000000000 R14: 0000559b48b2dcc0 R15: 00000000ffffff9c
[30558.784056] FS: 00007f29cf0db8c0(0000) GS:ffff88007d200000(0000) knlGS:0000000000000000
[30558.784056] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30558.784056] CR2: 00007fa466d58180 CR3: 000000007c190000 CR4: 00000000001406f0
[30558.784056] Stack:
[30558.784056] ffff88002827ba50 ffff88002827ba50 ffff8800373e70d0 0000559b48b2dcc0
[30558.784056] ffff88007c4e7e80 ffffffff811194b3 ffff88002827ba50 0000000000000000
[30558.784056] ffff88007c4e7ea0 ffffffff8128ddcd ffff880011829a80 0000000000000000
[30558.784056] Call Trace:
[30558.784056] [<ffffffff811194b3>] cgroup_rmdir+0x23/0x40
[30558.784056] [<ffffffff8128ddcd>] kernfs_iop_rmdir+0x4d/0x80
[30558.784056] [<ffffffff8121b134>] vfs_rmdir+0xb4/0x130
[30558.784056] [<ffffffff8121f83f>] do_rmdir+0x1df/0x200
[30558.784056] [<ffffffff81220546>] SyS_rmdir+0x16/0x20
[30558.784056] [<ffffffff81840b72>] entry_SYSCALL_64_fastpath+0x16/0x71
[30558.784056] Code: 74 fd 48 c7 c7 5c 74 17 82 e8 8e 72 72 00 49 8b 94 24 50 01 00 00 49 8d 8c 24 50 01 00 00 48 39 d1 48 8d 42 f0 74 18 48 8b 50 08 <c6> 82 b0 01 00 00 01 48 8b 50 10 48 39 d1 48 8d 42 f0 75 e8 49
[30558.784056] RIP [<ffffffff811193ff>] cgroup_destroy_locked+0x5f/0xf0
[30558.784056] RSP <ffff88007c4e7e40>
[30558.960828] ---[ end trace 7634e03ff94e8934 ]---
[30558.964811] Kernel panic - not syncing: Fatal exception in interrupt
[30558.968805] Kernel Offset: disabled

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-83-generic 4.4.0-83.106
ProcVersionSignature: Ubuntu 4.4.0-83.106-generic 4.4.70
Uname: Linux 4.4.0-83-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jul 6 10:22 seq
 crw-rw---- 1 root audio 116, 33 Jul 6 10:22 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.6
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
Date: Thu Jul 6 10:27:37 2017
Ec2AMI: ami-34100353
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: ap-northeast-1c
Ec2InstanceType: t2.small
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
PciMultimedia:

ProcEnviron:
 SHELL=/bin/bash
 TERM=screen-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-83-generic root=UUID=f76be987-234f-4071-87d4-06318cfc2135 ro cgroup_enable=memory swapaccount=1 console=tty1 console=ttyS0
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-83-generic N/A
 linux-backports-modules-4.4.0-83-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
WifiSyslog:

dmi.bios.date: 02/16/2017
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd02/16/2017:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen

sorah (sorah) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
sorah (sorah) wrote :

Created a new instance and upgraded it to the latest 4.12 kernel.

$ uname -a
Linux ${HOSTNAME} 4.12.0-041200-generic #201707022031 SMP Mon Jul 3 00:32:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

sorah (sorah) wrote :

We've switched most of the instances in the cluster to the c4.large instance type, but another panic occurred. It has a different trace, so it may not be related to the first GPF, but I'm pasting the log below for reference.

Also, the t2.small instance on 4.12.x has now been running for 16+ hours, and we have not seen any panics yet. We'll keep an eye on how it goes.

[ 7236.612035] BUG: unable to handle kernel paging request at 000000010000000d
[ 7236.614750] IP: [<ffffffff81218247>] free_pipe_info+0x57/0x90
[ 7236.615155] PGD 0
[ 7236.615155] Oops: 0000 [#1] SMP
[ 7236.615155] Modules linked in: veth binfmt_misc xt_nat xt_comment xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs isofs ppdev input_leds serio_raw parport_pc parport autofs4 btrfs xor raid6_pq crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul psmouse glue_helper ablk_helper cryptd ixgbevf floppy
[ 7236.615155] CPU: 0 PID: 1 Comm: systemd Not tainted 4.4.0-83-generic #106-Ubuntu
[ 7236.615155] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
[ 7236.615155] task: ffff8800eabf0000 ti: ffff8800eabec000 task.ti: ffff8800eabec000
[ 7236.615155] RIP: 0010:[<ffffffff81218247>] [<ffffffff81218247>] free_pipe_info+0x57/0x90
[ 7236.615155] RSP: 0018:ffff8800eabefdf8 EFLAGS: 00010202
[ 7236.615155] RAX: 00000000fffffffd RBX: 0000000000000008 RCX: 000000000000012c
[ 7236.615155] RDX: 0000000000000028 RSI: ffff88005d5cc940 RDI: ffff8800e9840180
[ 7236.615155] RBP: ffff8800eabefe08 R08: 0000000000000000 R09: 0000000000000000
[ 7236.615155] R10: ffff8800e9e125b8 R11: ffff8800b5fc8510 R12: ffff8800e9840180
[ 7236.615155] R13: ffff8800e9e125b8 R14: ffff8800eab44c20 R15: ffff8800e9ee80c0
[ 7236.615155] FS: 00007fb3ddc0c8c0(0000) GS:ffff8800eb600000(0000) knlGS:0000000000000000
[ 7236.615155] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7236.615155] CR2: 000000010000000d CR3: 000000003641f000 CR4: 00000000001406f0
[ 7236.615155] Stack:
[ 7236.615155] ffff8800e9e12640 ffff8800e9840180 ffff8800eabefe30 ffffffff812182dc
[ 7236.615155] ffff8800e9840180 ffff8800b5fc8500 ffff8800e9e125b8 ffff8800eabefe58
[ 7236.615155] ffffffff81218390 ffff8800b5fc8500 0000000000000008 ffff8800e9e125b8
[ 7236.615155] Call Trace:
[ 7236.615155] [<ffffffff812182dc>] put_pipe_info+0x5c/0x70
[ 7236.615155] [<ffffffff81218390>] pipe_release+0xa0/0xb0
[ 7236.615155] [<ffffffff81210f34>] __fput+0xe4/0x220
[ 7236.615155] [<ffffffff812110ae>] ____fput+0xe/0x10
[ 7236.615155] [<ffffffff8109f031>] task_work_run+0x81/0xa0
[ 7236.615155] [<ffffffff81003242>] exit_to_usermode_loop+0xc2/0xd0
[ 7236.615155] [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[ 7236.615155] [<ffffffff81840cd0>] int_ret_from_sys_call+0x25/0x8f
[ 7236.615155] Code: 4a e7 ff 41 8b 44 24 48 85 c0 74 2c 48 63 c3 48 8d 14 80 49 8b 84 24 80 00 00 00 48 8d 34 d0 48 8b 46 10 48 85 c0 74 06 4c 89 e7 <ff> 50 10 83 c3 01 41 39 5c 24 48 77 d4 49 8b 7c 24 68 48 85 ff
[ 7236.615155] RIP [<ffffffff81218247>] free_pipe_info+0x57/0x90
[ 7236.615155] RSP <ffff8800eabefdf8...
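
To compare the two panics side by side, it can help to pull the faulting function and the call chain out of each dump. The following is a minimal Python sketch, assuming the dmesg text is available as a string in the 4.4-era oops format shown above. The name `summarize_oops` and both regular expressions are illustrative and not part of any kernel tooling.

```python
import re

# Matches both "RIP: 0010:[<addr>] [<addr>] func+0x5f/0xf0" and
# "RIP [<addr>] func+0x5f/0xf0"; captures the function name.
RIP_RE = re.compile(r"RIP.*\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Matches a call-trace frame such as "[<addr>] cgroup_rmdir+0x23/0x40".
# Group 1 is set when the frame is marked "?" (unreliable).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_oops(text):
    """Return (faulting_function, [reliable call-trace symbols])."""
    faulting = None
    frames = []
    in_trace = False
    for line in text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        rip = RIP_RE.search(line)
        if rip:
            faulting = rip.group(1)
            in_trace = False
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame is None:
                # First non-frame line (e.g. "Code: ...") ends the trace.
                in_trace = False
            elif frame.group(1) is None:
                frames.append(frame.group(2))
    return faulting, frames
```

Running this on each dump and comparing the results makes it easier to see whether two panics share any frames or only the surrounding workload.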


sorah (sorah) wrote :

I noticed that the following messages appear before both kinds of panic (the GPF and the "unable to handle kernel paging request").

[ 210.064089] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 1260.052073] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 3042.312072] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 4330.228094] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 4340.484089] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 5430.912074] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 7233.044085] unregister_netdevice: waiting for lo to become free. Usage count = 1
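
Since these messages seem to precede both kinds of panic, a host's saved dmesg output could be scanned for them as an early warning. This is a minimal Python sketch, assuming the log text uses the timestamp format shown above; `find_netdev_waits` is a hypothetical helper name, and the regular expression only targets this exact message.

```python
import re

# Matches e.g.
# "[ 210.064089] unregister_netdevice: waiting for lo to become free. Usage count = 1"
WAIT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+unregister_netdevice: waiting for (\S+) "
    r"to become free\. Usage count = (\d+)"
)


def find_netdev_waits(dmesg_text):
    """Return a list of (timestamp_seconds, device, usage_count) tuples."""
    hits = []
    for line in dmesg_text.splitlines():
        match = WAIT_RE.search(line)
        if match:
            hits.append(
                (float(match.group(1)), match.group(2), int(match.group(3)))
            )
    return hits
```

The timestamps make it possible to check how long before a panic the last such message was logged.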

sorah (sorah) wrote :

Additional logs from a 4.4.0 machine, captured by running `dmesg -w` until the crash.

[ 3031.276097] ------------[ cut here ]------------
[ 3031.276106] WARNING: CPU: 0 PID: 31804 at /build/linux-0uniEn/linux-4.4.0/net/ipv6/addrconf_core.c:159 in6_dev_finish_destroy+0x6b/0xc0()
[ 3031.276108] Modules linked in: veth binfmt_misc xt_nat xt_comment xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs isofs ppdev input_leds serio_raw parport_pc parport autofs4 btrfs xor raid6_pq crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse floppy
[ 3031.276131] CPU: 0 PID: 31804 Comm: kworker/u30:3 Not tainted 4.4.0-83-generic #106-Ubuntu
[ 3031.276132] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
[ 3031.276136] Workqueue: netns cleanup_net
[ 3031.276137] 0000000000000286 00000000e9d24b1b ffff880043383ba0 ffffffff813f9513
[ 3031.276140] 0000000000000000 ffffffff81d75940 ffff880043383bd8 ffffffff81081322
[ 3031.276141] ffff880070297800 0000000000000000 0000000000000006 ffff880043383ca8
[ 3031.276143] Call Trace:
[ 3031.276148] [<ffffffff813f9513>] dump_stack+0x63/0x90
[ 3031.276151] [<ffffffff81081322>] warn_slowpath_common+0x82/0xc0
[ 3031.276153] [<ffffffff8108146a>] warn_slowpath_null+0x1a/0x20
[ 3031.276154] [<ffffffff8181ba0b>] in6_dev_finish_destroy+0x6b/0xc0
[ 3031.276157] [<ffffffff817f17e6>] ip6_route_dev_notify+0x116/0x130
[ 3031.276159] [<ffffffff810a1b1a>] notifier_call_chain+0x4a/0x70
[ 3031.276161] [<ffffffff810a1c96>] raw_notifier_call_chain+0x16/0x20
[ 3031.276163] [<ffffffff8172f655>] call_netdevice_notifiers_info+0x35/0x60
[ 3031.276165] [<ffffffff81739e7d>] netdev_run_todo+0x16d/0x320
[ 3031.276168] [<ffffffff817316c9>] ? rollback_registered_many+0x2c9/0x340
[ 3031.276171] [<ffffffff8174540e>] rtnl_unlock+0xe/0x10
[ 3031.276173] [<ffffffff81732a37>] default_device_exit_batch+0x147/0x170
[ 3031.276176] [<ffffffff810c4140>] ? __wake_up_sync+0x20/0x20
[ 3031.276178] [<ffffffff8172a892>] ops_exit_list.isra.4+0x52/0x60
[ 3031.276179] [<ffffffff8172b8e2>] cleanup_net+0x1c2/0x2a0
[ 3031.276182] [<ffffffff8109a585>] process_one_work+0x165/0x480
[ 3031.276184] [<ffffffff8109a8eb>] worker_thread+0x4b/0x4c0
[ 3031.276186] [<ffffffff8109a8a0>] ? process_one_work+0x480/0x480
[ 3031.276187] [<ffffffff810a0c25>] kthread+0xe5/0x100
[ 3031.276189] [<ffffffff810a0b40>] ? kthread_create_on_node+0x1e0/0x1e0
[ 3031.276192] [<ffffffff81840f0f>] ret_from_fork+0x3f/0x70
[ 3031.276193] [<ffffffff810a0b40>] ? kthread_create_on_node+0x1e0/0x1e0
[ 3031.276194] ---[ end trace 5d921dfe814c9c7c ]---
[ 3031.276195] ------------[ cut here ]------------
[ 3031.276197] WARNING: CPU: 0 PID: 31804 at /build/linux-0uniEn/linux-4.4.0/net/ipv6/addrconf_core.c:160 in6_dev_finish_destroy+0x83/0xc0()
[ 3031.276198] Modules linked in: veth binfmt_misc xt_nat xt_comment xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tabl...
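
For anyone trying to capture similar logs, one possible way to keep the `dmesg -w` output up to the moment of the crash is to flush every line to disk as it arrives. This is a rough Python sketch and not necessarily how the logs above were collected. `persist_lines` and `follow_dmesg` are hypothetical names, and whether the last lines actually reach the disk before a panic depends on the system.

```python
import os
import subprocess


def persist_lines(lines, path):
    """Append each line to `path`, flushing and fsyncing after every write."""
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            out.write(line if line.endswith("\n") else line + "\n")
            out.flush()
            os.fsync(out.fileno())


def follow_dmesg(path):
    """Stream `dmesg -w` into `path` until interrupted."""
    # `dmesg -w` (--follow) keeps running and prints new kernel messages.
    proc = subprocess.Popen(["dmesg", "-w"], stdout=subprocess.PIPE, text=True)
    try:
        persist_lines(proc.stdout, path)
    finally:
        proc.terminate()
```

Calling `follow_dmesg("/var/tmp/dmesg-follow.log")` (the path is only an example) would block and keep appending until the process is stopped or the machine goes down.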


summary: - 4.4.0-83-generic + Docker + EC2 t2.small frequently crashes at
- cgroup_rmdir GPF
+ 4.4.0-83-generic + Docker + EC2 frequently crashes at cgroup_rmdir GPF
sorah (sorah) wrote :

The 4.12 instance is still running without a crash and survived the weekend.

Added kernel-fixed-upstream tag.

tags: added: kernel-fixed-upstream
Joseph Salisbury (jsalisbury) wrote :

Can you also give the latest upstream 4.4 kernel a test? It is available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.76/

sorah (sorah) wrote :

Sure. Upgraded one instance, which is now running 4.4.76-040476-generic alongside the existing 4.12 instance.

sorah (sorah) wrote :

To clarify: I upgraded one of the instances that was running 4.4.0-83 to 4.4.76-040476-generic.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in docker (Ubuntu):
status: New → Confirmed
sorah (sorah) wrote :

We have not seen any crashes on 4.4.76-040476-generic over the past few days.
