16.04 - Kernel panic on docker server

Bug #1651199 reported by Mike Kaplinskiy
This bug affects 3 people
Affects          Status         Importance   Assigned to   Milestone
Linux            Fix Released   Unknown
linux (Ubuntu)   Confirmed      High         Unassigned
Xenial           Confirmed      High         Unassigned

Bug Description

[31016.057405] BUG: unable to handle kernel paging request at ffff82c4801e6bda
[31016.061249] IP: [<ffffffff814d4ac3>] __xen_evtchn_do_upcall+0x43/0x80
[31016.061380] PGD 0
[31016.061380] Oops: 0010 [#1] SMP
[31016.061380] Modules linked in: binfmt_misc xt_REDIRECT nf_nat_redirect veth xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc overlay xt_nat xt_tcpudp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables isofs ppdev serio_raw parport_pc parport ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd ixgbevf psmouse floppy
[31016.061380] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0-53-generic #74-Ubuntu
[31016.061380] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[31016.061380] task: ffff880406496040 ti: ffff8804064e0000 task.ti: ffff8804064e0000
[31016.061380] RIP: 0010:[<ffffffff814d4ac3>] [<ffffffff814d4ac3>] __xen_evtchn_do_upcall+0x43/0x80
[31016.061380] RSP: 0018:ffff88040fc43f78 EFLAGS: 00010082
[31016.061380] RAX: 0000000000000000 RBX: ffff88040fc4b840 RCX: 0000000000000001
[31016.061380] RDX: fffffffffffffffe RSI: 00000000000123a0 RDI: ffff88040fc43f30
[31016.061380] RBP: ffff88040fc43f90 R08: 000000000000f24a R09: 0000000000000001
[31016.061380] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
[31016.061380] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8804064e0000
[31016.178419] FS: 0000000000000000(0000) GS:ffff88040fc40000(0000) knlGS:0000000000000000
[31016.178419] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[31016.178419] CR2: ffff82c4801e6bda CR3: 0000000204f2f000 CR4: 00000000001406e0
[31016.178419] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[31016.178419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[31016.178419] Stack:
[31016.178419] 0000000000000000 0000000000000003 0000000000000000 ffff88040fc43fa8
[31016.178419] ffffffff814d6bc0 ffffffff81f36a00 ffff8804064e3e90 ffffffff81837f32
[31016.178419] ffff8804064e3de8 <EOI> ffff8804064e0000 0000000000000000 0000000000000000
[31016.178419] Call Trace:
[31016.178419] <IRQ>
[31016.178419] [<ffffffff814d6bc0>] xen_evtchn_do_upcall+0x30/0x40
[31016.178419] [<ffffffff81837f32>] xen_hvm_callback_vector+0x82/0x90
[31016.178419] <EOI>
[31016.178419] [<ffffffff810645d6>] ? native_safe_halt+0x6/0x10
[31016.178419] [<ffffffff81038d7e>] default_idle+0x1e/0xe0
[31016.239315] [<ffffffff8103958f>] arch_cpu_idle+0xf/0x20
[31016.239899] [<ffffffff810c41ba>] default_idle_call+0x2a/0x40
[31016.239899] [<ffffffff810c4521>] cpu_startup_entry+0x2f1/0x350
[31016.239899] [<ffffffff81051714>] start_secondary+0x154/0x190
[31016.239899] Code: 01 00 00 00 65 44 8b 2d dc 56 b3 7e c6 03 00 44 89 e0 65 0f c1 05 2e d8 b3 7e 85 c0 75 35 48 8b 05 63 ce d0 00 44 89 ef ff 50 50 <9c> 58 0f 1f 44 00 00 f6 c4 02 75 23 65 8b 05 0a d8 b3 7e 65 c7
[31016.239899] RIP [<ffffffff814d4ac3>] __xen_evtchn_do_upcall+0x43/0x80
[31016.239899] RSP <ffff88040fc43f78>
[31016.239899] CR2: ffff82c4801e6bda
[31016.239899] ---[ end trace 5b3e8ea32013e327 ]---
[31016.239899] Kernel panic - not syncing: Fatal exception in interrupt
[31016.239899] Kernel Offset: disabled

We believe this appeared in the last 1-2 months of releases: the panics started after we ran `apt-get upgrade` on our machines following a bit of a pause in updates. These are EC2 m4.xlarge servers running Ubuntu 16.04.1. They are Kubernetes minions, so I assume docker is the likely trigger.

Unfortunately, journalctl shows no interesting logs from the boot preceding the panic.

Let me know what I can do to debug further.

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-53-generic 4.4.0-53.74
ProcVersionSignature: Ubuntu 4.4.0-53.74-generic 4.4.30
Uname: Linux 4.4.0-53-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Dec 19 15:56 seq
 crw-rw---- 1 root audio 116, 33 Dec 19 15:56 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
Date: Mon Dec 19 18:12:12 2016
Ec2AMI: ami-f1ca1091
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-west-2c
Ec2InstanceType: m4.xlarge
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
JournalErrors:
 Error: command ['journalctl', '-b', '--priority=warning', '--lines=1000'] failed with exit code 1: Hint: You are currently not seeing messages from other users and the system.
       Users in the 'systemd-journal' group can see all messages. Pass -q to
       turn off this notice.
 No journal files were opened due to insufficient permissions.
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-53-generic root=UUID=6a0a7342-52da-46ce-af2a-92daf9df2224 ro console=tty1 console=ttyS0
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-53-generic N/A
 linux-backports-modules-4.4.0-53-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
WifiSyslog:

dmi.bios.date: 11/11/2016
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd11/11/2016:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen

Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.9 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9

Changed in linux (Ubuntu):
importance: Undecided → High
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Also, is there a prior kernel version you've identified that does not exhibit this bug?

tags: added: kernel-da-key
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
status: New → Confirmed
status: Confirmed → Incomplete
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Fumihito YOSHIDA (hito) wrote :

This bug seems to have been reported upstream at https://github.com/docker/docker/issues/27875 ; linked.

Changed in linux:
status: Unknown → New
Revision history for this message
Mike Kaplinskiy (mike-kaplinskiy) wrote :

Joseph - I switched one of the 6 machines that previously showed this behavior to the mainline kernel: Linux ip-10-0-32-196.us-west-2.compute.internal 4.9.0-040900-generic #201612111631 SMP Sun Dec 11 21:33:00 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux . I'll let you know the results of this experiment tomorrow.

Fumihito - I saw that docker bug when researching this issue, and I downgraded docker-engine on those machines to 1.12.2. However, that didn't help. Hopefully the mainline kernel test will shed some light on this.

For reference this is my docker config:

$ sudo docker info
Containers: 102
 Running: 13
 Paused: 0
 Stopped: 89
Images: 13
Server Version: 1.12.2
Storage Driver: overlay2
 Backing Filesystem: extfs
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host overlay null
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.9.0-040900-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67 GiB
Name: ip-10-0-32-196.us-west-2.compute.internal
ID: TSIL:W3HI:APTU:DSVF:QDEF:HOJ6:PRB7:L7EL:QRQ2:K63C:CH22:RVQR
Docker Root Dir: /data/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 59
 Goroutines: 116
 System Time: 2016-12-20T07:49:15.79205206Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Revision history for this message
Mike Kaplinskiy (mike-kaplinskiy) wrote :

Joseph - I have not identified a kernel version without the bug, though we did not see this behavior with the kernel released ~2 months ago. I can try to pinpoint which kernel release that was, though I'd like to wait and see whether the mainline kernel panics as well before bisecting. Unfortunately these happen several times a day, so the turnaround cycle is a bit long.

Revision history for this message
Mitchel Humpherys (mgalgs) wrote :

I'm seeing the same thing on an m4.16xlarge EC2 instance. It has crashed twice in the past 24 hours.

Revision history for this message
Mitchel Humpherys (mgalgs) wrote :

BTW, I'm seeing this same bug (same symptom at least; the exact same call stack in the kernel panic) and I'm *not* using docker. The main workload on the machine where I'm seeing this is Postgres.
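When checking whether two machines hit the same panic, it can help to pull out just the faulting symbol from each oops instead of comparing full traces by eye. A minimal sketch, assuming the oops uses the same `IP: [<address>] symbol+offset/size` line format as the trace in the bug description; the function name `oops_fault_symbol` is made up for illustration:

```shell
# Print the faulting symbol from the first "IP: [<addr>] symbol+0x.." line
# of a kernel oops read on stdin. Lines such as "RIP: 0010:[<...>]" do not
# match, because the pattern requires "IP: " to be followed directly by "[<".
oops_fault_symbol() {
    sed -n 's/.*IP: \[<[0-9a-f]*>\] \([A-Za-z0-9_.]*\)+0x.*/\1/p' | head -n 1
}

# Possible usage (reading the kernel log may require root):
#   dmesg | oops_fault_symbol
```

If both machines print the same symbol, that suggests (but does not prove) a common root cause; the full call trace is still worth comparing.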

Revision history for this message
Mike Kaplinskiy (mike-kaplinskiy) wrote :

My canary box is running 4.9.0-040900.201612111631 now and hasn't rebooted in ~1.5 days. I'll roll this out further, but this may be fixed in mainline.

Revision history for this message
Mike Kaplinskiy (mike-kaplinskiy) wrote :

I also noticed that one of the boxes wasn't upgraded (for whatever reason) and has not had any panics in the last few days. The kernel version on that box is: Linux ip-10-0-26-30.us-west-2.compute.internal 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
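With a kernel that has not panicked (4.4.0-43) and one that does (4.4.0-53), the releases worth bisecting are the ones in between. A rough sketch of narrowing a version list to that window, assuming GNU `sort -V` is available; the function name `versions_between` is hypothetical, and the list of versions to feed it has to come from wherever you track available kernel packages:

```shell
# Read kernel version strings on stdin (one per line) and print those that
# sort strictly between a known-good and a known-bad version.
versions_between() {
    good="$1"
    bad="$2"
    { cat; echo "$good"; echo "$bad"; } |
        sort -V -u |
        awk -v g="$good" -v b="$bad" '$0 == g { on = 1; next } $0 == b { on = 0 } on'
}

# Possible usage:
#   versions_between 4.4.0-43 4.4.0-53 < kernel-versions.txt
```

Testing the middle entry of the printed list first halves the remaining candidates on each round.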

tags: added: kernel-fixed-upstream
Revision history for this message
Mike Kaplinskiy (mike-kaplinskiy) wrote :

After upgrading to mainline builds everywhere, I haven't seen a restart in at least a day. It seems this is fixed in the mainline kernel.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Xenial):
status: Incomplete → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can you see if the latest 4.4 upstream kernel also fixes this bug? That will tell us if the fix in mainline was already cc'd to stable. It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.42/

Revision history for this message
Mitchel Humpherys (mgalgs) wrote :

Has this been fixed in any of the newer linux-image packages? I, unfortunately, can't offer to test since I was only seeing this on my production database server... For now I've just frozen my kernel version to 4.4.0-38.
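One way to freeze a kernel version on Ubuntu is `apt-mark hold`. The sketch below only prints the command rather than running it. It assumes the usual Ubuntu package naming (`linux-image-<version>-generic` plus the `linux-image-generic` and `linux-generic` metapackages); the function name `hold_kernel_command` is made up, and this is not necessarily how the commenter pinned their kernel:

```shell
# Print (do not run) an apt-mark command that would hold one kernel ABI
# version together with the metapackages that normally pull in newer kernels.
# $1: kernel ABI version, e.g. 4.4.0-38   $2: flavour (default: generic)
hold_kernel_command() {
    abi="$1"
    flavour="${2:-generic}"
    echo "apt-mark hold linux-image-${abi}-${flavour} linux-image-${flavour} linux-${flavour}"
}

# Possible usage: review the printed command, then run it with sudo.
#   hold_kernel_command 4.4.0-38
```

A hold only stops apt from upgrading or removing these packages. It does not change which installed kernel the bootloader selects, so a newer kernel that is already installed would need to be removed or deselected separately.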

Revision history for this message
Michael Bittorf (codingminds) wrote :

Same problem here. The server died frequently with kernel versions >4.4.0-38, including 4.13.0-45 and 4.15.0-23.
After installing linux-image-4.4.42-040442-generic_4.4.42-040442.201701120631_amd64.deb the server seems to be stable.
> docker01:~ $ uname -a
> Linux docker01 4.4.42-040442-generic #201701120631 SMP Thu Jan 12 11:32:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I'll let it run with this kernel and report if the problem occurs again.

Revision history for this message
Michael Bittorf (codingminds) wrote :

No problems since installing "4.4.42-040442-generic #201701120631". Everything is fine.

Changed in linux:
status: New → Fix Released
Brad Figg (brad-figg)
tags: added: cscc