Ubuntu 15.10 Crashing Frequently on EC2 Instances w/ Enhanced Networking

Bug #1534345 reported by Will Buckner
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Critical
Assigned to: Naeil Ezzoueidi

Bug Description

Lots of details and history of the problem here: https://askubuntu.com/questions/710747/after-upgrading-to-15-10-from-15-04-ec2-webservers-have-become-very-unstable

10 of my webservers have started crashing immediately following the 15.10 upgrade. As for what exactly constitutes a "crash": Instance Status Checks fail and I can no longer SSH to the machine; background daemons running on the system stop responding, and nothing is written to the logs.

After weeks of working with the AWS team, I fixed a netconsole issue via "echo 7 > /proc/sys/kernel/printk", got netconsole working properly, and finally have a trace:

[21410.260077] general protection fault: 0000 [#1] SMP
[21410.261976] Modules linked in: isofs xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables ppdev intel_rapl iosf_mbi xen_fbfront fb_sys_fops input_leds serio_raw i2c_piix4 parport_pc 8250_fintek parport mac_hid netconsole configfs autofs4 crct10dif_pclmul crc32_pclmul cirrus syscopyarea sysfillrect sysimgblt aesni_intel ttm aes_x86_64 drm_kms_helper lrw gf128mul glue_helper ablk_helper cryptd psmouse drm ixgbevf pata_acpi floppy
[21410.264054] CPU: 0 PID: 26957 Comm: apache2 Not tainted 4.2.0-23-generic #28-Ubuntu
[21410.264054] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
[21410.264054] task: ffff8803f9809b80 ti: ffff8803f999c000 task.ti: ffff8803f999c000
[21410.264054] RIP: 0010:[<ffffffff810e5c36>] [<ffffffff810e5c36>] run_timer_softirq+0x116/0x2d0
[21410.264054] RSP: 0000:ffff8803ff203e98 EFLAGS: 00010086
[21410.264054] RAX: dead000000200200 RBX: ffff8803ff20e9c0 RCX: ffff8803ff203ec8
[21410.264054] RDX: ffff8803ff203ec8 RSI: 0000000000011fc0 RDI: ffff8803ff20e9c0
[21410.264054] RBP: ffff8803ff203f08 R08: 000000000000a77a R09: 0000000000000000
[21410.264054] R10: 0000000000000020 R11: 0000000000000004 R12: 000000000000007c
[21410.264054] R13: ffffffff8172aaf0 R14: 0000000000000000 R15: ffff8803af955be0
[21410.264054] FS: 00007fb0ce6e8780(0000) GS:ffff8803ff200000(0000) knlGS:0000000000000000
[21410.264054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21410.264054] CR2: 00007fb0ce51e130 CR3: 00000003fb233000 CR4: 00000000001406f0
[21410.264054] Stack:
[21410.264054] ffff8803ff203eb8 ffff8803ff20f5f8 ffff8803ff20f3f8 ffff8803ff20f1f8
[21410.264054] ffff8803ff20e9f8 ffff8803af955b58 dead000000200200 00000000f60fabc0
[21410.264054] 0000000000011fc0 0000000000000001 ffffffff81c0b0c8 0000000000000001
[21410.264054] Call Trace:
[21410.264054] <IRQ>
[21410.264054] [<ffffffff8107f846>] __do_softirq+0xf6/0x250
[21410.264054] [<ffffffff8107fb13>] irq_exit+0xa3/0xb0
[21410.264054] [<ffffffff814a4499>] xen_evtchn_do_upcall+0x39/0x50
[21410.264054] [<ffffffff817f1f6b>] xen_hvm_callback_vector+0x6b/0x70
[21410.264054] <EOI>
[21410.264054] Code: 81 e6 00 00 20 00 48 85 d2 48 89 45 b8 0f 85 30 01 00 00 4c 89 7b 08 0f 1f 44 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 47 2a 10 48 b8 00 02 20 00 00 00 ad de 49 c7
[21410.264054] RIP [<ffffffff810e5c36>] run_timer_softirq+0x116/0x2d0
[21410.264054] RSP <ffff8803ff203e98>

I don't have a vmcore at the moment, but I'm trying to get one from AWS and should have one in the next couple of days. This has been happening frequently and repeatedly since I first upgraded to 15.10 in early December.

ubuntu@xxx-web-xx:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 15.10
Release: 15.10
Codename: wily
ubuntu@xxx-web-xx:~$ uname -a
Linux xxx-web-xx 4.2.0-23-generic #28-Ubuntu SMP Sun Dec 27 17:47:31 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@xxx-web-xx:~$

ProblemType: Bug
DistroRelease: Ubuntu 15.10
Package: linux-image-4.2.0-23-generic 4.2.0-23.28
ProcVersionSignature: User Name 4.2.0-23.28-generic 4.2.6
Uname: Linux 4.2.0-23-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 14 15:42 seq
 crw-rw---- 1 root audio 116, 33 Jan 14 15:42 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.19.1-0ubuntu5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Jan 14 21:31:14 2016
Ec2AMI: ami-d5e7adbf
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1d
Ec2InstanceType: m4.xlarge
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
MachineType: Xen HVM domU
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:
 0 cirrusdrmfb
 1 xen
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.2.0-23-generic root=UUID=9bd55602-81dd-4868-8cfc-b7d63f8f8d7e ro console=tty1 console=ttyS0 crashkernel=256M@0M
RelatedPackageVersions:
 linux-restricted-modules-4.2.0-23-generic N/A
 linux-backports-modules-4.2.0-23-generic N/A
 linux-firmware 1.149.3
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UdevLog: Error: [Errno 2] No such file or directory: '/var/log/udev'
UpgradeStatus: Upgraded to wily on 2015-12-15 (29 days ago)
dmi.bios.date: 12/07/2015
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/07/2015:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen

Revision history for this message
Will Buckner (willbuckner) wrote :
Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Will Buckner (willbuckner) wrote :

And now we've got a second trace from the same machine with a bit more info:

And one more:

[14032.676085] general protection fault: 0000 [#1] SMP
[14032.678409] Modules linked in: isofs xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables ppdev xen_fbfront fb_sys_fops intel_rapl iosf_mbi input_leds i2c_piix4 parport_pc 8250_fintek serio_raw parport mac_hid netconsole configfs autofs4 cirrus syscopyarea sysfillrect sysimgblt ttm crct10dif_pclmul crc32_pclmul aesni_intel drm_kms_helper aes_x86_64 lrw gf128mul glue_helper ablk_helper drm cryptd ixgbevf psmouse pata_acpi floppy
[14032.680061] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-23-generic #28-Ubuntu
[14032.680061] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
[14032.680061] task: ffffffff81c13500 ti: ffffffff81c00000 task.ti: ffffffff81c00000
[14032.680061] RIP: 0010:[<ffffffff810e5c36>] [<ffffffff810e5c36>] run_timer_softirq+0x116/0x2d0
[14032.680061] RSP: 0018:ffff8803ff203e98 EFLAGS: 00010086
[14032.680061] RAX: dead000000200200 RBX: ffff8803ff20e9c0 RCX: 00000000000e1785
[14032.680061] RDX: ffff8803ff203ec8 RSI: ffff8803ff21be00 RDI: ffff8803ff20e9c0
[14032.680061] RBP: ffff8803ff203f08 R08: 000000000001be00 R09: 0000000000000000
[14032.680061] R10: ffffea00036dc680 R11: 0000000000000000 R12: 00000000000000d2
[14032.680061] R13: ffffffff8172aaf0 R14: 0000000000000000 R15: ffff8800db71a3d0
[14032.680061] FS: 0000000000000000(0000) GS:ffff8803ff200000(0000) knlGS:0000000000000000
[14032.680061] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14032.680061] CR2: 00007f97cd08ec64 CR3: 00000003f5834000 CR4: 00000000001406f0
[14032.680061] Stack:
[14032.680061] ffff8803ff203eb8 ffff8803ff20f5f8 ffff8803ff20f3f8 ffff8803ff20f1f8
[14032.680061] ffff8803ff20e9f8 ffff8800db71a348 dead000000200200 08c9e9276e6f82e9
[14032.680061] 0000000000011fc0 0000000000000001 ffffffff81c0b0c8 0000000000000001
[14032.680061] Call Trace:
[14032.680061] <IRQ>
[14032.680061] [<ffffffff8107f846>] __do_softirq+0xf6/0x250
[14032.680061] [<ffffffff8107fb13>] irq_exit+0xa3/0xb0
[14032.680061] [<ffffffff814a4499>] xen_evtchn_do_upcall+0x39/0x50
[14032.680061] [<ffffffff817f1f6b>] xen_hvm_callback_vector+0x6b/0x70
[14032.680061] <EOI>
[14032.680061] [<ffffffff810e7542>] ? get_next_timer_interrupt+0xf2/0x240
[14032.680061] [<ffffffff81060526>] ? native_safe_halt+0x6/0x10
[14032.680061] [<ffffffff8101f1fe>] default_idle+0x1e/0xa0
[14032.680061] [<ffffffff8101f9af>] arch_cpu_idle+0xf/0x20
[14032.680061] [<ffffffff810bd4aa>] default_idle_call+0x2a/0x40
[14032.680061] [<ffffffff810bd7e9>] cpu_startup_entry+0x2c9/0x320
[14032.680061] [<ffffffff817dda2c>] rest_init+0x7c/0x80
[14032.680061] [<ffffffff81d50025>] start_kernel+0x48b/0x4ac
[14032.680061] [<ffffffff81d4f120>] ? early_idt_handler_array+0x120/0x120
[14032.680061] [<ffffffff81d4f339>] x86_64_start_reservations+0x2a/0x2c
[14032.680061] [<ffffffff81d4f485>] x86_64_start_kernel+0x14a/0x16d
[14032.680061] Code: 81 e6 00 00 20 00 48 85 d2 48 89 45 b8 0f 85 30 01 00 00 4c 89 7b 08 0f 1f 44...


Revision history for this message
Will Buckner (willbuckner) wrote :

Do you guys need a vmcore? I'm working on getting one from AWS.

Changed in linux (Ubuntu):
importance: Undecided → Critical
Changed in linux (Ubuntu):
assignee: nobody → Alberto Salvia Novella (es20490446e)
status: Confirmed → In Progress
Changed in linux (Ubuntu):
status: In Progress → Triaged
assignee: Alberto Salvia Novella (es20490446e) → nobody
Changed in linux (Ubuntu):
assignee: nobody → Na3iL (naeilzoueidi)
Revision history for this message
Robert C Jennings (rcj) wrote :

Leann, can the kernel team take a look at this bug?

Revision history for this message
Stefan Bader (smb) wrote :

Looking at both traces, this appears to happen consistently inside run_timer_softirq(), and from the offset I would guess we are in the inlined __run_timers. Another noteworthy part is the value of RAX: it is LIST_POISON2, which is used to mark an invalid (hlist_node *)->pprev pointer.
So I would guess that something modified the list of pending timers (those exist per-CPU) while softirq processing was working on them. The problem is saying what. I am not sure a dump will help, as in those races the clues often go away right after causing the problem.
I would suspect the area of xen-netfront, given that, as far as I can tell, this has not happened on bare-metal servers and from the description it rather seems to affect high-traffic instances.
Would it be possible to volunteer one affected instance and try mainline kernels (https://wiki.ubuntu.com/Kernel/MainlineBuilds) between 3.19 and 4.2 (4.0, 4.1) and/or after (4.3, maybe 4.4)? That would give a smaller delta to look at for what broke things (using the 4.0 and 4.1 kernels) or whether maybe it got fixed but not identified as a stable patch (when using 4.3 or 4.4).

Revision history for this message
Will Buckner (willbuckner) wrote :

Thanks for looking into this, Stefan! We were completely fine with 15.04 and 3.19. If it won't break anything terribly, I can try to put 3.19, 4.0, and 4.1 on these machines, but each one crashes every 24-48 hours, so it might take me several days. Which kernel would you recommend starting with, say, 4.0 or 4.4?

Another thing that I didn't find relevant before, but seems to confirm what you're saying about the per-CPU timers--AWS told me the following after a crash where I disabled my auto-reboot-on-alarm triggers:

I was able to successfully get a trace - most of the vCPU were just in a halted state, so nothing there, but one had some potentially useful information:

++++++++++++++++++++
VCPU 1
rip: ffffffff810c3ef5 __pv_queued_spin_lock_slowpath+0xc5
flags: 00000206 i nz p
rsp: ffff8803ff243e78
rax: 0000000000000a2a rcx: 00000000fffffffa rdx: 0000000000000003
rbx: ffff8803f7ef2e38 rsi: ffff8803ff243df8 rdi: 0000000000000003
rbp: ffff8803ff243ea8 r8: 0000000000000000 r9: ffff8803fe800000
r10: 0000000000000000 r11: ffffffff813ef2b0 r12: ffff8803ff2571c0
r13: 0000000000080000 r14: ffff88040ffa30c0 r15: 0000000000000001
cs: 0010 ss: 0000 ds: 0000 es: 0000
fs: 0000 @ 00007fc1867b8700
gs: 0000 @ ffff8803ff240000/0000000000000000

cr0: 80050033
cr2: 000000a8
cr3: de15f000
cr4: 001406e0

dr0: 00000000
dr1: 00000000
dr2: 00000000
dr3: 00000000
dr6: ffff0ff0
dr7: 00000400
Code (instr addr ffffffff810c3ef5)
41 bf 01 00 00 00 48 0f af c3 48 89 45 d0 b8 00 80 00 00 eb 0b <f3> 90 83 e8 01 0f 84 d4 00 00 00

Stack:
8c2fa8473f0f2e38 ffff8803ff2577c0 ffff8803f7ef2e10 0000000000000000
ffff8803f7ef2e10 0000000101155691 ffff8803ff243eb8 ffffffff817f0021
ffff8803ff243f38 ffffffff816e48f4 0000000101155693 000000400000012c
0000000000000024 ffff8803ff243ee0 ffff8803ff243ee0 ffff8803ff243ef0

Call Trace:
  [<ffffffff810c3ef5>] __pv_queued_spin_lock_slowpath+0xc5 <--
  [<ffffffff817f0021>] _raw_spin_lock+0x21
  [<ffffffff816e48f4>] net_rx_action+0xe4
  [<ffffffff8107f846>] __do_softirq+0xf6
  [<ffffffff817f1ddc>] do_softirq_own_stack+0x1c
++++++++++++++++++++

Revision history for this message
Stefan Bader (smb) wrote :

Hi Will, and thanks for volunteering. That backtrace at least seems to confirm the educated guess about this being network-related. *If* it actually is related, then I could imagine that processing incoming network traffic in softirq context on cpu#1 cancels a timer which was set to wait for a packet to arrive, while that timer expires at or around the same time. Still hard to say for sure.
Looking at the xen-netfront driver between 3.19 and 4.2, there was no change that stuck out as suspicious, but then it might as well be a change in the network stack which the xen driver would need to adapt to.
About the mainline kernels: while I would never *promise* that nothing horrible will happen, I would not expect anything so bad that there is no chance to switch back to a different kernel. The mainline kernels are missing some of the few special drivers (mostly overlayfs), which I don't think you have in use. As you say 3.19 is ok, I would suggest going for 4.0.9 first and 4.1.15 after that. Both contain the latest upstream stable patches.

Revision history for this message
Stefan Bader (smb) wrote :

Note that the following is not a final statement, but some thoughts shared with whoever else is looking at this report (and for me to remember). While I found nothing that really looked odd in the xen-netfront code, I did see that there was a change to the generic timer code:

commit 1dabbcec2c0a36fe43509d06499b9e512e70a028
  timer: Use hlist for the timer wheel hash buckets

That change was part of 4.2, but if it were the cause I would expect problems not only on AWS instances. But then it might just be that bare-metal servers with similarly high traffic tend to be upgraded much less often... anyway. Part of the change above seems to be an exchange of the special meaning of the list pointer values. I am not sure I grasp the implications yet. With the doubly linked lists used before, the pointer to the next element seemed to serve as the pending indicator and the pointer to the previous element was invalidated with the LIST_POISON2 value. Now it is the other way round. This refers to the detach_timer function, which is called from __run_timers via detach_expired_timer.
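
For orientation, here is a condensed sketch of the two helpers involved, paraphrased from memory of the 4.2-era code (include/linux/list.h and kernel/time/timer.c); the exact code in the Ubuntu tree may differ slightly:

   /* Paraphrased sketch, not verbatim kernel code. */
   static inline void __hlist_del(struct hlist_node *n)
   {
           struct hlist_node *next = n->next;
           struct hlist_node **pprev = n->pprev;

           *pprev = next;               /* unlink: previous link skips over n    */
           if (next)                    /* only NULL (end of list) is tolerated  */
                   next->pprev = pprev; /* <- the write that faults when next    */
                                        /*    holds LIST_POISON2                 */
   }

   static inline void detach_timer(struct timer_list *timer, bool clear_pending)
   {
           struct hlist_node *entry = &timer->entry;

           __hlist_del(entry);
           if (clear_pending)
                   entry->pprev = NULL; /* pprev == NULL now means "not pending" */
           entry->next = LIST_POISON2;  /* next, not pprev, gets the poison now  */
   }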

The crash happens at offset 0x116 in run_timer_softirq (that's 278 decimal). The disassembly of that function around there is:

   0xffffffff810e5c1e <+254>: mov %r15,0x8(%rbx)
   0xffffffff810e5c22 <+258>: nopl 0x0(%rax,%rax,1)
   // Guess this is __hlist_del(struct hlist_node *n)
   // rax = n->next
   0xffffffff810e5c27 <+263>: mov (%r15),%rax
   // rdx = n->pprev
   0xffffffff810e5c2a <+266>: mov 0x8(%r15),%rdx
   0xffffffff810e5c2e <+270>: test %rax,%rax
   // *(n->pprev) = n->next
   0xffffffff810e5c31 <+273>: mov %rax,(%rdx)
   // if (n->next == NULL) jump
   0xffffffff810e5c34 <+276>: je 0xffffffff810e5c3a <run_timer_softirq+282>
   // (n->next)->pprev = n->pprev (but n->next is LIST_POISON2 / invalid ptr)
   0xffffffff810e5c36 <+278>: mov %rdx,0x8(%rax)
   0xffffffff810e5c3a <+282>: testb $0x10,0x2a(%r15)
   // here we seem back at detach_timer inlined and clear_pending assumed true
   // entry->next = LIST_POISON2 and entry->pprev = NULL
   0xffffffff810e5c3f <+287>: movabs $0xdead000000200200,%rax
   0xffffffff810e5c49 <+297>: movq $0x0,0x8(%r15)
   0xffffffff810e5c51 <+305>: mov %rax,(%r15)

Revision history for this message
Stefan Bader (smb) wrote :

Having written that down, it feels quite dangerous to use LIST_POISON in combination with __hlist_del, as the latter only protects against the next pointer being NULL (indicating the last list entry). That definitely breaks when trying to detach the same timer twice. I need to think about whether this would have been ok before (with the doubly linked lists)...
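
To spell out that failure mode (my own illustration based on the sketch and the disassembly above, not code quoted from the report; "t" is a hypothetical timer):

   /* Hypothetical double detach of the same timer t, for illustration only. */
   detach_timer(t, true);  /* 1st detach: t->entry.next  = LIST_POISON2,
                                          t->entry.pprev = NULL              */

   /* If t somehow ends up on a timer list again and is detached once more: */
   detach_timer(t, true);  /* __hlist_del() reads next == LIST_POISON2
                              (0xdead000000200200, the RAX value in the oops),
                              sees it is non-NULL and writes next->pprev,
                              i.e. to LIST_POISON2 + 8, which matches the
                              faulting "mov %rdx,0x8(%rax)" at +278.          */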

Revision history for this message
Stefan Bader (smb) wrote :

More braindump: One thing that seems sure is that the crash occurs in __run_timers after optionally cascading pending timers down from the higher-level lists. This picks a list of timers from the root vector array (tv1) into a temporary location and then removes timers from that (and calls their actions) until the temporary list is empty. Looking at the registers from the trace in comment #3 and comparing with the disassembly, I would say r15 contains the pointer to the currently processed timer (with the list pointers right at the beginning), rax is the next pointer, and rdx is the pointer to the previous element's address. rdx seems to be an address on the stack, so I would guess this is the temporary list's head. That makes sense, given that the loop always takes the first element from the work list and then deletes it from the list.
What should not happen is for this first element to already be marked as deleted while still being on the list, because the poison value for the next pointer is only set after the element has been unlinked. One detail to note is that rdx still points to the previous element's address. If I looked correctly, that could only happen from migrate_timers (which only runs if a cpu goes away, which is unlikely) or via __mod_timer(), and that properly locks the base and would have taken the element out of the list.
Oh wait... I just realized that even if the bad list element had pprev set to NULL, it would be "fixed up" by hlist_move_list(). So any thoughts about narrowing down potential sources of the modification are invalid. Only that it had to have gone through detach_timer() at some point is true... *sigh*
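
For context, this is roughly the shape of the __run_timers() work-list loop being described, again paraphrased from memory of the 4.2-era timer wheel (names approximate, not verbatim kernel code):

   while (time_after_eq(jiffies, base->timer_jiffies)) {
           struct hlist_head work_list;              /* lives on the stack */
           int index = base->timer_jiffies & TVR_MASK;

           /* ... cascade timers down from tv2..tv5 when index wraps ... */
           ++base->timer_jiffies;

           /* Move the whole tv1 bucket onto the on-stack list head;
              this is presumably the stack address seen in rdx. */
           hlist_move_list(base->tv1.vec + index, &work_list);

           while (!hlist_empty(&work_list)) {
                   struct timer_list *timer =
                           hlist_entry(work_list.first, struct timer_list, entry);

                   /* Crash site: unlinking the first entry; if its ->next is
                      already LIST_POISON2, the unlink write faults. */
                   detach_expired_timer(timer, base);

                   /* ... drop base->lock, call timer->function, re-lock ... */
           }
   }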

Revision history for this message
Will Buckner (willbuckner) wrote :

Hey Stefan,

Just wanted to update you that I've installed the Mainline 4.0.9 kernel (4.0.9-040009-generic #201507212131) on all affected machines, finishing about an hour ago, from https://wiki.ubuntu.com/Kernel/MainlineBuilds. I've still had two crashes in the hour since the downgrade.

I don't know enough about kernel code to provide feedback on the detail above, but thanks for looking into this!

I doubt that this could be related, but maybe it will help... We also appear to have one of our C utilities (developed in-house) crashing a few times a day:

xxx.log:[ 4740.992084] traps: checkport[6624] general protection ip:4c3d07 sp:7f8441ce8e20 error:0 in checkport[400000+ff000]
xxx.log:[60087.824087] traps: checkport[2203] general protection ip:4c3d07 sp:7fcc7df61e20 error:0 in checkport[400000+ff000]
xxx.log:[ 5375.784101] traps: checkport[10795] general protection ip:4c3d07 sp:7f96ebcbde20 error:0 in checkport[400000+ff000]

in addition to numerous segfaults:
xxx.log:[ 1354.456792] checkport[1644]: segfault at a8 ip 000000000049be3a sp 00007fea90080da0 error 4 in checkport[400000+ff000]
xxx.log:[ 1393.894368] checkport[2754]: segfault at a8 ip 000000000049be3a sp 00007f082e3c6da0 error 4 in checkport[400000+ff000]
xxx.log:[ 2474.260293] checkport[24996]: segfault at a8 ip 000000000049be3a sp 00007f569f6dcda0 error 4 in checkport[400000+ff000]
xxx.log:[ 4170.613174] checkport[23278]: segfault at a8 ip 000000000049be3a sp 00007fcf9b580da0 error 4 in checkport[400000+ff000]

The developers are looking into this on our end, but we still had the same version of the tool and the same number of segfaults with the 3.19 kernel on 15.04, and it never triggered a kernel crash. Is it possible that invalid memory access in this binary could be triggering a kernel GPF? Or would that memory be protected?

Revision history for this message
Will Buckner (willbuckner) wrote :

And I should also point out that ifquery is crashing on boot most of the time on these systems:

xxx.log:[ 3.909680] ifquery[378]: segfault at 1 ip 0000000000403187 sp 00007fff7078d8c0 error 4 in ifup[400000+d000]
xxx.log:[ 3.008003] ifquery[380]: segfault at 1 ip 0000000000403187 sp 00007ffdf6935bc0 error 4 in ifup[400000+d000]
xxx.log:[ 2.647084] ifquery[370]: segfault at 1 ip 0000000000403187 sp 00007ffda45292f0 error 4 in ifup[400000+d000]
xxx.log:[ 2.868947] ifquery[370]: segfault at 1 ip 0000000000403187 sp 00007ffcd9140230 error 4 in ifup[400000+d000]
xxx.log:[ 2.846450] ifquery[398]: segfault at 1 ip 0000000000403187 sp 00007ffc5da3d670 error 4 in ifup[400000+d000]
xxx.log:[ 3.070464] ifquery[372]: segfault at 1 ip 0000000000403187 sp 00007fff78691b90 error 4 in ifup[400000+d000]

But I've been unable to get a coredump, it doesn't save a .crash file, and I can't reproduce it once the system is online. And this "checkport" utility couldn't possibly be running so early in boot--it's triggered by a web request and represents a relatively low percentage of traffic. It's unlikely that it would be triggered at all in the first 30 seconds after boot.

Revision history for this message
Stefan Bader (smb) wrote :

The ifquery segfaults should be safe to ignore. I see those a lot, but they don't seem to have any impact (yeah, they probably should be fixed at some point... if there were not always more pressing matters). As for your in-house utility, it is hard to say without knowing more about what causes the segfault there. If the kernel interfaces were all well written, it should not be possible, but of course they are not.
If you really get crashes with 4.0 (in the sense that the guest goes away completely, like before), could you post me a console stacktrace? That should be different to some degree, as the timer code changes were not there before 4.2.
Kexec crashdump only works on HVM guests if you are not using PV drivers for network and disk, which could make the problem go away (along with any good performance). Making it work with PV drivers needs a very recent kernel (I think 4.4) as well as some support on the host side (over which there is no control on AWS).

Revision history for this message
Stefan Bader (smb) wrote :

Oh hm, another crazy thought... maybe it would also be worth trying a 3.19 kernel. If the guests crash with that as well, then maybe something in user-space (as in the whole release) is causing the issues.

Revision history for this message
Will Buckner (willbuckner) wrote :

I have now downgraded all of these systems to 3.19.8-031908-generic (mainline). We'll know in a couple of days if this fixed it (or possibly in as little as a few hours if it didn't). I'll update when I know anything else; good so far! 4.0.9 definitely didn't fix it completely, but MAY have made it less frequent.

Revision history for this message
Will Buckner (willbuckner) wrote :

3.19 crashed as well, just now. This is surprising. I don't have a trace, as the kernel downgrades messed up netconsole somehow, but I'll try to get it working again.

Revision history for this message
Stefan Bader (smb) wrote :

It would be great if you were able to get it working again, at least to see whether the crash happens in the same area (timers). At the moment this sounds like something in user-space changed in a way that allows it to mess badly with the kernel. That sounds bad. And if it really is something being done with the timers, that leaves a lot to look at.

Revision history for this message
Stefan Bader (smb) wrote :

Hm, one note, and sorry if this sounds not very trusting... have you ensured (uname -a) that you actually booted into the right older kernel? It is easy to go wrong, as normally only the kernel with the highest version number is used. One has to fiddle manually with /etc/default/grub (for HVM guests), or work with grubenv, though I usually prefer just to replace GRUB_DEFAULT=0 with GRUB_DEFAULT="<string>" and run update-grub. The string I copy from the menuentry I want to use in /boot/grub/grub.cfg...

Revision history for this message
Stefan Bader (smb) wrote :

Any news here? Also, it would help me figure out what exact code level any crashing kernel was at if you could post the "uname -a" output after booting into it on a likely affected machine.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

15.10 EOL

Changed in linux (Ubuntu):
status: Triaged → Won't Fix