Kernel crashes from time to time when using ftrace

Bug #1605843 reported by Nikolay Borisov
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Medium
Unassigned

Bug Description

While performing some tracing suing ftrace-cmd I came across the following OOPS:

[ 333.051723] invalid opcode: 0000 [#1] SMP
[ 333.051742] Modules linked in: drbg ansi_cprng ctr ccm xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 arc4 bridge stp llc ebtable_filter ebtables ath9k_htc ath9k_common ath9k_hw ath mac80211 cfg80211 binfmt_misc snd_hda_codec_hdmi dcdbas dell_smm_hwmon snd_hda_codec_realtek snd_hda_codec_generic intel_rapl uvcvideo snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp videobuf2_vmalloc snd_hda_codec videobuf2_memops input_leds videobuf2_v4l2 snd_usb_audio crct10dif_pclmul videobuf2_core snd_hda_core v4l2_common crc32_pclmul snd_usbmidi_lib videodev snd_hwdep aesni_intel media snd_pcm aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_seq_midi snd_seq_midi_event snd_rawmidi serio_raw snd_seq snd_seq_device snd_timer lpc_ich snd ie31200_edac
[ 333.051972] edac_core soundcore mei_me mei 8250_fintek mac_hid kvm_intel ip6t_REJECT nf_reject_ipv6 kvm nf_log_ipv6 irqbypass xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack iptable_filter ip_tables parport_pc x_tables ppdev lp parport autofs4 hid_apple raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log i915 i2c_algo_bit drm_kms_helper psmouse syscopyarea sysfillrect sysimgblt fb_sys_fops e1000e ahci libahci drm ptp
[ 333.052206] pps_core fjes video
[ 333.052216] CPU: 1 PID: 5616 Comm: trace-cmd Not tainted 4.4.0-31-generic #50-Ubuntu
[ 333.052235] Hardware name: Dell Inc. Precision T1650/0X9M3X, BIOS A15 09/09/2013
[ 333.052254] task: ffff8804066b1b80 ti: ffff88040b474000 task.ti: ffff88040b474000
[ 333.052272] RIP: 0010:[<ffffffff818302a8>] [<ffffffff818302a8>] ftrace_stub+0x0/0x8
[ 333.052296] RSP: 0018:ffff88040b477f00 EFLAGS: 00010286
[ 333.052309] RAX: 0000000000000000 RBX: ffff8800d9a4ec00 RCX: ffff88040b477f18
[ 333.052326] RDX: 0000000000002000 RSI: 000000000237d690 RDI: ffff8800d9a4ec00
[ 333.052343] RBP: ffff88040b477f48 R08: 00007f89df102cf8 R09: 0000000000000021
[ 333.052360] R10: 000000000000000d R11: 0000000000000246 R12: ffff8800d9a4ec00
[ 333.052377] R13: 000000000237d690 R14: 0000000000002000 R15: 000000000237d690
[ 333.052395] FS: 00007f89df50f700(0000) GS:ffff88041e240000(0000) knlGS:0000000000000000
[ 333.052414] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 333.052428] CR2: 0000000000a78d88 CR3: 00000003c3542000 CR4: 00000000001406e0
[ 333.052445] Stack:
[ 333.052451] ffffffff8120d165 ffffffff8120df35 00007fff364487e5 0000000000000000
[ 333.052473] 00000000eee5d05c 0000000000000000 0000000000000000 0000000000000006
[ 333.052494] 0000000000000008 0000000000002000 ffffffff8182db32 0000000000000004
[ 333.052515] Call Trace:
[ 333.052525] [<ffffffff8120d165>] ? vfs_read+0x5/0x130
[ 333.052538] [<ffffffff8120df35>] ? SyS_read+0x55/0xc0
[ 333.052553] [<ffffffff8182db32>] entry_SYSCALL_64_fastpath+0x16/0x71
[ 333.052568] Code: 8b 44 24 48 48 8b 7c 24 70 48 8b 74 24 68 48 8b 54 24 60 48 8b 4c 24 58 48 8b 44 24 50 48 8b 6c 24 20 48 81 c4 d0 00 00 00 e9 fd <ff> ff ff 80 00 00 00 00 9c 55 ff 74 24 18 55 48 89 e5 ff 74 24
[ 333.052685] RIP [<ffffffff818302a8>] ftrace_stub+0x0/0x8
[ 333.052700] RSP <ffff88040b477f00>

All code
========
   0: 8b 44 24 48 mov 0x48(%rsp),%eax
   4: 48 8b 7c 24 70 mov 0x70(%rsp),%rdi
   9: 48 8b 74 24 68 mov 0x68(%rsp),%rsi
   e: 48 8b 54 24 60 mov 0x60(%rsp),%rdx
  13: 48 8b 4c 24 58 mov 0x58(%rsp),%rcx
  18: 48 8b 44 24 50 mov 0x50(%rsp),%rax
  1d: 48 8b 6c 24 20 mov 0x20(%rsp),%rbp
  22: 48 81 c4 d0 00 00 00 add $0xd0,%rsp
  29:* e9 fd ff ff ff jmpq 0x2b <-- trapping instruction
  2e: 80 00 00 addb $0x0,(%rax)
  31: 00 00 add %al,(%rax)
  33: 9c pushfq
  34: 55 push %rbp
  35: ff 74 24 18 pushq 0x18(%rsp)
  39: 55 push %rbp
  3a: 48 89 e5 mov %rsp,%rbp
  3d: ff .byte 0xff
  3e: 74 24 je 0x64

Code starting with the faulting instruction
===========================================
   0: ff (bad)
   1: ff (bad)
   2: ff 80 00 00 00 00 incl 0x0(%rax)
   8: 9c pushfq
   9: 55 push %rbp
   a: ff 74 24 18 pushq 0x18(%rsp)
   e: 55 push %rbp
   f: 48 89 e5 mov %rsp,%rbp
  12: ff .byte 0xff
  13: 74 24 je 0x39

The way I was running trace-cmd was:

trace-cmd stream -p function -l vfs_read -F ls

But the same crash occured if I ran 'trace-cmd record -p function -l vfs_read -F ls'

What's interesting is this doesn't happen always but will usually occur one out of 10 times or so. Apparently it goes bogus in the mcount handler:
addr2line -e /vmlinux ffffffff818302a8
/build/linux-dcxD3m/linux-4.4.0/arch/x86/kernel/mcount_64.S:184

I managed to also capture a complete kernel crashdump so if you need any other relevant information (diassembly of relvant function) I'm happy to provide it.

Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1605843

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :

Dmesg as seen from the crash dump image.

Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :
Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :

So here is a disassembly of the ftrace functions:

0xffffffff8183026b <ftrace_caller_op_ptr+7>: mov $0x0,%rcx
0xffffffff81830272 <ftrace_call>: callq 0xffffffff81156d80 <function_trace_call>
0xffffffff81830277 <ftrace_call+5>: mov 0x40(%rsp),%r9
0xffffffff8183027c <ftrace_call+10>: mov 0x48(%rsp),%r8
0xffffffff81830281 <ftrace_call+15>: mov 0x70(%rsp),%rdi
0xffffffff81830286 <ftrace_call+20>: mov 0x68(%rsp),%rsi
0xffffffff8183028b <ftrace_call+25>: mov 0x60(%rsp),%rdx
0xffffffff81830290 <ftrace_call+30>: mov 0x58(%rsp),%rcx
0xffffffff81830295 <ftrace_call+35>: mov 0x50(%rsp),%rax
0xffffffff8183029a <ftrace_call+40>: mov 0x20(%rsp),%rbp
0xffffffff8183029f <ftrace_call+45>: add $0xd0,%rsp
0xffffffff818302a6 <ftrace_caller_end>: jmpq 0xffffffff818302a8 <ftrace_stub>

And here is disassembly of ftrace_stub, which clearly looks broken:
crash> dis ftrace_stub 10
0xffffffff818302a8 <ftrace_stub>: (bad)
0xffffffff818302a9 <ftrace_stub+1>: (bad)
0xffffffff818302aa <ftrace_stub+2>: incl 0x0(%rax)
0xffffffff818302b0 <ftrace_regs_caller>: pushfq
0xffffffff818302b1 <ftrace_regs_caller+1>: push %rbp
0xffffffff818302b2 <ftrace_regs_caller+2>: pushq 0x18(%rsp)
0xffffffff818302b6 <ftrace_regs_caller+6>: push %rbp
0xffffffff818302b7 <ftrace_regs_caller+7>: mov %rsp,%rbp
0xffffffff818302ba <ftrace_regs_caller+10>: pushq 0x20(%rsp)
0xffffffff818302be <ftrace_regs_caller+14>: push %rbp

crash> rd 0xffffffff818302a8
ffffffff818302a8: 0000000080ffffff

So I guess the reason is that ftrace_stub somehow gets corrupted.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.7 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :

I tried with the upstream kernel :
nborisov@fisk:~$ uname -a
Linux fisk 4.7.0-040700-generic #201607241632 SMP Sun Jul 24 20:34:30 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

In the meantime I managed to find a sequence which can trigger this 100% of the time on the ubuntu kernel. The following commands have to be one after the other:

sudo trace-cmd record -p function_graph -l vfs_read -F ls
sudo trace-cmd record -p function -l vfs_read -F ls

tags: added: kernel-fixed-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :

Is there a way to obtain a vmlinux image for the upstream kernel as built by ubuntu (without having to compile locally) so that I can compare the disassemblies of the ftrace_stub ?

Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :

I think the fix for this crash is this upstream commit: 8329e818f14926a6040df86b2668568bde342ebf

Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :

Original report of the exact same issue: https://lkml.org/lkml/2016/5/13/327

Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :

Ping on the issue. I now also managed to trigger it while setting a very simple systemtap probe:

probe kernel.function("generic_fillattr") {
 printf("Uid in inode: %u gid: %u\n", $inode->i_uid->val, $inode->i_gid->val);
}

Running this resulted in:

[854807.910390] invalid opcode: 0000 [#1] SMP
[854807.911398] CPU: 6 PID: 1 Comm: systemd Tainted: G OE 4.4.0-31-generic #50-Ubuntu
[854807.911436] Hardware name: Dell Inc. Precision T1650/0X9M3X, BIOS A20 12/21/2015
[854807.911469] task: ffff88040be58000 ti: ffff88040be60000 task.ti: ffff88040be60000
[854807.911503] RIP: 0010:[<ffffffff818302a8>] [<ffffffff818302a8>] ftrace_stub+0x0/0x8
[854807.911544] RSP: 0018:ffff88040be63e50 EFLAGS: 00010246
[854807.911569] RAX: 0000000000000000 RBX: ffff880405bcf410 RCX: ffff88040378c6c0
[854807.911611] RDX: ffff88040d891b60 RSI: ffff88040be63ec0 RDI: ffff88014b51dd78
[854807.911632] RBP: ffff88040be63e60 R08: ffff88014b51dd78 R09: ffff88040be63e20
[854807.911672] R10: 00007fc4541498c0 R11: 0000000000000000 R12: ffff88040be63ec0
[854807.911703] R13: ffff88040be63ec0 R14: 0000000000000000 R15: 000055de063aff00
[854807.911736] FS: 00007fc4541498c0(0000) GS:ffff88041e380000(0000) knlGS:0000000000000000
[854807.911771] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[854807.911797] CR2: 00007f891558de00 CR3: 0000000405a4e000 CR4: 00000000001406e0
[854807.911828] Stack:
[854807.911839] ffffffff81211cd5 ffffffff81211da6 ffff88040be63e80 ffffffff81211fb6
[854807.911878] ffff880405bcf400 ffff880405bcf400 ffff88040be63eb0 ffffffff81211ff3
[854807.911917] 0000000000088000 00007ffc05f286b0 000000000000000a 00000000000007ff
[854807.911956] Call Trace:
[854807.911972] [<ffffffff81211cd5>] ? generic_fillattr+0x5/0xa0
[854807.911999] [<ffffffff81211da6>] ? vfs_getattr_nosec+0x36/0x40
[854807.912027] [<ffffffff81211fb6>] vfs_getattr+0x26/0x30
[854807.912052] [<ffffffff81211ff3>] vfs_fstat+0x33/0x60
[854807.912076] [<ffffffff812126f4>] SYSC_newfstat+0x24/0x60
[854807.912102] [<ffffffff8120c2cf>] ? do_sys_open+0x1bf/0x2a0
[854807.912129] [<ffffffff8121279e>] SyS_newfstat+0xe/0x10
[854807.912154] [<ffffffff8182db32>] entry_SYSCALL_64_fastpath+0x16/0x71
[854807.912183] Code: 8b 44 24 48 48 8b 7c 24 70 48 8b 74 24 68 48 8b 54 24 60 48 8b 4c 24 58 48 8b 44 24 50 48 8b 6c 24 20 48 81 c4 d0 00 00 00 e9 fd <ff> ff ff 80 00 00 00 00 9c 55 ff 74 24 18 55 48 89 e5 ff 74 24
[854807.912392] RIP [<ffffffff818302a8>] ftrace_stub+0x0/0x8
[854807.912420] RSP <ffff88040be63e50>

Revision history for this message
Nikolay Borisov (n-borisov-lkml) wrote :

This is still not fixed in the latest ubuntu kernel - 4.4.0-38-generic #57-Ubuntu. Considering the low-impact of the fix is this going to go in future 4.4 releases?

Revision history for this message
Andy Whitcroft (apw) wrote :

This commit was applied in Ubuntu-4.4.0-67.88. Closing this out.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.