StarlingX

kernel oops and reboot on steady state AIO-DX controller

Bug #1837430 reported by Bart Wensley on 2019-07-22

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	High	Bin Yang

Bug Description

Brief Description
-----------------
After installing an AIO-DX (two node) system with the stx-openstack application, controller-0 had a kernel oops and spontaneously rebooted after being up for about 58 hours. There were no nova instances running and no other activity on the system at the time.

Severity
--------
Major: controller should not spontaneously reboot

Steps to Reproduce
------------------
Install AIO-DX system. Wait.

Expected Behavior
------------------
No kernel oops or spontaneous reboots.

Actual Behavior
----------------
The following kernel oops occurred:

[211368.883501] BUG: unable to handle kernel paging request at 0000000000001118
[211368.890583] IP: [<ffffffff8aa4c3a0>] eventpoll_release_file+0x50/0xc0
[211368.897121] PGD b84e39067 PUD 0
[211368.900476] Oops: 0000 [#1] PREEMPT SMP
[211368.904540] Modules linked in: xt_REDIRECT nf_nat_redirect xt_connmark ip6table_raw ip6table_mangle xt_CHECKSUM ebtable_filter ebtables ip6table_filter vxlan ip6_udp_tunnel udp_tunnel ip_gre gre ip6_tables xt_recent rbd libceph dns_resolver xt_statistic nbd tun ipt_REJECT nf_reject_ipv4 openvswitch nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 vfio_pci vfio_iommu_type1 xt_multiport xt_set iptable_mangle iptable_raw ip_set_hash_net ip_set_hash_ip ip_set ipip tunnel4 ip_tunnel veth nf_conntrack_netlink xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_filter xt_mark xt_nat xt_comment iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables binfmt_misc drbd(O) virtio_net nfnetlink xt_addrtype br_netfilter bridge stp llc nfsd auth_rpcgss nfs_acl lockd grace overlay(T)
[211368.976840] sunrpc xfs libcrc32c iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd dm_mod i2c_i801 joydev mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel vfio xprtrdma(O) svcrdma(O) rpcrdma(O) nvmet_rdma(O) nvme_rdma(O) mlx4_en(O) ib_srp(O) ib_isert(O) ib_iser(O) rdma_rxe(O) mlx5_ib(O) mlx5_core(O) mlxfw(O) mlx4_ib(O) mlx4_core(O) devlink rdma_ucm(O) rdma_cm(O) iw_cm(O) ib_ucm(O) ib_uverbs(O) ib_cm(O) ib_core(O) mlx_compat(O) ixgbevf(O) ixgbe(O) dca tpm_tis(O) tpm_tis_core(O) tpm(O) i40evf(O) uas usb_storage ahci libahci nfit libnvdimm i40e(O) e1000e(O) [last unloaded: nf_defrag_ipv4]
[211369.049922]
[211369.050126] CPU: 1 PID: 904505 Comm: nginx Kdump: loaded Tainted: G O ------------ T 3.10.0-957.12.2.el7.2.tis.x86_64 #1
[211369.062079] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.00.01.0015.110720180833 11/07/2018
[211369.072568] task: ffff9466c4ef45c0 ti: ffff9470bdcf4000 task.ti: ffff9470bdcf4000
[211369.080114] RIP: 0010:[<ffffffff8aa4c3a0>] [<ffffffff8aa4c3a0>] eventpoll_release_file+0x50/0xc0
[211369.089075] RSP: 0018:ffff9470bdcf7e60 EFLAGS: 00010286
[211369.094458] RAX: 0000000000001120 RBX: 00000000000010c8 RCX: ffffffff8b720e60
[211369.101661] RDX: ffffffff8b65db80 RSI: ffffffff8aa4afd0 RDI: ffff94724125bec8
[211369.108863] RBP: ffff9470bdcf7e90 R08: 0000000000000000 R09: ffffffff8aa4b553
[211369.116065] R10: ffff94732b2f82b0 R11: ffff947143b4ce10 R12: ffff94724125bec8
[211369.123267] R13: ffff94724125bec0 R14: ffff947143b4ceb0 R15: ffff94717b2601a0
[211369.130471] FS: 00007f1b2b3f2740(0000) GS:ffff9471cf240000(0000) knlGS:0000000000000000
[211369.138625] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[211369.144444] CR2: 0000000000001118 CR3: 00000008516c4000 CR4: 00000000007607e0
[211369.151643] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[211369.158845] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[211369.166041] PKRU: 55555554
[211369.168834] Call Trace:
[211369.171373] [<ffffffff8aa019d6>] __fput+0x256/0x260
[211369.176416] [<ffffffff8aa01abe>] ____fput+0xe/0x10
[211369.181374] [<ffffffff8a8a2b8b>] task_work_run+0xbb/0xe0
[211369.186846] [<ffffffff8a81bd38>] do_notify_resume+0x98/0xb0
[211369.192579] [<ffffffff8b015374>] int_signal+0x12/0x17
[211369.197789] Code: 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 e8 27 24 5c 00 49 8b 06 48 89 45 d0 48 8b 45 d0 48 8d 58 a8 49 39 c6 74 3b 0f 1f 00 <4c> 8b 6b 50 4d 8d 65 08 4c 89 e7 e8 00 24 5c 00 48 89 de 4c 89
[211369.218036] RIP [<ffffffff8aa4c3a0>] eventpoll_release_file+0x50/0xc0
[211369.224651] RSP <ffff9470bdcf7e60>
[211369.228221] CR2: 0000000000001118

Reproducibility
---------------
Unsure

System Configuration
--------------------
AIO-DX (two node) system

Branch/Pull Time/Commit
-----------------------
Designer load:
BUILD_DATE="2019-07-19 09:53:25 -0500"

Last Pass
---------
N/A

Timestamp/Logs
--------------
Collect logs and core dump will be attached.

Test Activity
-------------
Developer testing

Tags:

Revision history for this message

Bart Wensley (bartwensley) wrote on 2019-07-22:

controller-0_20190722.141546.tar Edit (66.8 MiB, application/x-tar)

Revision history for this message

Bart Wensley (bartwensley) wrote on 2019-07-22:

controller-1_20190722.135842.tgz Edit (61.0 MiB, application/x-tar)

Revision history for this message

Bart Wensley (bartwensley) wrote on 2019-07-22:

I am unable to attach the core dump due to its size - please contact me if you need it.

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-07-22:

Assigning to the distro.other team to investigate and determine release gate.

tags:	added: stx.distro.other
Changed in starlingx:
assignee:	nobody → Cindy Xie (xxie1)

Cindy Xie (xxie1) on 2019-07-22

Changed in starlingx:
assignee:	Cindy Xie (xxie1) → Bin Yang (byangintel)

Cindy Xie (xxie1) on 2019-07-23

Changed in starlingx:
importance:	Undecided → High

Revision history for this message

Bin Yang (byangintel) wrote on 2019-07-23:

Hi Bart,
could you share the core dump file to me?

thanks,
Bin

Revision history for this message

Bin Yang (byangintel) wrote on 2019-07-24:

Hi Bart,
could you please share kernel rpm, kernel src rpm and kernel debuginfo rpm to me? Please make sure they are corresponding to the crash dump.

thanks,
Bin

Revision history for this message

Cindy Xie (xxie1) wrote on 2019-07-24:

due to the fact that the kernel was not upgraded for weeks, just use the kernel rpm with debug info from Cengen, it should work.

Revision history for this message

Cindy Xie (xxie1) wrote on 2019-07-24:

@Bart, is the build from Cengen? or this is your dev build? We want to get the exact debugInfo corresponding your core dump info (Bin got it).

Revision history for this message

Bart Wensley (bartwensley) wrote on 2019-07-24:

@Cindy - this was a private build. I have provided the RPMs to Bin.

Revision history for this message

Daniel Badea (daniel.badea) wrote on 2019-07-24:

#10

Download full text (6.5 KiB)

From the kernel log we have:

1. "BUG: unable to handle kernel paging request at 0000000000001118"
Small address suggests NULL pointer dereference issue.

2. other registers with similar values:
RAX: 0000000000001120
RBX: 00000000000010c8

3. instruction pointer where crash occured:
"RIP [<ffffffff8aa4c3a0>] eventpoll_release_file+0x50/0xc0"

4. instructions surrounding code that generated the page fault
(note the < > marker surrounding last executed instruction):

  Code: 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 e8 27 24 5c
        00 49 8b 06 48 89 45 d0 48 8b 45 d0 48 8d 58 a8 49 39 c6
        74 3b 0f 1f 00 <4c> 8b 6b 50 4d 8d 65 08 4c 89 e7 e8 00 24
        5c 00 48 89 de 4c 89

When disassembled, the offending code looks like:

and if we can confirm the NULL pointer dereference caused by

mov r13,QWORD PTR [rbx+0x50]

Current code (comments stripped):

  void eventpoll_release_file(struct file *file)
  {
          struct eventpoll *ep;
          struct epitem *epi;

          mutex_lock(&epmutex);
          list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
                  ep = epi->ep;
                  mutex_lock_nested(&ep->mtx, 0);
                  ep_remove(ep, epi);
                  mutex_unlock(&ep->mtx);
          }
          mutex_unlock(&epmutex);
  }

To confirm we're looking at the same source, let's use debug symbols
to identify the byte code sequence:

1. unpack kernel sources in build workspace and note where the intermediate
source directory is located:

build-srpm-serial --edit kernel
# 15:53:49 ===== Source code can be found at: /localdisk/loadbuild/jenkins/test/std/srpm_work/kernel/gits/kernel.spec/kernel-3.10.0-957.12.2.el7/linux-3.10.0-957.12.2.el7

2. use kernel-debuginfo rpm to disassemble eventpoll_release_file:

    cd $MY_WORKSPACE
    KSRC_DIR=/localdisk/loadbuild/jenkins/test/std/srpm_work/kernel/gits/kernel.spec/kernel-3.10.0-957.12.2.el7/linux-3.10.0-957.12.2.el7
    KDBG_RPM=$(realpath ./std/rpmbuild/RPMS/kernel-debuginfo-3.10.0-957.12.2.el7.2.tis.x86_64.rpm)
    VMLINUX=$(rpm2cpio ${KDBG_RPM} | cpio -t | grep vmlinux)
    cd /tmp
    rpm2cpio ${KDBG_RPM} | cpio -idv ${VMLINUX}
    gdb --batch --command=/dev/stdin ${VMLINUX} <<...

From the kernel log we have:

1. "BUG: unable to handle kernel paging request at 0000000000001118"
   Small address suggests NULL pointer dereference issue.
   
2. other registers with similar values:
   RAX: 0000000000001120
   RBX: 00000000000010c8
   
3. instruction pointer where crash occured:
   "RIP [<ffffffff8aa4c3a0>] eventpoll_release_file+0x50/0xc0"
   
4. instructions surrounding code that generated the page fault
   (note the < > marker surrounding last executed instruction):

When disassembled, the offending code looks like:

0:  65 48 8b 04 25 28 00    mov    rax,QWORD PTR gs:0x28
7:  00 00
9:  48 89 45 d8             mov    QWORD PTR [rbp-0x28],rax
d:  31 c0                   xor    eax,eax
f:  e8 27 24 5c 00          call   0x5c243b
14: 49 8b 06                mov    rax,QWORD PTR [r14]
17: 48 89 45 d0             mov    QWORD PTR [rbp-0x30],rax
1b: 48 8b 45 d0             mov    rax,QWORD PTR [rbp-0x30]
1f: 48 8d 58 a8             lea    rbx,[rax-0x58]
23: 49 39 c6                cmp    r14,rax                  <-- while loop
26: 74 3b                   je     0x63
28: 0f 1f 00                nop    DWORD PTR [rax]
2b: 4c 8b 6b 50             mov    r13,QWORD PTR [rbx+0x50] <--- crash here
2f: 4d 8d 65 08             lea    r12,[r13+0x8]
33: 4c 89 e7                mov    rdi,r12
36: e8 00 24 5c 00          call   0x5c243b
3b: 48 89 de                mov    rsi,rbx
3e: 4c                      rex.WR
3f: 89                      .byte 0x89

and if we can confirm the NULL pointer dereference caused by

mov    r13,QWORD PTR [rbx+0x50]
   
Current code (comments stripped):

void eventpoll_release_file(struct file *file)
  {
          struct eventpoll *ep;
          struct epitem *epi;

To confirm we're looking at the same source, let's use debug symbols
to identify the byte code sequence:

1. unpack kernel sources in build workspace and note where the intermediate
   source directory is located:
   
    build-srpm-serial --edit kernel
    # 15:53:49 ===== Source code can be found at: /localdisk/loadbuild/jenkins/test/std/srpm_work/kernel/gits/kernel.spec/kernel-3.10.0-957.12.2.el7/linux-3.10.0-957.12.2.el7

2. use kernel-debuginfo rpm to disassemble eventpoll_release_file:

we get:
  ...
  933             mutex_lock(&epmutex);     
    <+16>:    mov    rdi,0xffffffff81e82b80
    <+52>:    call   0xffffffff8180e7b0 <mutex_lock>

934             list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
    <+23>:    add    r14,0xb0                             
    <+57>:    mov    rax,QWORD PTR [r14]
    <+60>:    mov    QWORD PTR [rbp-0x30],rax
    <+64>:    mov    rax,QWORD PTR [rbp-0x30]
    <+68>:    lea    rbx,[rax-0x58]
    <+72>:    cmp    r14,rax
    <+75>:    je     0xffffffff8124c3d8 <eventpoll_release_file+136>
    <+77>:    nop    DWORD PTR [rax]
    <+115>:   mov    rax,QWORD PTR [rbx+0x58]
    <+119>:   mov    QWORD PTR [rbp-0x30],rax
    <+123>:   mov    rax,QWORD PTR [rbp-0x30]
    <+127>:   lea    rbx,[rax-0x58]
    <+131>:   cmp    r14,rax
    <+134>:   jne    0xffffffff8124c3a0 <eventpoll_release_file+80>

935                     ep = epi->ep;
    <+80>:    mov    r13,QWORD PTR [rbx+0x50]
  ...

where we can identify "ep = epi-> ep" as the line that caused the page
fault.

Git blame in upstream source code fs/eventpoll.c reveals a bug fix in this
area of code for a use-after-free issue:

https://github.com/torvalds/linux/commit/ebe06187bf2aec10d537ce4595e416035367d703

> From ebe06187bf2aec10d537ce4595e416035367d703 Mon Sep 17 00:00:00 2001
> From: Konstantin Khlebnikov <koct9i@gmail.com>
> Date: Tue, 17 Jun 2014 06:58:05 +0400
> Subject: [PATCH] epoll: fix use-after-free in eventpoll_release_file

> This fixes use-after-free of epi->fllink.next inside list loop macro.
> This loop actually releases elements in the body.  The list is
> rcu-protected but here we cannot hold rcu_read_lock because we need to
> lock mutex inside.

> The obvious solution is to use list_for_each_entry_safe().  RCU-ness
> isn't essential because nobody can change this list under us, it's final
> fput for this file.

> The bug was introduced by ae10b2b4eb01 ("epoll: optimize EPOLL_CTL_DEL
> using rcu")

> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: Stable <stable@vger.kernel.org> # 3.13+
> Cc: Sasha Levin <sasha.levin@oracle.com>
> Cc: Jason Baron <jbaron@akamai.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> fs/eventpoll.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)

> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index b73e0621ce9e79..b10b48c2a7afaf 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -910,7 +910,7 @@ static const struct file_operations eventpoll_fops = {
> void eventpoll_release_file(struct file *file)
> {
>   struct eventpoll *ep;
> -	struct epitem *epi;
> +	struct epitem *epi, *next;

>   /*
>   * We don't want to get "file->f_lock" because it is not
> @@ -926,7 +926,7 @@ void eventpoll_release_file(struct file *file)
>   * Besides, ep_remove() acquires the lock, so we can't hold it here.
>   */
>   mutex_lock(&epmutex);
> -	list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
> +	list_for_each_entry_safe(epi, next, &file->f_ep_links, fllink) {
>     ep = epi->ep;
>     mutex_lock_nested(&ep->mtx, 0);
>     ep_remove(ep, epi);

My guess is our issue is similar and should be fixed by back-porting
the upstream patch.

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-07-24:

#11

Adding the stx.2.0 label since this was marked as high priority by the distro.other PL

tags:	added: stx.2.0
Changed in starlingx:
status:	New → Triaged

Revision history for this message

Bin Yang (byangintel) wrote on 2019-07-25:

#12

Hi Daniel,
I think you are right.

thanks,
Bin

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-25: Fix proposed to integ (master)

#13

Fix proposed to branch: master
Review: https://review.opendev.org/672625

Changed in starlingx:
status:	Triaged → In Progress

Revision history for this message

Bart Wensley (bartwensley) wrote on 2019-07-25:

#14

Note that I saw this again on the same AIO-DX hardware lab in Ottawa (wolfpass 1-2). The oops has the same traceback and happened after the controller had been up for only 4 hours.

Revision history for this message

Bin Yang (byangintel) wrote on 2019-07-26:

#15

Based on the code review, it is an obvious problem which was introduced by race condition.

The epi is removed from list by list_del_rcu(&epi->fllink) under list_for_each_entry_rcu() without rcu_read_lock. If the rcu grace-period thread free epi before next list_for_each loop, the content of epi will be corrupted.

Normally, the rcu read should be protected by rcu_read_lock. For performance optimization, it does not hold the lock here. But it removes the item from list and call_rcu() to free the pointer. The rcu task is running in background with about (HZ/10). The gap between call_rcu() and next loop is only a few C lines without sleep/sched. So it is very hard to reproduce it.

I think this is the reason that this patch is not in original Centos kernel. I checked it in https://git.centos.org/rpms/kernel/raw/c7/f/SPECS/kernel.spec, and did not found this patch currently.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-31: Fix merged to integ (master)

#16

Reviewed: https://review.opendev.org/672625
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=853388e615b6fb4fbec7fded3eb2652fe0120527
Submitter: Zuul
Branch: master

commit 853388e615b6fb4fbec7fded3eb2652fe0120527
Author: Bin Yang <email address hidden>
Date: Wed Jul 31 14:12:05 2019 +0800

epoll: fix use-after-free in eventpoll_release_file

back port upstream patch
https://github.com/torvalds/linux/commit/ebe06187bf2aec10d537ce4595e416035367d703

the epi is removed from list by list_del_rcu(&epi->fllink);
under list_for_each_entry_rcu() without rcu_read_lock.

if the rcu grace-period thread free epi before next list_for_each loop,
the content of epi will be corrupted.

    Change-Id: I75dbf8ada5ca4734761efe260ca6d6f85886b180
    Closes-Bug: 1837430
    Suggested-by: <email address hidden>
    Signed-off-by: Bin Yang <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.