kernel oops and reboot on steady state AIO-DX controller

Bug #1837430 reported by Bart Wensley
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
High
Bin Yang

Bug Description

Brief Description
-----------------
After installing an AIO-DX (two node) system with the stx-openstack application, controller-0 had a kernel oops and spontaneously rebooted after being up for about 58 hours. There were no nova instances running and no other activity on the system at the time.

Severity
--------
Major: controller should not spontaneously reboot

Steps to Reproduce
------------------
Install AIO-DX system. Wait.

Expected Behavior
------------------
No kernel oops or spontaneous reboots.

Actual Behavior
----------------
The following kernel oops occurred:

[211368.883501] BUG: unable to handle kernel paging request at 0000000000001118
[211368.890583] IP: [<ffffffff8aa4c3a0>] eventpoll_release_file+0x50/0xc0
[211368.897121] PGD b84e39067 PUD 0
[211368.900476] Oops: 0000 [#1] PREEMPT SMP
[211368.904540] Modules linked in: xt_REDIRECT nf_nat_redirect xt_connmark ip6table_raw ip6table_mangle xt_CHECKSUM ebtable_filter ebtables ip6table_filter vxlan ip6_udp_tunnel udp_tunnel ip_gre gre ip6_tables xt_recent rbd libceph dns_resolver xt_statistic nbd tun ipt_REJECT nf_reject_ipv4 openvswitch nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 vfio_pci vfio_iommu_type1 xt_multiport xt_set iptable_mangle iptable_raw ip_set_hash_net ip_set_hash_ip ip_set ipip tunnel4 ip_tunnel veth nf_conntrack_netlink xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_filter xt_mark xt_nat xt_comment iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables binfmt_misc drbd(O) virtio_net nfnetlink xt_addrtype br_netfilter bridge stp llc nfsd auth_rpcgss nfs_acl lockd grace overlay(T)
[211368.976840] sunrpc xfs libcrc32c iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd dm_mod i2c_i801 joydev mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel vfio xprtrdma(O) svcrdma(O) rpcrdma(O) nvmet_rdma(O) nvme_rdma(O) mlx4_en(O) ib_srp(O) ib_isert(O) ib_iser(O) rdma_rxe(O) mlx5_ib(O) mlx5_core(O) mlxfw(O) mlx4_ib(O) mlx4_core(O) devlink rdma_ucm(O) rdma_cm(O) iw_cm(O) ib_ucm(O) ib_uverbs(O) ib_cm(O) ib_core(O) mlx_compat(O) ixgbevf(O) ixgbe(O) dca tpm_tis(O) tpm_tis_core(O) tpm(O) i40evf(O) uas usb_storage ahci libahci nfit libnvdimm i40e(O) e1000e(O) [last unloaded: nf_defrag_ipv4]
[211369.049922]
[211369.050126] CPU: 1 PID: 904505 Comm: nginx Kdump: loaded Tainted: G O ------------ T 3.10.0-957.12.2.el7.2.tis.x86_64 #1
[211369.062079] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.00.01.0015.110720180833 11/07/2018
[211369.072568] task: ffff9466c4ef45c0 ti: ffff9470bdcf4000 task.ti: ffff9470bdcf4000
[211369.080114] RIP: 0010:[<ffffffff8aa4c3a0>] [<ffffffff8aa4c3a0>] eventpoll_release_file+0x50/0xc0
[211369.089075] RSP: 0018:ffff9470bdcf7e60 EFLAGS: 00010286
[211369.094458] RAX: 0000000000001120 RBX: 00000000000010c8 RCX: ffffffff8b720e60
[211369.101661] RDX: ffffffff8b65db80 RSI: ffffffff8aa4afd0 RDI: ffff94724125bec8
[211369.108863] RBP: ffff9470bdcf7e90 R08: 0000000000000000 R09: ffffffff8aa4b553
[211369.116065] R10: ffff94732b2f82b0 R11: ffff947143b4ce10 R12: ffff94724125bec8
[211369.123267] R13: ffff94724125bec0 R14: ffff947143b4ceb0 R15: ffff94717b2601a0
[211369.130471] FS: 00007f1b2b3f2740(0000) GS:ffff9471cf240000(0000) knlGS:0000000000000000
[211369.138625] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[211369.144444] CR2: 0000000000001118 CR3: 00000008516c4000 CR4: 00000000007607e0
[211369.151643] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[211369.158845] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[211369.166041] PKRU: 55555554
[211369.168834] Call Trace:
[211369.171373] [<ffffffff8aa019d6>] __fput+0x256/0x260
[211369.176416] [<ffffffff8aa01abe>] ____fput+0xe/0x10
[211369.181374] [<ffffffff8a8a2b8b>] task_work_run+0xbb/0xe0
[211369.186846] [<ffffffff8a81bd38>] do_notify_resume+0x98/0xb0
[211369.192579] [<ffffffff8b015374>] int_signal+0x12/0x17
[211369.197789] Code: 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 e8 27 24 5c 00 49 8b 06 48 89 45 d0 48 8b 45 d0 48 8d 58 a8 49 39 c6 74 3b 0f 1f 00 <4c> 8b 6b 50 4d 8d 65 08 4c 89 e7 e8 00 24 5c 00 48 89 de 4c 89
[211369.218036] RIP [<ffffffff8aa4c3a0>] eventpoll_release_file+0x50/0xc0
[211369.224651] RSP <ffff9470bdcf7e60>
[211369.228221] CR2: 0000000000001118

Reproducibility
---------------
Unsure

System Configuration
--------------------
AIO-DX (two node) system

Branch/Pull Time/Commit
-----------------------
Designer load:
BUILD_DATE="2019-07-19 09:53:25 -0500"

Last Pass
---------
N/A

Timestamp/Logs
--------------
Collect logs and core dump will be attached.

Test Activity
-------------
Developer testing

Revision history for this message
Bart Wensley (bartwensley) wrote :
Revision history for this message
Bart Wensley (bartwensley) wrote :
Revision history for this message
Bart Wensley (bartwensley) wrote :

I am unable to attach the core dump due to its size - please contact me if you need it.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to the distro.other team to investigate and determine release gate.

tags: added: stx.distro.other
Changed in starlingx:
assignee: nobody → Cindy Xie (xxie1)
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Bin Yang (byangintel)
Cindy Xie (xxie1)
Changed in starlingx:
importance: Undecided → High
Revision history for this message
Bin Yang (byangintel) wrote :

Hi Bart,
    could you share the core dump file to me?

thanks,
Bin

Revision history for this message
Bin Yang (byangintel) wrote :

Hi Bart,
    could you please share kernel rpm, kernel src rpm and kernel debuginfo rpm to me? Please make sure they are corresponding to the crash dump.

thanks,
Bin

Revision history for this message
Cindy Xie (xxie1) wrote :

due to the fact that the kernel was not upgraded for weeks, just use the kernel rpm with debug info from Cengen, it should work.

Revision history for this message
Cindy Xie (xxie1) wrote :

@Bart, is the build from Cengen? or this is your dev build? We want to get the exact debugInfo corresponding your core dump info (Bin got it).

Revision history for this message
Bart Wensley (bartwensley) wrote :

@Cindy - this was a private build. I have provided the RPMs to Bin.

Revision history for this message
Daniel Badea (daniel.badea) wrote :
Download full text (6.5 KiB)

From the kernel log we have:

1. "BUG: unable to handle kernel paging request at 0000000000001118"
   Small address suggests NULL pointer dereference issue.

2. other registers with similar values:
   RAX: 0000000000001120
   RBX: 00000000000010c8

3. instruction pointer where crash occured:
   "RIP [<ffffffff8aa4c3a0>] eventpoll_release_file+0x50/0xc0"

4. instructions surrounding code that generated the page fault
   (note the < > marker surrounding last executed instruction):

  Code: 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 e8 27 24 5c
        00 49 8b 06 48 89 45 d0 48 8b 45 d0 48 8d 58 a8 49 39 c6
        74 3b 0f 1f 00 <4c> 8b 6b 50 4d 8d 65 08 4c 89 e7 e8 00 24
        5c 00 48 89 de 4c 89

When disassembled, the offending code looks like:

0: 65 48 8b 04 25 28 00 mov rax,QWORD PTR gs:0x28
7: 00 00
9: 48 89 45 d8 mov QWORD PTR [rbp-0x28],rax
d: 31 c0 xor eax,eax
f: e8 27 24 5c 00 call 0x5c243b
14: 49 8b 06 mov rax,QWORD PTR [r14]
17: 48 89 45 d0 mov QWORD PTR [rbp-0x30],rax
1b: 48 8b 45 d0 mov rax,QWORD PTR [rbp-0x30]
1f: 48 8d 58 a8 lea rbx,[rax-0x58]
23: 49 39 c6 cmp r14,rax <-- while loop
26: 74 3b je 0x63
28: 0f 1f 00 nop DWORD PTR [rax]
2b: 4c 8b 6b 50 mov r13,QWORD PTR [rbx+0x50] <--- crash here
2f: 4d 8d 65 08 lea r12,[r13+0x8]
33: 4c 89 e7 mov rdi,r12
36: e8 00 24 5c 00 call 0x5c243b
3b: 48 89 de mov rsi,rbx
3e: 4c rex.WR
3f: 89 .byte 0x89

and if we can confirm the NULL pointer dereference caused by

   mov r13,QWORD PTR [rbx+0x50]

Current code (comments stripped):

  void eventpoll_release_file(struct file *file)
  {
          struct eventpoll *ep;
          struct epitem *epi;

          mutex_lock(&epmutex);
          list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
                  ep = epi->ep;
                  mutex_lock_nested(&ep->mtx, 0);
                  ep_remove(ep, epi);
                  mutex_unlock(&ep->mtx);
          }
          mutex_unlock(&epmutex);
  }

To confirm we're looking at the same source, let's use debug symbols
to identify the byte code sequence:

1. unpack kernel sources in build workspace and note where the intermediate
   source directory is located:

    build-srpm-serial --edit kernel
    # 15:53:49 ===== Source code can be found at: /localdisk/loadbuild/jenkins/test/std/srpm_work/kernel/gits/kernel.spec/kernel-3.10.0-957.12.2.el7/linux-3.10.0-957.12.2.el7

2. use kernel-debuginfo rpm to disassemble eventpoll_release_file:

    cd $MY_WORKSPACE
    KSRC_DIR=/localdisk/loadbuild/jenkins/test/std/srpm_work/kernel/gits/kernel.spec/kernel-3.10.0-957.12.2.el7/linux-3.10.0-957.12.2.el7
    KDBG_RPM=$(realpath ./std/rpmbuild/RPMS/kernel-debuginfo-3.10.0-957.12.2.el7.2.tis.x86_64.rpm)
    VMLINUX=$(rpm2cpio ${KDBG_RPM} | cpio -t | grep vmlinux)
    cd /tmp
    rpm2cpio ${KDBG_RPM} | cpio -idv ${VMLINUX}
    gdb --batch --command=/dev/stdin ${VMLINUX} <<...

Read more...

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Adding the stx.2.0 label since this was marked as high priority by the distro.other PL

tags: added: stx.2.0
Changed in starlingx:
status: New → Triaged
Revision history for this message
Bin Yang (byangintel) wrote :

Hi Daniel,
    I think you are right.

thanks,
Bin

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672625

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Bart Wensley (bartwensley) wrote :

Note that I saw this again on the same AIO-DX hardware lab in Ottawa (wolfpass 1-2). The oops has the same traceback and happened after the controller had been up for only 4 hours.

Revision history for this message
Bin Yang (byangintel) wrote :

Based on the code review, it is an obvious problem which was introduced by race condition.

The epi is removed from list by list_del_rcu(&epi->fllink) under list_for_each_entry_rcu() without rcu_read_lock. If the rcu grace-period thread free epi before next list_for_each loop, the content of epi will be corrupted.

Normally, the rcu read should be protected by rcu_read_lock. For performance optimization, it does not hold the lock here. But it removes the item from list and call_rcu() to free the pointer. The rcu task is running in background with about (HZ/10). The gap between call_rcu() and next loop is only a few C lines without sleep/sched. So it is very hard to reproduce it.

I think this is the reason that this patch is not in original Centos kernel. I checked it in https://git.centos.org/rpms/kernel/raw/c7/f/SPECS/kernel.spec, and did not found this patch currently.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/672625
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=853388e615b6fb4fbec7fded3eb2652fe0120527
Submitter: Zuul
Branch: master

commit 853388e615b6fb4fbec7fded3eb2652fe0120527
Author: Bin Yang <email address hidden>
Date: Wed Jul 31 14:12:05 2019 +0800

    epoll: fix use-after-free in eventpoll_release_file

    back port upstream patch
    https://github.com/torvalds/linux/commit/ebe06187bf2aec10d537ce4595e416035367d703

    the epi is removed from list by list_del_rcu(&epi->fllink);
    under list_for_each_entry_rcu() without rcu_read_lock.

    if the rcu grace-period thread free epi before next list_for_each loop,
    the content of epi will be corrupted.

    Change-Id: I75dbf8ada5ca4734761efe260ca6d6f85886b180
    Closes-Bug: 1837430
    Suggested-by: <email address hidden>
    Signed-off-by: Bin Yang <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.