Kernel crash vmcore controller reboot related to BIOS/Kernel CVE

Bug #2031597 reported by Jiping Ma
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   Medium       Jiping Ma

Bug Description

Brief Description

Standby controller crashes with a vmcore generated.

A vmcore was generated on controller-1 of Radio site 44895, running 21.12 Patch6 and Samsung R2.

This might be related to the Spectre/Meltdown mitigation (BIOS/kernel CVE) level of the BIOS and kernel. It happens when pods/containers are using RT cores.
Information has been requested from Dell on the BIOS/kernel CVE level of the Dell XR11 servers.
Interestingly, after the incident we saw an excessive etcd slowdown on controller-0; the smu-eru database shows a steep drop in free memory after 23:29:06.

Severity

Major: service-impacting reboot.

Steps to Reproduce

Experienced on a Radio site running StarlingX on controller-1.
More sites might be affected; this JIRA will be updated as soon as that is discovered.

Expected Behavior

The system should protect itself from rebooting and instead crash the application process that is causing the issue in the kernel.

Actual Behavior

The system crashes and a vmcore is generated.

Reproducibility

Such a vmcore has been seen 5 times.
Free memory being consumed at 2 MB per second was seen on another site, on 88807 worker-1 and 96090 controller-1.
etcd slowdowns were seen on almost all sites.

kernel BUG at kernel/locking/rtmutex.c:1331!
invalid opcode: 0000 [#1] PREEMPT_RT SMP NOPTI
......
Call Trace:
 rt_spin_lock_slowlock_locked+0xb2/0x2a0
 ? update_load_avg+0x80/0x690
 rt_spin_lock_slowlock+0x50/0x80
 ? update_load_avg+0x80/0x690
 rt_spin_lock+0x2a/0x30
 free_unref_page+0xc5/0x280
 __vunmap+0x17f/0x240
 put_task_stack+0xc6/0x130
 __put_task_struct+0x3d/0x180
 rt_mutex_adjust_prio_chain+0x365/0x7b0
 task_blocks_on_rt_mutex+0x1eb/0x370
 rt_spin_lock_slowlock_locked+0xb2/0x2a0
 rt_spin_lock_slowlock+0x50/0x80
 rt_spin_lock+0x2a/0x30
 free_unref_page_list+0x128/0x5e0
 release_pages+0x2b4/0x320
 tlb_flush_mmu+0x44/0x150
 tlb_finish_mmu+0x3c/0x70
 zap_page_range+0x12a/0x170
 ? find_vma+0x16/0x70
 do_madvise+0x99d/0xba0
 ? do_epoll_wait+0xa2/0xe0
 ? __x64_sys_madvise+0x26/0x30
 __x64_sys_madvise+0x26/0x30
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Jiping Ma (jma11)
Changed in starlingx:
assignee: nobody → Jiping Ma (jma11)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kernel (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/kernel/+/891652

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/c/starlingx/kernel/+/891652
Committed: https://opendev.org/starlingx/kernel/commit/b541465cc394fb7ca765a452908f0077d6c33e80
Submitter: "Zuul (22348)"
Branch: master

commit b541465cc394fb7ca765a452908f0077d6c33e80
Author: Jiping Ma <email address hidden>
Date: Wed Aug 16 21:28:54 2023 -0700

    kernel-rt: beware of __put_task_struct() calling context

    Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
    locks. Therefore, it can't be called from a non-preemptible context.

    Instead of calling __put_task_struct() directly, we defer it using
    call_rcu(). A more natural approach would use a workqueue, but since
    in PREEMPT_RT, we can't allocate dynamic memory from atomic context,
    the code would become more complex because we would need to put the
    work_struct instance in the task_struct and initialize it when we
    allocate a new task_struct.

    We hit the same panic 5 times: __put_task_struct() is called while
    the process holds a lock, which triggers the kernel BUG_ON. The call
    trace is below.

    We also need to cherry-pick the following commits, because the
    necessary context is not in 5.10.18x; for example, there is no
    definition of DEFINE_WAIT_OVERRIDE_MAP.

    * commit 5f2962401c6e
      ("locking/lockdep: Exclude local_lock_t from IRQ inversions")
    * commit 175b1a60e880
      ("locking/lockdep: Clean up check_redundant() a bit")
    * commit bc2dd71b2836
      ("locking/lockdep: Add a skip() function to __bfs()")
    * commit 0cce06ba859a
      ("debugobjects,locking: Annotate debug_object_fill_pool() wait type
       violation")

    kernel BUG at kernel/locking/rtmutex.c:1331!
    invalid opcode: 0000 [#1] PREEMPT_RT SMP NOPTI
    ......
    Call Trace:
     rt_spin_lock_slowlock_locked+0xb2/0x2a0
     ? update_load_avg+0x80/0x690
     rt_spin_lock_slowlock+0x50/0x80
     ? update_load_avg+0x80/0x690
     rt_spin_lock+0x2a/0x30
     free_unref_page+0xc5/0x280
     __vunmap+0x17f/0x240
     put_task_stack+0xc6/0x130
     __put_task_struct+0x3d/0x180
     rt_mutex_adjust_prio_chain+0x365/0x7b0
     task_blocks_on_rt_mutex+0x1eb/0x370
     rt_spin_lock_slowlock_locked+0xb2/0x2a0
     rt_spin_lock_slowlock+0x50/0x80
     rt_spin_lock+0x2a/0x30
     free_unref_page_list+0x128/0x5e0
     release_pages+0x2b4/0x320
     tlb_flush_mmu+0x44/0x150
     tlb_finish_mmu+0x3c/0x70
     zap_page_range+0x12a/0x170
     ? find_vma+0x16/0x70
     do_madvise+0x99d/0xba0
     ? do_epoll_wait+0xa2/0xe0
     ? __x64_sys_madvise+0x26/0x30
     __x64_sys_madvise+0x26/0x30
     do_syscall_64+0x33/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Verification:
    - build-pkgs; build-iso; install and boot up on aio-sx lab.
    - Cannot reproduce the issue during the stress-ng test for almost 24 hours.
      while true; do sudo stress-ng --sched rr --mmapfork 23 -t 20; done
      while true; do sudo stress-ng --sched fifo --mmapfork 23 -t 20; done

    Closes-Bug: 2031597
    Signed-off-by: Jiping Ma <email address hidden>
    Change-Id: If022441d61492eaec88eede8603a6cb052af99d1
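
The core of the fix described in the commit message above is to defer __put_task_struct() with call_rcu() whenever the last reference is dropped from a non-preemptible context on PREEMPT_RT. The following is a minimal sketch of that approach for illustration only; the actual kernel-rt patch (which also adds a DEFINE_WAIT_OVERRIDE_MAP-based lockdep annotation, hence the cherry-picked lockdep commits) may differ in detail.

/* Illustration only: sketch of the deferral approach described above.
 * In the real kernel this logic lives around put_task_struct() in
 * include/linux/sched/task.h and kernel/fork.c. */

/* RCU callback: runs in process context, where taking the sleeping
 * locks needed to free the task stack and task_struct is safe. */
void __put_task_struct_rcu_cb(struct rcu_head *rhp)
{
	struct task_struct *task = container_of(rhp, struct task_struct, rcu);

	__put_task_struct(task);
}

static inline void put_task_struct(struct task_struct *t)
{
	if (!refcount_dec_and_test(&t->usage))
		return;

	if (!IS_ENABLED(CONFIG_PREEMPT_RT) || preemptible()) {
		/* Preemptible (or non-RT) context: freeing immediately is fine. */
		__put_task_struct(t);
		return;
	}

	/*
	 * On PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
	 * locks (e.g. via free_unref_page() when releasing the task stack,
	 * as in the call trace above), so it must not run while preemption
	 * is disabled, such as inside the rt_mutex priority-chain walk.
	 * Defer the final free to process context instead.
	 */
	call_rcu(&t->rcu, __put_task_struct_rcu_cb);
}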

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distro.other