AIO-SX Low-latency: Watchdog fires while installing openstack

Bug #1832854 reported by Brent Rowsell
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Alexander Kozyrev
Milestone: (none)

Bug Description

Brief Description
-----------------
I attempted to install OpenStack on an AIO-SX with the low-latency profile. Towards the end of the installation, the load average climbed above 400, the terminal stopped responding, and eventually the watchdog fired and rebooted the system.

Severity
--------
Critical+

Steps to Reproduce
------------------
See above

Expected Behavior
------------------
System does not implode

Actual Behavior
----------------
System implodes

Reproducibility
---------------
100%

System Configuration
--------------------
AIO-SX low-latency

Branch/Pull Time/Commit
-----------------------
"2019-06-12 20:20:07 -0400"

Last Pass
---------
This worked on a June 4th load.

Timestamp/Logs
--------------
Will attach logs

Test Activity
-------------
Other

Changed in starlingx:
importance: Undecided → Critical
tags: added: stx.2.0
description: updated
Brent Rowsell (brent-rowsell) wrote :

The issue, including the high load average, is not seen on a load built on June 4th.

Ghada Khalil (gkhalil)
tags: added: stx.distro.other
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Alex Kozyrev (akozyrev)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → In Progress
Cindy Xie (xxie1) wrote :

@Alex, any progress that you can share for this critical bug?

Alexander Kozyrev (akozyrev) wrote :

We finally got some kernel stacks from a dead system and are analyzing them now.
All the platform CPUs appear to be trying to acquire spinlocks.
Backtraces are attached.

Alexander Kozyrev (akozyrev) wrote :

Back-porting this commit from kernel mainline should cure the issue:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=c0ad4aa4d8416a39ad262a2bd68b30acd951bf0e
Testing of this patch is in progress; for details, refer to this thread:
https://<email address hidden>/T/

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/673028

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/673028
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=56a91fec13cd018f998a0bdf0beed753efd94df5
Submitter: Zuul
Branch: master

commit 56a91fec13cd018f998a0bdf0beed753efd94df5
Author: Alex Kozyrev <email address hidden>
Date: Fri Jul 26 13:58:17 2019 -0400

    Backport the fix for deadlock in CFS-bandwidth timer locking

    Low-latency profile of StarlingX is affected by a deadlock in
    CFS scheduler. spin_lock is used in IRQ handler there instead of
    spin_lock_irqsave. This leads to an attempt to lock the same
    spinlock twice and inevitable system freeze. Backporting c0ad4aa4d8
    commit from upstream kernel to cure the issue.

    Change-Id: I5416c0e0886f42d2bcec8e3e5da063e6af6916f8
    Closes-bug: 1832854
    Signed-off-by: Alex Kozyrev <email address hidden>
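
The failure mode described in the commit message can be illustrated with a toy single-CPU model. This is only a sketch under stated assumptions: it is not kernel code, and `ToyCpu`, `irq_handler`, `raise_irq`, `critical_section_plain` and `critical_section_irqsave` are hypothetical names invented for the illustration. It models one non-recursive spinlock and a local interrupt mask. With interrupts left enabled (the `spin_lock` case), an interrupt can arrive while the lock is held and its handler then tries to take the same lock on the same CPU. With interrupts masked for the duration (the `spin_lock_irqsave` case), the handler is deferred until the lock has been released.

```python
from dataclasses import dataclass


@dataclass
class ToyCpu:
    """Toy single-CPU model; none of these names are kernel APIs."""
    lock_held: bool = False     # state of one non-recursive spinlock
    irqs_enabled: bool = True   # local interrupt mask
    irq_pending: bool = False   # interrupt raised while masked
    would_deadlock: int = 0     # handler found the lock held by the code it interrupted
    handled: int = 0            # handler ran to completion


def irq_handler(cpu):
    """Interrupt-context path that needs the same lock."""
    if cpu.lock_held:
        # A real spinlock would spin forever here: the owner is the
        # interrupted code on this same CPU, which cannot resume.
        cpu.would_deadlock += 1
        return
    cpu.lock_held = True
    cpu.handled += 1
    cpu.lock_held = False


def raise_irq(cpu):
    """Deliver the interrupt now, or mark it pending if masked."""
    if cpu.irqs_enabled:
        irq_handler(cpu)
    else:
        cpu.irq_pending = True


def critical_section_plain(cpu):
    """Models spin_lock()/spin_unlock(): interrupts stay enabled."""
    cpu.lock_held = True
    raise_irq(cpu)              # interrupt arrives mid-section
    cpu.lock_held = False


def critical_section_irqsave(cpu):
    """Models spin_lock_irqsave()/spin_unlock_irqrestore()."""
    saved = cpu.irqs_enabled
    cpu.irqs_enabled = False    # mask interrupts before taking the lock
    cpu.lock_held = True
    raise_irq(cpu)              # deferred: interrupts are masked
    cpu.lock_held = False
    cpu.irqs_enabled = saved    # restore the previous interrupt state
    if saved and cpu.irq_pending:
        cpu.irq_pending = False
        irq_handler(cpu)        # pending interrupt runs after unlock


plain = ToyCpu()
critical_section_plain(plain)
safe = ToyCpu()
critical_section_irqsave(safe)
print(plain.would_deadlock, plain.handled)
print(safe.would_deadlock, safe.handled)
```

In this model the plain variant records one would-be deadlock and no completed handler, while the irqsave variant records no deadlock and one completed handler. The real fix is the upstream kernel commit referenced above; the model only shows why masking interrupts around the lock removes the double acquisition.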

Changed in starlingx:
status: In Progress → Fix Released