standby controller node was stuck during reboot procedure when SM failed

Bug #2041606 reported by Zhixiong Chi
This bug affects 1 person
Affects    Status        Importance  Assigned to   Milestone
StarlingX  Fix Released  Medium      Zhixiong Chi

Bug Description

Brief Description
-----------------
On a Duplex system running on Dell PowerEdge R750 machines, the standby controller (controller-0) was rebooted because of an SM failure, and it then got stuck during the reboot procedure after the active controller sent the reboot command to mtcClient.
When SM fails on the standby controller (heartbeat missed), the active controller sends a reboot command to mtcClient on the standby controller, and mtcClient tries to reboot the system gracefully. If the standby controller has not rebooted within 120s, mtcClient forces a reboot using the following command:
"echo b > /proc/sysrq-trigger". Unfortunately, the Dell PowerEdge R750 then gets stuck and the BMC console doesn't show anything.
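
The escalation described above (graceful reboot first, forced reboot after a timeout) can be sketched in shell as follows. This is only an illustration of the sequence, not mtcClient's actual code; the function names graceful_reboot, force_reboot and reboot_with_fallback are hypothetical, and both reboot paths are replaced by harmless stand-ins so the sketch is safe to run.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the two reboot paths. They only print a
# message, so running this sketch never reboots anything.
graceful_reboot() { echo "graceful reboot requested"; }
# On a real node the forced path would be: echo b > /proc/sysrq-trigger
force_reboot() { echo "forced reboot requested"; }

# reboot_with_fallback [TIMEOUT_SECONDS]
# Ask for a graceful reboot, wait for the timeout (120s by default, the
# value quoted in this report), then escalate to the forced path. On a
# real system, reaching the last line means the graceful reboot did not
# take effect in time.
reboot_with_fallback() {
    local timeout=${1:-120}
    graceful_reboot
    sleep "$timeout"
    force_reboot
}
```

Calling reboot_with_fallback 5 prints the graceful message, waits 5 seconds, then prints the forced message. The hang reported here occurs in the step that the second stand-in represents.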

Severity
--------
Major

Steps to Reproduce
------------------
(1) Stop SM heartbeat on standby controller (controller-0) using the below commands.

sudo bash
date; cat /var/run/hbsAgent.pid | xargs kill -SIGSTOP; cat /var/run/sm.pid | xargs kill -SIGSTOP

(2) mtcAgent on controller-1 then sends a reboot command to mtcClient on controller-0, and mtcClient tries to reboot controller-0 gracefully. If the controller has not rebooted within 120s, mtcClient forces a reboot using "echo b > /proc/sysrq-trigger", and the system gets stuck (the console shows 'sysrq: Resetting').

Or
Execute the following command, 'sudo -i; echo b > /proc/sysrq-trigger', then check whether the system reboots properly every time.

Expected Behavior
------------------
The system reboots properly every time.

Actual Behavior
----------------
Sometimes the system hangs and the reboot fails.

Reproducibility
---------------
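Intermittent (not 100%). According to the fix commit, the hang can usually be reproduced within about 5 test cycles.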

System Configuration
--------------------
AIO-DX on Dell PowerEdge R750 machine.

Branch/Pull Time/Commit
-----------------------
N/A

Last Pass
---------

Timestamp/Logs
--------------

Test Activity
-------------

Workaround
----------
Edit /boot/1/kernel.env or /boot/efi/EFI/BOOT/boot.env to add the kernel option 'reboot=p'.
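
After applying the workaround and rebooting, one way to confirm that the option took effect is to look for it on the running kernel's command line. The helper below is a minimal sketch; the function name reboot_param is hypothetical, and it simply reports the last "reboot=" value it finds in the string it is given.

```shell
#!/usr/bin/env bash
# reboot_param CMDLINE
# Print the value of the last "reboot=" option found in a kernel command
# line string, or an empty line if the option is absent.
reboot_param() {
    local -a words
    local word value=""
    # Split the command line on whitespace without glob expansion.
    read -r -a words <<< "$1"
    for word in "${words[@]}"; do
        case $word in
            reboot=*) value=${word#reboot=} ;;
        esac
    done
    printf '%s\n' "$value"
}
```

On a running system it could be called as reboot_param "$(cat /proc/cmdline)"; an output of p would indicate that the workaround option is present.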

Changed in starlingx:
assignee: nobody → Zhixiong Chi (zhixiongchi)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kernel (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/kernel/+/899561

OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/c/starlingx/kernel/+/899561
Committed: https://opendev.org/starlingx/kernel/commit/134d5d2fbd2a1207074419722dfa903d4be2f328
Submitter: "Zuul (22348)"
Branch: master

commit 134d5d2fbd2a1207074419722dfa903d4be2f328
Author: Zhixiong Chi <email address hidden>
Date: Fri Oct 27 03:10:12 2023 -0700

    Add the pci reboot quirk in DMI table for Dell PowerEdge R750

    Problem:
    The Dell R750 hangs after the following command is executed:
    $sudo -i /bin/bash -c 'echo b > /proc/sysrq-trigger'
    This issue can usually be reproduced within about 5 test cycles.

    The activated controller will send reboot command to mtcClient on the
    standby controller due to the SM failure(heartbeat missed), and then
    mtcClient tries to reboot the system gracefully. But if the standby
    controller isn't rebooted within 120s, mtcClient tries to force reboot
    it using the following command "echo b > /proc/sysrq-trigger".
    Unfortunately the machine Dell PowerEdge R750 is stuck and the BMC
    console doesn't show anything.

    Solution:
    A search for clues relevant to this machine turned up nothing except
    the kernel parameter 'reboot=p', which changes the reboot type to
    pci_reboot for the sysrq magic key. After running the test cycle
    multiple times with this kernel option, the issue no longer occurred
    and the system rebooted properly, as expected. This approach is
    therefore suitable for resetting the Dell R750.
    Because this kernel option should not be applied to all target
    machines, we change the reboot type only for the R750, based on a DMI
    table quirk. Other machines still use the default reboot type; this
    commit affects only the R750.

    Based on the above, we add the pci reboot quirk to the DMI table to
    change reboot_type to pci_reboot, so that the kernel on the Dell
    PowerEdge R750 reboots properly.

    On the R750 target we can see the following dmidecode information:
    $sudo dmidecode |grep 'Product Name'
            Product Name: PowerEdge R750
    $sudo dmidecode |grep 'Vendor'
            Vendor: Dell Inc.

    TestPlan:
    PASS: downloader && build-pkgs && build-image
    PASS: Jenkins Installation on R750 machine and the other labs.
    PASS: Execute the following testing cycle more than 20 times:
           $sudo -i /bin/bash -c 'echo b > /proc/sysrq-trigger'
           The system can reboot properly every time during test cycles.
           The stuck issue after reset hasn't been seen anymore.

    Closes-Bug: 2041606

    Signed-off-by: Zhixiong Chi <email address hidden>
    Change-Id: I05467cc6d5105aa813852dca0c935278741b043f
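
The DMI-based selection described in the commit message can be approximated from user space to check whether a given machine would be covered by such a quirk. The sketch below is not the kernel patch itself: the function name is_r750 is hypothetical, and the substring comparison against the two strings from the dmidecode output above is an assumption about how the match behaves.

```shell
#!/usr/bin/env bash
# is_r750 VENDOR PRODUCT
# Succeed (exit status 0) when the vendor string contains "Dell Inc." and
# the product string contains "PowerEdge R750". Substring matching is an
# assumption of this sketch.
is_r750() {
    [[ $1 == *"Dell Inc."* && $2 == *"PowerEdge R750"* ]]
}
```

On a Linux host the two arguments can be read from sysfs, for example is_r750 "$(cat /sys/class/dmi/id/sys_vendor)" "$(cat /sys/class/dmi/id/product_name)". Machines for which the function fails would keep the default reboot type.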

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.9.0 stx.distro.other stx.kernel
Changed in starlingx:
importance: Undecided → Medium