pmon recovered failed process during unlock

Bug #1883519 reported by Eric MacDonald
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
During the unlock shutdown of an AIO SX, pmon recovered the mtcClient, which indirectly cancelled the mtcClient's fail-safe sysreq reboot thread. In this case the shutdown did not complete and, with the fail-safe thread cancelled, the AIO SX did not reboot over the unlock.

Severity
--------
Major: with manual workaround system is still usable

Steps to Reproduce
------------------
lock/unlock AIO SX

Expected Behavior
------------------
AIO SX reboots

Actual Behavior
----------------
AIO SX did not reboot

Reproducibility
---------------
Very rare. Shutdown needs to not complete at all.

System Configuration
--------------------
AIO SX

Branch/Pull Time/Commit
-----------------------
Current StarlingX master branch context at time this issue was created.

Last Pass
---------
Issue never observed before.

Timestamp/Logs
--------------
2020-06-09T20:30:37.021 [83228.00298] controller-0 pmond com nodeUtil.cpp (1864) get_system_state : Warn : systemctl is-system-running yielded no response

Test Activity
-------------
Developer Testing

Workaround
----------
manual reboot

Ghada Khalil (gkhalil)
tags: added: stx.metal
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - workaround exists

Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Eric MacDonald (rocksolidmtce)
tags: added: stx.4.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/735609
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=e379fdfe189f067445daeac69e8997192c8f0aed
Submitter: Zuul
Branch: master

commit e379fdfe189f067445daeac69e8997192c8f0aed
Author: Eric MacDonald <email address hidden>
Date: Mon Jun 15 11:09:47 2020 -0400

    Prevent pmond process recovery when system is not running

    The maintenance process monitor (pmon) should only
    recover failed processes when the system state is
    'running' or 'degraded'.

    The current implementation allowed process recovery
    for other non-inservice states, including an unknown
    state if systemd returns no data on the state query.

    This update tightens up the system state check by
    adding retries to the state query utility and
    restricting accepted states to 'running' and 'degraded'.

    If the pmon state query returns anything other than
    'running' or 'degraded' during a shutdown, this change
    prevents pmon from inadvertently killing and recovering
    the mtcClient, which would indirectly kill off the
    mtcClient's fail-safe sysreq reboot child thread.

    Change-Id: I605ae8be06f8f8351a51afce98a4f8bae54a40fd
    Closes-Bug: 1883519
    Signed-off-by: Eric MacDonald <email address hidden>
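
The state check described in this commit can be illustrated with a minimal Python sketch. It is not the pmond implementation (pmond is C++, see nodeUtil.cpp in the log above); the function names, the retry count and the "unknown" fallback are assumptions made for illustration. Only the `systemctl is-system-running` command and the two accepted states come from this report.

```python
import subprocess
from typing import Callable, Optional

# States in which process recovery is allowed, per the commit message.
RECOVERY_STATES = frozenset({"running", "degraded"})


def systemctl_state() -> Optional[str]:
    """Ask systemd for the overall system state; None if no answer."""
    try:
        result = subprocess.run(
            ["systemctl", "is-system-running"],
            capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout


def query_state_with_retries(query: Callable[[], Optional[str]] = systemctl_state,
                             retries: int = 3) -> str:
    """Retry the query; report "unknown" if every attempt yields no data.

    The retry count of 3 is an arbitrary value chosen for this sketch.
    """
    for _ in range(retries):
        state = (query() or "").strip()
        if state:
            return state
    return "unknown"


def recovery_allowed(state: str) -> bool:
    """Only 'running' and 'degraded' permit recovery of a failed process."""
    return state in RECOVERY_STATES
```

With this shape, an empty response such as the one in the log above ("yielded no response") maps to "unknown" after the retries are exhausted, and "unknown" does not permit recovery.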

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/745764

Changed in starlingx:
status: Fix Released → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/745764
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=7f6cd7ae3aaa81b11deab9a7efc52c8ddfd47ba1
Submitter: Zuul
Branch: master

commit 7f6cd7ae3aaa81b11deab9a7efc52c8ddfd47ba1
Author: Eric MacDonald <email address hidden>
Date: Tue Aug 11 11:30:53 2020 -0400

    Stop the process monitor (pmond) on controlled self-reboot

    There are still cases where an AIO SX unlock operation
    fails to reboot because pmond recovers the mtcClient after
    the mtcClient has started a self-reboot and launched its
    fail-safe sysreq reset thread.

    Following a self-reboot, the Process Monitor (pmond) detects an active
    monitoring failure of the mtcClient. However, at that same time
    systemctl reports that the system is running degraded, not stopping.

    So pmon, even with the previous fix, does not know that the
    system is stopping and restarts the mtcClient, as before,
    but this time with a valid systemctl state readout.

    This update further enhances the fix for the issue reported by
    https://bugs.launchpad.net/starlingx/+bug/1883519 (previous update
    https://review.opendev.org/#/c/735609) by having the mtcClient
    stop pmond, with verification and retries, immediately before
    a self-reboot.

    Change-Id: I17fde797803c537f4f448b4764585f1f1acc4e2a
    Closes-Bug: 1883519
    Signed-off-by: Eric MacDonald <email address hidden>
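
The "stop pmond, with verification and retries" step can be sketched in Python as follows. This is an illustration of the pattern only, not the mtcClient code: the `stop` and `is_running` callables, the retry count and the delay are hypothetical, and the report does not say how the mtcClient actually stops or verifies pmond.

```python
import time
from typing import Callable


def stop_with_verification(stop: Callable[[], None],
                           is_running: Callable[[], bool],
                           retries: int = 3,
                           delay_secs: float = 0.0) -> bool:
    """Issue a stop request, then verify it took effect; retry if not.

    Returns True once the monitored service is confirmed stopped,
    False if it is still running after all retries.
    """
    for _ in range(retries):
        stop()
        if not is_running():
            return True
        time.sleep(delay_secs)
    return False
```

A caller would proceed to the self-reboot only after this returns True, so that nothing is left running that could restart the process holding the fail-safe reboot thread.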

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

still seeing issues with AIO-SX unlocks on stx master; re-opening

Changed in starlingx:
status: Fix Released → Confirmed
tags: added: stx.5.0
removed: stx.4.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/759322

Changed in starlingx:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/759322
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=c62ce9f09b68cc3064c9bb75e267ef012b59c112
Submitter: Zuul
Branch: master

commit c62ce9f09b68cc3064c9bb75e267ef012b59c112
Author: Eric MacDonald <email address hidden>
Date: Thu Oct 22 17:00:05 2020 -0400

    Fix AIO SX Lazy Reboot race condition

    A race condition was found in the mtcClient's current
    lazy reboot handling. The race condition is removed by
    stopping pmond before entering the lazy reboot wait loop.

    A further enhancement was added to the mtcClient startup
    process to detect if it had been restarted while acting
    on a reboot request. In that unlikely case, the mtcClient
    will now resume that action to ensure that the required
    reboot occurs.

    Change-Id: Ide1ba979c32e1d2005410e42ef6e0e14f10f5cb0
    Closes-Bug: 1883519
    Signed-off-by: Eric MacDonald <email address hidden>
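
The startup check described in this commit (resume a reboot that was in progress when the process was restarted) can be sketched in Python. The commit message does not say how the mtcClient records that it was acting on a reboot request; the marker file used here, and all function names, are assumptions made purely for illustration.

```python
import os
from typing import Callable


def mark_reboot_pending(marker_path: str) -> None:
    """Record that a reboot request is being acted on (hypothetical marker file)."""
    with open(marker_path, "w") as marker:
        marker.write("reboot\n")


def resume_if_reboot_pending(marker_path: str,
                             reboot: Callable[[], None]) -> bool:
    """On startup, resume the reboot if a previous instance was interrupted.

    Returns True if a pending reboot was found and resumed.
    """
    if os.path.exists(marker_path):
        reboot()
        return True
    return False
```

In this sketch the marker would need to live somewhere that does not survive a reboot (for example a tmpfs path), otherwise the process would request another reboot after the system comes back up.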

Changed in starlingx:
status: In Progress → Fix Released