Mtce host watchdog too long to detect controller overload

Bug #1894889 reported by Eric MacDonald
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Eric MacDonald

Bug Description

The current maintenance host watchdog timeout is set at 5 minutes and its minimum setting is 1 minute.
5 minutes, and even 1 minute is too long to detect and react to overload conditions that cause non-rt process stalls of over 15 seconds. Need to detect the failure and force self recovery by reset somewhere in the 10-20 second time frame.

This is of particular concern for the Active/Standby System Controllers.

Need the mtce host watchdog to support watchdog timeout interval in the 10 or 10's of seconds time frame, not just 5 minutes.

Severity
--------
Major: Need the mtce host watchdog to support watchdog timeout interval in the 10 or 10's of seconds, not just 5 minutes.

Steps to Reproduce
------------------
Change the /etc/mtc/hostwd.conf:kernwd_update_period = 12

Expected Behavior
------------------
hostwd continues to support pmon quorum monitoring and modified kernel watchdog period with a simple hostwd.conf:kernwd_update_period file:label change.

Actual Behavior
----------------
The hostwd does not permit timeout settings less than 60 seconds.
The pmon quorum monitoring would fail for a host watchdog setting in the 10-12 second range due to current implementation.

Reproducibility
---------------
Reproducible: 100%

System Configuration
--------------------
any

Branch/Pull Time/Commit
-----------------------
all

Last Pass
---------
N/A since the hostwd process has been this way since its introduction.

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Feature development

Workaround
----------
None

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - issue w/ recovery handling

Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Eric MacDonald (rocksolidmtce)
tags: added: stx.5.0 stx.metal
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/762577

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/750806
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=3a6fec50c15153deeed2b85fd58da6470d50914c
Submitter: Zuul
Branch: master

commit 3a6fec50c15153deeed2b85fd58da6470d50914c
Author: Eric MacDonald <email address hidden>
Date: Wed Sep 9 15:39:48 2020 -0400

    Reduce Maintenance Host Watchdog timeout for controllers

    This update makes changes to the maintenance host watchdog
    and reduces the timeout from 5 to 3 minutes for controllers.

    This update also decouples the pmon quorum monitoring
    feature handling from the host watchdog timeout. Both were
    driven off the same select timer which prevented watchdog
    timeout value to be independently changed without affecting
    quorum monitoring.

    A new config label 'kernwd_update_period_stall_detect' is
    added and value loaded for hosts that need more rigid
    process stall detection.

    This new lower timeout value label is loaded and applied to
    hosts that run the system controller function.

    A few logging improvements were made.

    Test Plan:

    PASS: Verify pmon quorum failure handling while unlocked.
                 Was and remains at 3 misses, 60 seconds each.
    PASS: Verify watchdog TO at 12 seconds on controllers.
                 Was 300 secs.
    PASS: Verify kernel watchdog is not enabled when loaded
                 kernwd_update_period is less than 5 seconds.
                 Was 60 secs.
    PASS: Verify process logging ; startup, failure, transient
    PASS: Verify all config values loaded by hostwd process

    Regression:

    PASS: Verify watchdog TO at 300 seconds on non-controllers
    PASS: Verify handling of failed quorum process while locked
    PASS: Verify handling of failed quorum process while unlocked
    PASS: Verify handling of transient quorum messaging loss while
                 unlocked
    PASS: Verify hostwd process patching ; locked and unlocked
                 cases

    PASS: Verify AIO DX System Install
    PASS: Verify Standard System Install

    Note: There is no kernel WD TO log.
          The log is output to the console.

    Change-Id: Iad726436e28dfa48a06743aa166318969eb6915d
    Closes-Bug: #1894889
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/762577
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=2fc05673d1a63fa5834c9722c5d53d6e8934a9e3
Submitter: Zuul
Branch: master

commit 2fc05673d1a63fa5834c9722c5d53d6e8934a9e3
Author: Eric MacDonald <email address hidden>
Date: Thu Nov 12 15:39:52 2020 -0500

    Add SysRq crash dump support for pmon quorum health messaging loss

    The hostwd process supports failure handling for two pmon
    quorum failure modes.
     1. persistent pmon quorum process failure
     2. persistent absence of pmon's quorum health report

    This update adds a new configuration option and associated
    implementation required to force a crash dump action for
    failure mode 2 above.

    This means that if the Process Monitor itself gets stalled or stops
    running for 3 (default config) minutes then the hostwd will trigger
    a SysRq to force a crash dump.

    Test Plan:

    PASS: Verify kdump for pmon quorum health report message loss
    PASS: Verify no kdump when kdump_on_stall is disabled
    PASS: Verify handling when kdump service is not active
    PASS: Verify sighup config change detection and handling

    Regression:

    PASS: Verify softdog timeout handling and logs
    PASS: Verify quorum threshold config change and handling
    PASS: Verify handling with reboot/reset recovery methods disabled
    PASS: Verify enable reboot_on_err config change handling
    PASS: Verify reboot/reset actions are ignored while host is locked
    PASS: Verify pmon failure recovery handling before threshold reached

    Change-Id: Id926447574e02013f83c0170784e2a8f9a46bac1
    Partial-Bug: 1894889
    Depends-On: https://review.opendev.org/#/c/750806
    Signed-off-by: Eric MacDonald <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/762945

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/762945
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=168c862c72a52bc6658dfdef478a8199754cb32e
Submitter: Zuul
Branch: master

commit 168c862c72a52bc6658dfdef478a8199754cb32e
Author: Eric MacDonald <email address hidden>
Date: Mon Nov 16 21:31:20 2020 -0500

    Add SysRq crash dump support for pmon quorum process failure

    This update adds 'pmon quorum process failure handling'
    to the list of Host Watchdog failure modes that trigger
    a crash dump.

    Change-Id: If8632dbe30ea290663177181a42785853ee808d3
    Partial-Bug: 1894889
    Depends-On: https://review.opendev.org/#/c/762577
    Signed-off-by: Eric MacDonald <email address hidden>

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.