Comment 4 for bug 1894889

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/762577
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=2fc05673d1a63fa5834c9722c5d53d6e8934a9e3
Submitter: Zuul
Branch: master

commit 2fc05673d1a63fa5834c9722c5d53d6e8934a9e3
Author: Eric MacDonald <email address hidden>
Date: Thu Nov 12 15:39:52 2020 -0500

    Add SysRq crash dump support for pmon quorum health messaging loss

    The hostwd process supports failure handling for two pmon
    quorum failure modes.
     1. persistent pmon quorum process failure
     2. persistent absence of pmon's quorum health report

    This update adds a configuration option, and the implementation
    behind it, to force a crash dump for failure mode 2 above.

    If the Process Monitor (pmon) itself stalls or stops running for
    3 minutes (the default configuration), hostwd triggers a SysRq
    to force a crash dump.

    Test Plan:

    PASS: Verify kdump for pmon quorum health report message loss
    PASS: Verify no kdump when kdump_on_stall is disabled
    PASS: Verify handling when kdump service is not active
    PASS: Verify sighup config change detection and handling

    Regression:

    PASS: Verify softdog timeout handling and logs
    PASS: Verify quorum threshold config change and handling
    PASS: Verify handling with reboot/reset recovery methods disabled
    PASS: Verify enable reboot_on_err config change handling
    PASS: Verify reboot/reset actions are ignored while host is locked
    PASS: Verify pmon failure recovery handling before threshold reached

    Change-Id: Id926447574e02013f83c0170784e2a8f9a46bac1
    Partial-Bug: 1894889
    Depends-On: https://review.opendev.org/#/c/750806
    Signed-off-by: Eric MacDonald <email address hidden>
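
The stall handling described in the commit message can be illustrated with a small sketch. This is not the hostwd code. The class `StallWatch` and the functions `force_crash_dump` and `check_for_stall` are hypothetical names made up for illustration. The option name `kdump_on_stall` comes from the test plan and the 3 minute default from the commit text. On Linux, writing `c` to `/proc/sysrq-trigger` as root forces a kernel crash, which produces a dump when kdump is armed. The trigger path is a parameter here so the logic can be exercised against an ordinary file.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StallWatch:
    """Hypothetical stand-in for hostwd's view of pmon's quorum health reports."""

    stall_timeout_secs: float = 180.0  # 3 minutes, the default named in the commit
    kdump_on_stall: bool = True        # option name taken from the test plan
    last_report: float = 0.0           # timestamp of the most recent health report

    def report_received(self, now: float) -> None:
        # A health report arrived, so the stall timer starts over.
        self.last_report = now

    def is_stalled(self, now: float) -> bool:
        # Stalled once no report has been seen for the whole timeout.
        return (now - self.last_report) >= self.stall_timeout_secs


def force_crash_dump(trigger_path: Path) -> None:
    # On a real host trigger_path would be /proc/sysrq-trigger, and this
    # write would crash the kernel immediately. Never point it there in a test.
    trigger_path.write_text("c")


def check_for_stall(watch: StallWatch, now: float, trigger_path: Path) -> bool:
    """Fire the SysRq crash only if the option is enabled and pmon is stalled."""
    if watch.kdump_on_stall and watch.is_stalled(now):
        force_crash_dump(trigger_path)
        return True
    return False


if __name__ == "__main__":
    # Demo against a temporary file instead of the real SysRq trigger.
    fake_trigger = Path(tempfile.mkdtemp()) / "sysrq-trigger"
    watch = StallWatch()
    watch.report_received(now=10.0)
    print("fired at t=100:", check_for_stall(watch, 100.0, fake_trigger))
    print("fired at t=190:", check_for_stall(watch, 190.0, fake_trigger))
```

Keeping the trigger path injectable is a design choice for the sketch only. It separates the decision (option enabled, timeout exceeded) from the destructive action, so the decision can be checked without crashing a machine.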