StarlingX

Bug #1883004
Comment #2

OpenStack Infra (hudson-openstack) wrote on 2020-06-21: Fix merged to ha (master)

Reviewed: https://review.opendev.org/735219
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=630a777cbb894501cb019c917c1be8288e7a7c36
Submitter: Zuul
Branch: master

commit 630a777cbb894501cb019c917c1be8288e7a7c36
Author: Eric MacDonald <email address hidden>
Date: Thu Jun 11 15:32:47 2020 -0400

Add unhealthy state recovery audit to service management (sm)

    Service Management (SM) monitors connectivity and health of
    its peer controller over the OAM, Mgmt and (if provisioned)
    Cluster-Host networks.

    If SM sees all the links to its peer go 'carrier down' at
    virtually the same time, both controllers may declare
    themselves unhealthy and both go disabled; i.e. shut down
    all services with no automatic recovery.

    This update adds an 'Unhealthy State Recovery Audit' to SM which
    forces a self restart when all of its monitored links recover
    for cases where both controllers go unhealthy-shutdown or both
    controllers remain active in split-brain.
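
    The audit's decision rule, as described above, can be sketched
    roughly as follows. This is an illustrative Python sketch only,
    not the SM implementation: the names ControllerMode, LinkState
    and needs_self_restart are hypothetical, and it assumes the
    audit can see a per-network link state for each monitored
    network.

```python
from enum import Enum


class LinkState(Enum):
    # Hypothetical link states; SM's real state names may differ.
    UP = "up"
    CARRIER_DOWN = "carrier-down"


class ControllerMode(Enum):
    # Hypothetical modes standing in for the cases named in the commit message.
    HEALTHY = "healthy"
    UNHEALTHY_SHUTDOWN = "unhealthy-shutdown"
    SPLIT_BRAIN_ACTIVE = "split-brain-active"


def needs_self_restart(mode, links):
    """Return True when the audit should force a self restart.

    `links` maps a monitored network name (for example "oam", "mgmt",
    "cluster-host") to its LinkState. A restart is requested only when
    the controller is in one of the two stuck modes AND every
    monitored link has recovered.
    """
    if mode is ControllerMode.HEALTHY:
        return False  # nothing to recover from
    if not links:
        return False  # no monitored links, so no recovery can be observed
    return all(state is LinkState.UP for state in links.values())


if __name__ == "__main__":
    recovered = {"oam": LinkState.UP, "mgmt": LinkState.UP}
    print(needs_self_restart(ControllerMode.UNHEALTHY_SHUTDOWN, recovered))
```

    In this sketch a partial recovery (for example, OAM back up
    while Mgmt is still carrier down) does not trigger a restart,
    matching the "all of its monitored links recover" wording.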

Test Plan:

    PASS: Verify AIO SX install
    PASS: Verify Standard system install and unhealthy state recovery
    PASS: Verify single link failure end to end behavior
    PASS: Verify 2 of 3 link failure end to end behavior
    PASS: Verify all link failure end to end behavior
    PASS: Verify SM and Mtce heartbeat recovery over unhealthy state recovery
    PASS: Verify swact back and forth following a recovery
    PASS: Verify process restart as part of unhealthy state recovery
    PASS: Verify AIO DX install and unhealthy state recovery

    Change-Id: Ie906eaf04bec607328b7e0af09b37fa0558e3bbe
    Closes-Bug: 1883004
    Signed-off-by: Eric MacDonald <email address hidden>