Comment 16 for bug 1930361

OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari-monitors (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/802351
Committed: https://opendev.org/openstack/masakari-monitors/commit/9ae886e7428e61dfc6a29ec65b0f6836d2648326
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 9ae886e7428e61dfc6a29ec65b0f6836d2648326
Author: sue <sugar-2008@163.com>
Date: Wed Jun 2 16:38:05 2021 +0800

    Fix hostmonitor hanging forever after certain exceptions

    The hostmonitor, like other Masakari monitors, starts as an
    Oslo service (based on eventlet). The main thread is supposed
    to run a loop that has an internal wait mechanism (instead of
    reusing periodic_tasks from oslo_service). However, an
    unexpected exception could break the loop: it never ran
    again, yet the process stayed alive (because oslo_service
    did not stop). The example mentioned in the bug
    report is about unavailability of the Masakari API (and/or
    Keystone API) before notification sending. This exception is
    not caught early because SendNotification._make_client is
    called outside of the try block (unlike the actual notification
    sending). The exception bubbles up and stops the main loop,
    leaving a useless hostmonitor process. The user is unaware
    unless they notice the logs are no longer growing.

    While the general design begs for a revamp (we might get away
    with that by using Consul in the first place), the easy fix is
    to prevent exceptions breaking the loop completely so that the
    hostmonitor can continue to work and try to regain health.
    At the very least it will keep posting ERROR messages in the
    log, which are more likely to be spotted than a lack of logs
    (which is, unfortunately, less commonly considered an alerting
    situation).

    This change also fixes, adapts and robustifies the two relevant
    unit tests.

    Closes-Bug: #1930361
    Co-Authored-By: Radosław Piliszek <email address hidden>
    Change-Id: I7e3447dcddc7998e3e3c30f4f0019d91a99c79ce
    (cherry picked from commit e7154f3d77ee4c06eec603a850ec941668eb602f)
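
The pattern described in the commit message above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the actual masakari-monitors code: `run_loop`, `flaky_check` and `ApiUnavailable` are hypothetical names, and the real hostmonitor runs its loop under oslo_service/eventlet rather than a plain `for` loop. The point is only that each iteration is wrapped in a catch-all handler, so a failure (for example, the API being unreachable while the client is created) is logged at ERROR level and the loop carries on.

```python
import logging
import time

LOG = logging.getLogger(__name__)


class ApiUnavailable(Exception):
    """Hypothetical stand-in for a Masakari/Keystone API connection failure."""


def run_loop(check_hosts, iterations, interval=0.0):
    """Call check_hosts repeatedly, surviving any exception it raises.

    Returns the number of iterations that failed.
    """
    failures = 0
    for _ in range(iterations):
        try:
            # Everything that can fail, including client creation,
            # is assumed to happen inside check_hosts.
            check_hosts()
        except Exception:
            # Log with traceback and keep going instead of letting
            # the exception end the loop.
            failures += 1
            LOG.exception("Host monitoring iteration failed; will retry.")
        time.sleep(interval)
    return failures


# Usage: the first two iterations fail, the remaining three succeed.
calls = []


def flaky_check():
    calls.append(1)
    if len(calls) <= 2:
        raise ApiUnavailable("API not reachable")


failed = run_loop(flaky_check, iterations=5)
```

Without the `try`/`except` inside the loop, the first `ApiUnavailable` would propagate out of `run_loop` and no further checks would run, which matches the hang described in the bug report.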