hostmonitor hangs after notifications send failed

Bug #1930361 reported by suzhengwei
This bug affects 1 person
Affects                     Status         Importance  Assigned to  Milestone
masakari-monitors           Fix Released   Critical    suzhengwei
Ussuri                      Fix Committed  Critical    Unassigned
Victoria                    Fix Committed  Critical    Unassigned
Wallaby                     Fix Committed  Critical    Unassigned
Xena                        Fix Released   Critical    suzhengwei
masakari-monitors (Ubuntu)  Confirmed      Undecided   Unassigned

Bug Description

In one environment, we found that a hostmonitor stopped logging after it failed to send a host failure notification.

I noticed that monitor_hosts exits as soon as it encounters an exception. The risk is that if a host goes down later, no recovery will be triggered.

See comment #5 for a detailed analysis.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

In Kolla Ansible we work around such issues by ensuring the container is set to restart automatically on failures.

I agree it should be circumvented at the process level as well.

Changed in masakari-monitors:
importance: Undecided → Medium
status: New → Triaged
suzhengwei (sue.sam)
description: updated
Changed in masakari-monitors:
assignee: nobody → suzhengwei (sue.sam)
importance: Medium → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari-monitors (master)
Changed in masakari-monitors:
status: Triaged → In Progress
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Please provide example error stacks. What exactly fails there?

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

From the IRC meeting:

08:57:10 <suzhengwei_> It is easy to reproduce. While keystone or masakari-api is out of service, trigger one host failure.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

/triaged
/critical

The hostmonitor, like other Masakari monitors, starts as an Oslo service (based on eventlet). The main thread is supposed to run a loop that has an internal wait mechanism (instead of reusing periodic_tasks from oslo_service). However, the loop can be broken, if an unexpected exception appears, and it never runs again while the process is still alive (due to oslo_service not stopping). The example mentioned here is about unavailability of the Masakari API (and/or Keystone API) before notification sending. This exception is not caught early because SendNotification._make_client is called outside of the try block (unlike the actual notification sending). The exception bubbles up and stops the main loop, leaving a useless hostmonitor process. The user is unaware unless they notice the logs are no longer growing. Hence, this is a critical issue.
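The failure mode described above can be reproduced in isolation. What follows is a minimal sketch, not the masakari-monitors code: `make_client`, `send_notification`, `monitor_hosts`, and `ApiUnavailable` are hypothetical stand-ins, and a plain `threading.Thread` is assumed in place of the eventlet-based Oslo service. It shows how an exception raised outside the `try` block ends the loop while the surrounding process keeps running.

```python
import threading
import time


class ApiUnavailable(Exception):
    """Hypothetical stand-in for a Keystone/Masakari API connection error."""


def make_client():
    # Stand-in for client construction; here it always fails, as it would
    # while the API is out of service.
    raise ApiUnavailable("API endpoint is not reachable")


def send_notification(event):
    client = make_client()  # raises here, outside the try block
    try:
        # Never reached in this sketch; only the send itself is guarded.
        client.create_notification(event)
    except Exception as exc:
        print(f"send failed, would retry: {exc}")


iterations = []


def monitor_hosts():
    # Unguarded main loop: the first escaping exception ends it for good.
    while True:
        iterations.append(time.monotonic())
        send_notification({"host": "compute-1", "event": "STOPPED"})
        time.sleep(0.01)


def run_demo():
    # Suppress the worker's default traceback output for the demo only.
    previous_hook = threading.excepthook
    threading.excepthook = lambda args: None
    try:
        worker = threading.Thread(target=monitor_hosts, daemon=True)
        worker.start()
        worker.join(timeout=1)
        # The loop ran once and died; the calling process is still running.
        return len(iterations), worker.is_alive()
    finally:
        threading.excepthook = previous_hook
```

Calling `run_demo()` once returns the number of loop iterations that ran and whether the worker is still alive. The caller carries on normally even though monitoring has stopped, which is the silent hang reported here.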

While the general design begs for a revamp (we might get away with that by using Consul in the first place), the easy fix is to prevent exceptions breaking the loop completely so that the hostmonitor can continue to work and try to regain health. At the very least it will keep posting ERROR messages in the log which is more likely to be spotted in comparison to lack of logs (which is less commonly considered an alerting situation but should be).
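The "easy fix" idea can be sketched as a loop that logs any exception from one iteration and then carries on. This is an illustration under stated assumptions, not the merged patch: `run_guarded_loop`, `check_once`, and the logger name are hypothetical, and `threading.Event` is assumed as the wait mechanism instead of whatever the hostmonitor uses under eventlet.

```python
import logging
import threading

LOG = logging.getLogger("hostmonitor-sketch")  # hypothetical logger name


def run_guarded_loop(check_once, interval, stop_event):
    """Call check_once() repeatedly until stop_event is set.

    An exception from a single iteration is logged and the loop continues.
    """
    while not stop_event.is_set():
        try:
            check_once()
        except Exception:
            # Logs at ERROR level with the traceback, then keeps looping.
            LOG.exception("Monitoring iteration failed; will retry")
        # Event.wait doubles as an interruptible sleep between iterations.
        stop_event.wait(interval)


def demo():
    stop = threading.Event()
    calls = []

    def flaky_check():
        calls.append(len(calls))
        if len(calls) <= 2:
            # Simulates the API being unavailable on the first two rounds.
            raise ConnectionError("API unavailable")
        if len(calls) == 5:
            stop.set()

    run_guarded_loop(flaky_check, interval=0.01, stop_event=stop)
    return len(calls)
```

In `demo()`, the first two iterations fail and are logged, and the loop still reaches later iterations once the simulated outage ends. Catching `Exception` rather than `BaseException` is a deliberate choice in this sketch so that `KeyboardInterrupt` and `SystemExit` can still stop the process.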

description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari-monitors (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/masakari-monitors/+/802348

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari-monitors (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/masakari-monitors/+/802349

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari-monitors (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/masakari-monitors/+/802350

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari-monitors (master)

Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/794162
Committed: https://opendev.org/openstack/masakari-monitors/commit/e7154f3d77ee4c06eec603a850ec941668eb602f
Submitter: "Zuul (22348)"
Branch: master

commit e7154f3d77ee4c06eec603a850ec941668eb602f
Author: sue <sugar-2008@163.com>
Date: Wed Jun 2 16:38:05 2021 +0800

    Fix hostmonitor hanging forever after certain exceptions

    The hostmonitor, like other Masakari monitors, starts as an
    Oslo service (based on eventlet). The main thread is supposed
    to run a loop that has an internal wait mechanism (instead of
    reusing periodic_tasks from oslo_service). However, the loop
    could be broken, if an unexpected exception appeared, and it
    never ran again but the process was still alive (due to
    oslo_service not stopping). The example mentioned in the bug
    report is about unavailability of the Masakari API (and/or
    Keystone API) before notification sending. This exception is
    not caught early because SendNotification._make_client is
    called outside of the try block (unlike the actual notification
    sending). The exception bubbles up and stops the main loop,
    leaving a useless hostmonitor process. The user is unaware
    unless they notice the logs are no longer growing.

    While the general design begs for a revamp (we might get away
    with that by using Consul in the first place), the easy fix is
    to prevent exceptions breaking the loop completely so that the
    hostmonitor can continue to work and try to regain health.
    At the very least it will keep posting ERROR messages in the log
    which is more likely to be spotted in comparison to lack of logs
    (which is, unfortunately, less commonly considered an alerting
    situation).

    This change also fixes, adapts and robustifies the two relevant
    unit tests.

    Closes-Bug: #1930361
    Co-Authored-By: Radosław Piliszek <email address hidden>
    Change-Id: I7e3447dcddc7998e3e3c30f4f0019d91a99c79ce

Changed in masakari-monitors:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari-monitors (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/masakari-monitors/+/802351

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Hey Canonical OpenStack team, assigning this bug to you as well to let you know there will be a relatively important patch to be applied. I will be making some relevant releases soon.

Changed in masakari-monitors (Ubuntu):
status: New → Confirmed
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

BTW, RCA in comment #5

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari-monitors (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/802348
Committed: https://opendev.org/openstack/masakari-monitors/commit/a981e0df31805b2cc3feb0a795e5d6cb2cd70c88
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit a981e0df31805b2cc3feb0a795e5d6cb2cd70c88
Author: sue <sugar-2008@163.com>
Date: Wed Jun 2 16:38:05 2021 +0800

    Fix hostmonitor hanging forever after certain exceptions

    Closes-Bug: #1930361
    Co-Authored-By: Radosław Piliszek <email address hidden>
    Change-Id: I7e3447dcddc7998e3e3c30f4f0019d91a99c79ce
    (cherry picked from commit e7154f3d77ee4c06eec603a850ec941668eb602f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari-monitors (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/802349
Committed: https://opendev.org/openstack/masakari-monitors/commit/020e13e04eac6dae2884787a3844cd045b9f166d
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 020e13e04eac6dae2884787a3844cd045b9f166d
Author: sue <sugar-2008@163.com>
Date: Wed Jun 2 16:38:05 2021 +0800

    Fix hostmonitor hanging forever after certain exceptions

    Closes-Bug: #1930361
    Co-Authored-By: Radosław Piliszek <email address hidden>
    Change-Id: I7e3447dcddc7998e3e3c30f4f0019d91a99c79ce
    (cherry picked from commit e7154f3d77ee4c06eec603a850ec941668eb602f)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari-monitors (stable/train)

Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/802350
Committed: https://opendev.org/openstack/masakari-monitors/commit/ad90db8811e8da41e6b5218ef3127b7562381e51
Submitter: "Zuul (22348)"
Branch: stable/train

commit ad90db8811e8da41e6b5218ef3127b7562381e51
Author: sue <sugar-2008@163.com>
Date: Wed Jun 2 16:38:05 2021 +0800

    Fix hostmonitor hanging forever after certain exceptions

    Closes-Bug: #1930361
    Co-Authored-By: Radosław Piliszek <email address hidden>
    Change-Id: I7e3447dcddc7998e3e3c30f4f0019d91a99c79ce
    (cherry picked from commit e7154f3d77ee4c06eec603a850ec941668eb602f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari-monitors (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/802351
Committed: https://opendev.org/openstack/masakari-monitors/commit/9ae886e7428e61dfc6a29ec65b0f6836d2648326
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 9ae886e7428e61dfc6a29ec65b0f6836d2648326
Author: sue <sugar-2008@163.com>
Date: Wed Jun 2 16:38:05 2021 +0800

    Fix hostmonitor hanging forever after certain exceptions

    Closes-Bug: #1930361
    Co-Authored-By: Radosław Piliszek <email address hidden>
    Change-Id: I7e3447dcddc7998e3e3c30f4f0019d91a99c79ce
    (cherry picked from commit e7154f3d77ee4c06eec603a850ec941668eb602f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari-monitors 9.0.2

This issue was fixed in the openstack/masakari-monitors 9.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari-monitors 10.0.1

This issue was fixed in the openstack/masakari-monitors 10.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari-monitors 11.0.1

This issue was fixed in the openstack/masakari-monitors 11.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari-monitors 12.0.0.0rc1

This issue was fixed in the openstack/masakari-monitors 12.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari-monitors train-eol

This issue was fixed in the openstack/masakari-monitors train-eol release.
