VIM can mark all VMs in error state after a swact

Bug #1838810 reported by Frank Miller
This bug affects 1 person
Affects     Status        Importance  Assigned to   Milestone
StarlingX   Fix Released  Medium      Bart Wensley

Bug Description

After a controller swact, if the newly active controller is busy, a race condition can occur in which VIM sets VMs to the error state because of a logic bug in its audit.

Bart analyzed logs from such a scenario and determined that, as mtce comes up on the newly active controller, there is a delay before it reports that a compute host is enabled. If the audit runs during this window, it checks only whether the host state is enabled and, if it is not, sets the VMs to error. The audit should instead check whether the host is "disabled", because the host can be in the "unknown" state for a short period after the swact while the maintenance and VIM processes are starting up.
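The difference between the two criteria can be illustrated with a minimal sketch. This is not the VIM code: the OperState enum and both function names are hypothetical and chosen only to show how a "not enabled" check and an "is disabled" check diverge for a host in the "unknown" state.

```python
from enum import Enum


class OperState(Enum):
    """Hypothetical operational states; illustrative names, not the VIM's own."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    UNKNOWN = "unknown"


def should_fail_instances_not_enabled(oper_state: OperState) -> bool:
    # Criterion described as buggy: any state other than enabled fails the
    # instances, which also catches hosts whose state is not yet known.
    return oper_state != OperState.ENABLED


def should_fail_instances_disabled(oper_state: OperState) -> bool:
    # Criterion described as correct: only an explicitly disabled host
    # causes its instances to be failed.
    return oper_state == OperState.DISABLED


if __name__ == "__main__":
    for state in OperState:
        print(
            state.value,
            should_fail_instances_not_enabled(state),
            should_fail_instances_disabled(state),
        )
```

Under these assumptions the two checks agree for "enabled" and "disabled" hosts and differ only for "unknown", which is the state a host can briefly be in after the swact.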

Frank Miller (sensfan22) wrote :

Setting priority to medium: the race condition is unlikely to occur, but the impact is severe when it does, since all VMs are marked in error.

Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Bart Wensley (bartwensley)
tags: added: stx.2.0 stx.nfv

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/676699

Changed in starlingx:
status: Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (r/stx.2.0)

Reviewed: https://review.opendev.org/676699
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=ac00a68b22538fe41aa26cea95ad6b65772c1ba5
Submitter: Zuul
Branch: r/stx.2.0

commit ac00a68b22538fe41aa26cea95ad6b65772c1ba5
Author: Bart Wensley <email address hidden>
Date: Thu Aug 15 07:41:01 2019 -0500

    Correct VIM host audit criteria for failing instances

    The VIM's host audit will fail instances on any host that is
    "not enabled". That includes hosts where the operational state
    is unknown.

    Updating the check to ensure the host is "disabled" not that it
    is "not enabled" to avoid failing instances on a host where we
    don't know the operational state.

    Change-Id: I68d3e9f63695de721c10fb1dd2b7ac5917cb50fa
    Closes-Bug: 1838810
    Signed-off-by: Bart Wensley <email address hidden>
    (cherry picked from commit f98b388a745f95737bd4c05c24d59aed8b5e3699)

Ghada Khalil (gkhalil)
tags: added: in-r-stx20

Ghada Khalil (gkhalil) wrote :

Also merged in master via: https://review.opendev.org/#/c/676689/ on 2019-08-15
(Note: the wrong bug number was specified in that review, which is why this LP was not updated properly.)

Changed in starlingx:
status: In Progress → Fix Released