RR patch failed when trying to lock a host because of a failed live-migration/stop instance

Bug #1960833 reported by Heitor Matsui
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
StarlingX | Fix Released | Medium | Heitor Matsui |

Bug Description

Brief Description
-----------------
RR patch could not be removed because of an error when trying to lock a host: Lock of host(s) compute-1 failed because instance(s) tenant2-virtio26 were not migrated or stopped.

Severity
-----------------
Major

Steps to Reproduce
-----------------
Launch 60 VMs on DX+
Apply RR patch using orchestration (Alarm: relaxed, VM: migrate)
Remove RR patch using orchestration (same settings)
Verify that the RR patch operation fails because a VM could not be migrated

Expected Behavior
-----------------
Patch application and removal should complete without errors.

Actual Behavior
-----------------
RR patch removal failed because a host could not be locked as a VM did not migrate/stop.

Reproducibility
-----------------
Appears to be intermittent, but migration failures happen frequently. The developer's patch and the RR patch applied successfully, but the RR patch removal failed.

System Configuration
-----------------
DX+ workers

Branch/Pull Time/Commit
-----------------
2021-06-09_18-58-11

Last Pass
-----------------
Not applicable

Timestamp/Logs
-----------------
Moment of failure:

log-id = 29
event-id = sw-patch-auto-apply-failed
event-type = action-event
event-context = admin
importance = high
entity = orchestration=sw-patch
reason_text = Software patch auto-apply failed, reason = Lock of host(s) compute-1 failed because instance(s) tenant2-virtio26 were not migrated or stopped
additional_text =
timestamp = 2021-08-10 16:53:49.149041

Last successful tenant2-virtio26 migration:

2021-08-10 16:28:27.501378 \N \N 3538 c4cf6f2a-445b-4b2a-ab55-695ed24b33e5 700.156 log tenant.instance tenant=5fbc1d9a-ddc3-49f1-bcce-5d48a1a399d6.instance=318cfd89-6425-4dad-b678-cfea2509ef04 2021-08-10 16:28:27.460548 critical Live-Migrate complete for instance tenant2-virtio26 now enabled on host compute-1 equipment unspecified-reason f t \N

Test Activity
-----------------
Developer testing

Workaround
-----------------
Not applicable

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/825879
Committed: https://opendev.org/starlingx/nfv/commit/1c4e0484659f5b98c09485e135fd02386eee2dac
Submitter: "Zuul (22348)"
Branch: master

commit 1c4e0484659f5b98c09485e135fd02386eee2dac
Author: Heitor Matsui <email address hidden>
Date: Fri Jan 21 17:14:08 2022 -0300

    Add migrate steps for hosts without instances

    During the patch strategy creation the migrate-instances step
    only happens for hosts who have instances running at that moment.
    As a consequence, if an instance is migrated, during patching
    operation, to a host that didn't have any instances running
    previously, the patch operation will fail as it will try to lock
    the host directly, without migrating its instances previously.
    This issue can happen either during patch application or removal.

    This commit changes the patching build strategy adding the
    migrate-instances-from-host step that will be applied to all
    worker hosts unconditionally (given they are OpenStack compute
    nodes), and because the previous step (migrate-instances) was built
    for a list of instances, some implementations had to take place to
    allow building it for a list of hosts.

    Test Plan
    PASS: serial patch application runs successfully outside
          Openstack context;
    PASS: parallel patch application runs successfully outside
          Openstack context;
    PASS: serial patch application runs successfully with a host
          not having instances before patch operation begins and
          having an instance migrated to it during patch application;
    PASS: parallel patch application runs successfully with a host
          not having instances before patch operation begins and
          having an instance migrated to it during patch application;

    Closes-bug: 1960833
    Change-Id: I99675ea0b5d0c75bc84c78864b118debc265ceb4
    Signed-off-by: Heitor Matsui <email address hidden>
    Co-authored-by: Rafael Falcão <email address hidden>
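
The behaviour change described in the commit message can be illustrated with a minimal sketch. This is not the nfv code: the function names, the data shapes, and the lock/patch/unlock step names are assumptions made for illustration. Only the migrate-instances and migrate-instances-from-host step names come from the commit message. The model also simplifies a migrate step to "the host is empty afterwards".

```python
from typing import Dict, List

# Step names taken from the commit message.
MIGRATE_INSTANCES = "migrate-instances"
MIGRATE_FROM_HOST = "migrate-instances-from-host"
# Assumed step names, used only to make the sketch complete.
LOCK = "lock-hosts"
PATCH = "sw-patch-hosts"
UNLOCK = "unlock-hosts"


def build_stage_before_fix(host: str,
                           instances_at_build: Dict[str, List[str]]) -> List[str]:
    """Hypothetical pre-fix builder: the migrate step is added only when the
    host has instances at the moment the strategy is created."""
    steps: List[str] = []
    if instances_at_build.get(host):
        steps.append(MIGRATE_INSTANCES)
    steps += [LOCK, PATCH, UNLOCK]
    return steps


def build_stage_after_fix(host: str, is_openstack_compute: bool) -> List[str]:
    """Hypothetical post-fix builder: every OpenStack compute host gets a
    host-based migrate step, whether or not it has instances at build time."""
    steps: List[str] = []
    if is_openstack_compute:
        steps.append(MIGRATE_FROM_HOST)
    steps += [LOCK, PATCH, UNLOCK]
    return steps


def lock_succeeds(steps: List[str], instances_on_host_now: List[str]) -> bool:
    """Simplified model of the lock check: the lock fails if instances are
    still on the host when the lock step is reached."""
    remaining = list(instances_on_host_now)
    for step in steps:
        if step in (MIGRATE_INSTANCES, MIGRATE_FROM_HOST):
            remaining = []  # simplification: a migrate step empties the host
        elif step == LOCK:
            return not remaining
    return True


# Scenario from this bug: compute-1 is empty when the strategy is built, then
# tenant2-virtio26 is live-migrated onto it while another host is patched.
at_build = {"compute-0": ["tenant2-virtio26"], "compute-1": []}
during_patching = ["tenant2-virtio26"]

old_steps = build_stage_before_fix("compute-1", at_build)
new_steps = build_stage_after_fix("compute-1", is_openstack_compute=True)
```

In this model, `lock_succeeds(old_steps, during_patching)` is false because the old stage for compute-1 goes straight to the lock step, while `lock_succeeds(new_steps, during_patching)` is true because the host-based migrate step runs first.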

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Heitor Matsui (heitormatsui)
importance: Undecided → Medium
tags: added: stx.7.0 stx.nfv