Upgrade: upgrade orchestration reported failed (host-unlock-failed) on controller-0

Bug #1946255 reported by John Kung
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Low         John Kung

Bug Description

Brief Description:
-----------------

Upgrading from stx5.0, controller-1 upgraded without issue. During orchestrated upgrade of controller-0/worker-0, controller-0 reported a "host-unlock-failed" error and orchestration entered abort state.
I connected to serial console and saw controller-0 was booting up. When complete, it was running N+1 software. worker-0 hadn't been touched at this point and was still running stx5. I deleted the upgrade-strategy and took a collect at this point.

Severity:
--------
Minor. Aside from optics and needing to create a new upgrade strategy, everything seemed fine.

Steps to Reproduce:
------------------
Perform an upgrade from stx5.0

Expected Behavior:
-----------------
No error reported and orchestration completes.

Actual Behavior:
---------------
See above

Reproducibility:
---------------
Intermittent. Seen once on internal system

System Configuration:
--------------------
CR: Standard config w/ 2 controllers and 1 worker. IPv6 and DC system controller.

Timestamp/Logs:
--------------
Upgrade orchestration applied at 2021-09-23T15:13:33.000
"host unlock failed" error reported in fm-event.log at 2021-09-23T15:29:39.000

Issue is due to a slightly early host-unlock attempt during the orchestration. sysinv sets the initial ‘inventoried’ state, which is required for a successful host-unlock, just after the nfv-vim host-unlock attempt in the orchestration. Consider updating the state checks to include the ‘inventoried’ state, or having nfv-vim retry the host-unlock during the orchestration.
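
The retry option suggested above can be sketched roughly as follows. This is a minimal illustration only, not the nfv-vim implementation: `unlock_with_retry`, `UnlockRejected`, and the attempt/delay values are hypothetical names and numbers chosen for the example. The only detail taken from this report is the "has not yet been inventoried" text in the sysinv faultstring.

```python
import time

# Substring of the sysinv faultstring quoted in this report.
NOT_INVENTORIED = "has not yet been inventoried"


class UnlockRejected(Exception):
    """Hypothetical error carrying the faultstring of a rejected unlock."""


def unlock_with_retry(unlock, host, attempts=5, delay=10.0, sleep=time.sleep):
    """Call unlock(host), retrying only while the host is not yet inventoried.

    `unlock` is any callable that raises UnlockRejected on failure.
    `attempts` and `delay` are illustrative values, not taken from nfv-vim.
    """
    for attempt in range(1, attempts + 1):
        try:
            return unlock(host)
        except UnlockRejected as exc:
            # Re-raise unrelated rejections, and give up after the last attempt.
            if NOT_INVENTORIED not in str(exc) or attempt == attempts:
                raise
            sleep(delay)
```

Retrying only on this specific faultstring keeps unrelated unlock failures visible immediately instead of hiding them behind the retry delay.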

nfv-vim.log encounters exception at the host-unlock step after upgrade-host:

 2021-09-23T15:29:39.166 controller-1 VIM_Thread[194767] DEBUG _host_director.py.605 Canceling previous host operation upgrade-hosts, before continuing with host operation unlock-hosts.
 2021-09-23T15:29:39.368 controller-1 VIM_Thread[194767] ERROR Caught exception while trying to unlock a host controller-0, error=[OpenStack Rest-API Exception: method=PATCH, url=https://[2607:f160:10:9107:ce:290:0:1000]:6386/v1/ihosts/4920f432-ea20-453e-a043-e638dfb127e9, headers={'Content-Type': 'application/json', 'User-Agent': 'vim/1.0'}, body=[{"path": "/action", "value": "unlock", "op": "replace"}], status_code=400, reason=HTTP Error 400: Bad Request, response_headers=[('Date', 'Thu, 23 Sep 2021 15:29:39 GMT'), ('Content-Length', '227'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains'), ('Content-Type', 'application/json')], response_body={"error_message": "{\"debuginfo\": null, \"faultcode\": \"Client\", \"faultstring\": \"Can not unlock host controller-0 that has not yet been inventoried. Please wait for host to complete initial inventory prior to unlock.\"}"}].
 Traceback (most recent call last):
   File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_infrastructure_api.py", line 2705, in unlock_host

mtcAgent updates state to online after the reinstall has been issued:

2021-09-23T15:29:35.837 fmAPI.cpp(490): Enqueue raise alarm request: UUID (2694e1de-d4b4-4568-9431-24d359959e7d) alarm id (200.022) instant id (host=controller-0.status=online)
2021-09-23T15:29:35.837 [194673.00305] controller-1 mtcAgent inv mtcInvApi.cpp (1119) mtcInvApi_update_state : Info : controller-0 online (seq:20)
2021-09-23T15:29:35.837 [194673.00306] controller-1 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-0 Task: Reinstall Succeeded (seq:21)

However, sysinv reported the inventoried state change only after the host-unlock attempt:

 sysinv 2021-09-23 15:29:39.811 7948 INFO sysinv.agent.manager [-] Initial inventory completed host 4920f432-ea20-453e-a043-e638dfb127e9
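
To make the ordering concrete, the short sketch below computes the gaps between the three log timestamps quoted in this report. The event labels are descriptive names chosen for the example, and the calculation assumes the clocks behind the different logs are synchronized.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log excerpts in this report.
events = {
    "mtcAgent reports controller-0 online": "2021-09-23T15:29:35.837",
    "nfv-vim unlock attempt rejected": "2021-09-23T15:29:39.368",
    "sysinv initial inventory completed": "2021-09-23 15:29:39.811",
}

parsed = {name: datetime.fromisoformat(ts) for name, ts in events.items()}
online = parsed["mtcAgent reports controller-0 online"]
unlock = parsed["nfv-vim unlock attempt rejected"]
inventoried = parsed["sysinv initial inventory completed"]

one_ms = timedelta(milliseconds=1)
online_to_unlock_ms = (unlock - online) // one_ms
unlock_to_inventory_ms = (inventoried - unlock) // one_ms

print(f"online -> unlock attempt: {online_to_unlock_ms} ms")
print(f"unlock attempt -> inventory completed: {unlock_to_inventory_ms} ms")
```

Under that assumption, the unlock was attempted 443 ms before sysinv logged that the initial inventory had completed, which is consistent with the race described above.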

Alarms
Just the standard upgrade alarms. Systems were very clean (fresh restoration w/ patches committed).

Workaround
Recreate and rerun the orchestration

John Kung (john-kung)
Changed in starlingx:
assignee: nobody → John Kung (john-kung)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/nfv/+/812740

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/812740
Committed: https://opendev.org/starlingx/nfv/commit/e6c79af445701c75c4d3168f0ae956482a63913a
Submitter: "Zuul (22348)"
Branch: master

commit e6c79af445701c75c4d3168f0ae956482a63913a
Author: John Kung <email address hidden>
Date: Wed Oct 6 14:13:04 2021 -0500

    Upgrade orchestration handle host-unlock after HostUpgrade

    During a platform upgrade, the HostUpgrade step results in a
    host-reinstall which requires the inventory to be repopulated.
    There is a potential sysinv raised exception which may occur
    when attempted too early.

    Similar was done for AIO/workers however, another condition
    requiring retry may also occur on controller/storage due to
    the need to repopulate the inventory prior to the host-unlock.

    Test Plan:
    Perform Orchestrated upgrade of Controller Hosts
    Perform Orchestrated upgrade of Storage Hosts

    Closes-Bug: 1946255
    Signed-off-by: John Kung <email address hidden>
    Change-Id: I59db9d34cae8331bd64135316b931bdf9aec051a

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.6.0 stx.update