Patch orchestration failed on unlocking controllers

Bug #1883176 reported by Anujeyan Manokeran
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  High        Eric MacDonald

Bug Description

Brief Description
-----------------

During a test patch apply using patch orchestration, the controller unlock failed. Unlocking the controller manually succeeded. Further investigation by Bart produced the following findings.

2020-06-11 17:06:43 – The VIM issues a host-install request to patching for controller-0. This seems to complete immediately.
The VIM then waits 15 seconds (intentionally).
2020-06-11 17:06:59 – The VIM issues a host-unlock request for controller-0 to sysinv:
Sysinv queries patching to see if the host is patch current.
The patcher seems to say the host is not patch current:
sysinv 2020-06-11 17:07:00.289 254793 WARNING wsme.api [-] Client-side error: host-unlock rejected: Not patch current. 'sw-patch host-install controller-0' is required.: ClientSideError: host-unlock rejected: Not patch current. 'sw-patch host-install controller-0' is required.

Patch orchestration details

apply-phase:
    total-stages: 2
    current-stage: 0
    stop-at-stage: 2
    timeout: 10173 seconds
    completion-percentage: 100%
    start-date-time: 2020-06-11 17:00:16
    end-date-time: 2020-06-11 17:07:00
    result: failed
    reason: host unlock failed
    stages:
        stage-id: 0
        stage-name: sw-patch-controllers
        total-steps: 7
        current-step: 5
        timeout: 5536 seconds
        start-date-time: 2020-06-11 17:00:16
        end-date-time: 2020-06-11 17:07:00
        result: failed
        reason: host unlock failed
        steps:
            step-id: 0
            step-name: query-alarms
            timeout: 60 seconds
            start-date-time: 2020-06-11 17:00:16
            end-date-time: 2020-06-11 17:00:16
            result: success
            reason:
            step-id: 1
            step-name: swact-hosts
            entity-type: hosts
            entity-names: [u'controller-0']
            timeout: 900 seconds
            start-date-time: 2020-06-11 17:00:16
            end-date-time: 2020-06-11 17:03:56
            result: success
            reason:
            step-id: 2
            step-name: lock-hosts
            entity-type: hosts
            entity-names: [u'controller-0']
            timeout: 900 seconds
            start-date-time: 2020-06-11 17:03:56
            end-date-time: 2020-06-11 17:06:43
            result: success
            reason:
            step-id: 3
            step-name: sw-patch-hosts
            entity-type: hosts
            entity-names: [u'controller-0']
            timeout: 1800 seconds
            start-date-time: 2020-06-11 17:06:43
            end-date-time: 2020-06-11 17:06:43
            result: success
            reason:
            step-id: 4
            step-name: system-stabilize
            timeout: 15 seconds
            start-date-time: 2020-06-11 17:06:43
            end-date-time: 2020-06-11 17:06:59
            result: success
            reason:
            step-id: 5
            step-name: unlock-hosts
            entity-type: hosts
            entity-names: [u'controller-0']
            timeout: 1800 seconds
            start-date-time: 2020-06-11 17:06:59
            end-date-time: 2020-06-11 17:07:00
            result: failed
            reason: host unlock failed
            step-id: 6
            step-name: system-stabilize
            timeout: 60 seconds
            result: initial
            reason:
        stage-id: 1
        stage-name: sw-patch-worker-hosts
        total-steps: 6
        current-step: 0
        timeout: 4636 seconds
        start-date-time:
        end-date-time:
        result: initial
        reason:
        steps:
            step-id: 0
            step-name: query-alarms
            timeout: 60 seconds
            result: initial
            reason:
            step-id: 1
            step-name: lock-hosts
            entity-type: hosts
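
To find the failing step in strategy output like the listing above without reading it line by line, something along these lines may help. This is a minimal sketch, not part of the orchestration tooling: the function name `failed_steps` and the `SAMPLE` text (a shortened copy of the output above) are made up for illustration, and it assumes the output keeps the `key: value` layout shown here.

```python
def failed_steps(text):
    """Return (stage-name, step-name, reason) for each step whose result is 'failed'."""
    stage = None   # name of the stage currently being read
    step = None    # dict for the step currently being read, or None
    steps = []
    for raw in text.splitlines():
        # Split at the first colon only, so timestamps in values stay intact.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        value = value.strip()
        if key == "stage-id":
            # A new stage begins; stage-level keys that follow are ignored.
            stage, step = None, None
        elif key == "stage-name":
            stage = value
        elif key == "step-id":
            step = {"stage": stage}
            steps.append(step)
        elif step is not None:
            step[key] = value
    return [
        (s["stage"], s.get("step-name"), s.get("reason", ""))
        for s in steps
        if s.get("result") == "failed"
    ]


# Shortened, hand-copied excerpt of the output above (illustration only).
SAMPLE = """
stages:
    stage-id: 0
    stage-name: sw-patch-controllers
    result: failed
    reason: host unlock failed
    steps:
        step-id: 4
        step-name: system-stabilize
        result: success
        reason:
        step-id: 5
        step-name: unlock-hosts
        result: failed
        reason: host unlock failed
        step-id: 6
        step-name: system-stabilize
        result: initial
        reason:
"""

for stage_name, step_name, reason in failed_steps(SAMPLE):
    print(stage_name, step_name, reason)
```

For the excerpt, this reports only the unlock-hosts step of the sw-patch-controllers stage. The stage-level "result: failed" line is skipped on purpose, since it only repeats the failure of the step inside it.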

Severity
--------
Major

System Configuration
--------------------
wcp-71-75

Expected Behavior
------------------

No failure on unlock

Actual Behavior
----------------

As described above, the unlock issued by patch orchestration fails.

Reproducibility
---------------

Tried only once with this load.

Load
----
2020-06-10_20-00-00

Last Pass
---------

It passed on 2020-06-10_20-00-00 with a different rebootable patch. The failure above occurred with a large test patch.

Timestamp/Logs
--------------
2020-06-11 17:06:43

Test Activity
-------------
Regression test

summary: - patch orchestration failed on unlocking controllers
+ Patch orchestration failed on unlocking controllers
Anujeyan Manokeran (anujeyan) wrote :
tags: added: stx.retestneeded
Anujeyan Manokeran (anujeyan) wrote :

This issue was reproduced on wcp-78-79 (duplex lab) with load 2020-06-10 22:43:29.
apply-phase:
    total-stages: 2
    current-stage: 0
    stop-at-stage: 2
    timeout: 11073 seconds
    completion-percentage: 100%
    start-date-time: 2020-06-12 13:44:15
    end-date-time: 2020-06-12 13:50:34
    result: failed
    reason: host unlock failed
    stages:
        stage-id: 0
        stage-name: sw-patch-worker-hosts
        total-steps: 7
        current-step: 5
        timeout: 5536 seconds
        start-date-time: 2020-06-12 13:44:15
        end-date-time: 2020-06-12 13:50:34
        result: failed
        reason: host unlock failed
        steps:

Ghada Khalil (gkhalil) wrote :

Low priority - this is only seen with a large patch that attempts to update 1000 RPMs. It's not a realistic user scenario.

tags: added: stx.update
description: updated
Changed in starlingx:
status: New → Triaged
importance: Undecided → Low
assignee: nobody → Don Penney (dpenney)
Bart Wensley (bartwensley) wrote :

It turns out this could happen with any reboot-required patch. There are some other conditions that must be true as well (e.g. the first controller to be patched must be the active controller). This should be fixed for stx.4.0.

The bug was introduced with the FPGA orchestration changes. The bug is in nfv_vim/database/_database_infrastructure_module.py. The database_host_get_list function is using the wrong data model (Host_v6 instead of Host_v7). This results in the hosts table being empty when the VIM starts up over a swact. In turn, the SwPatch object is recreated incorrectly - any strategy steps that require the host table to recreate themselves (e.g. SwPatchHostsStep) will be recreated with missing data and then the strategy will not apply correctly.
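
To illustrate the failure mode described above, here is a minimal sketch of how reading through a stale model version can return an empty host list. It is not the nfv_vim code: the classes `HostV6` and `HostV7`, the `DATABASE` dict and `host_get_list` are hypothetical stand-ins, and the sketch assumes that each model version maps to its own table and that only the newest table holds rows.

```python
# Hypothetical, simplified stand-ins for two versions of a host data model.
# The names and the table layout are assumptions for illustration only.
class HostV6:
    table = "hosts_v6"


class HostV7:
    table = "hosts_v7"


# Toy in-memory "database": assume only the newest table holds any rows.
DATABASE = {
    "hosts_v6": [],
    "hosts_v7": [{"name": "controller-0"}, {"name": "controller-1"}],
}


def host_get_list(model):
    """Return every row stored in the table that `model` maps to."""
    return list(DATABASE[model.table])


stale = host_get_list(HostV6)  # old model version: reads the empty table
fixed = host_get_list(HostV7)  # current model version: reads the real rows

print(len(stale), len(fixed))  # prints "0 2"
```

In this toy setup nothing raises an error when the old model is used. The caller simply gets no hosts back, which matches the symptom of anything rebuilt from the host table coming up with missing data.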

Changed in starlingx:
assignee: Don Penney (dpenney) → Eric MacDonald (rocksolidmtce)
tags: added: stx.nfv
removed: stx.update
Ghada Khalil (gkhalil) wrote :

Raising the priority based on the above comment from Bart.

Changed in starlingx:
importance: Low → High
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.4.0
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Bart Wensley (bartwensley)
assignee: Bart Wensley (bartwensley) → Eric MacDonald (rocksolidmtce)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/735938
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=f19dc403602ed072f641c8d1cefb6a848389862a
Submitter: Zuul
Branch: master

commit f19dc403602ed072f641c8d1cefb6a848389862a
Author: Eric MacDonald <email address hidden>
Date: Tue Jun 16 10:32:52 2020 -0400

    Fix patch orchestration controller host unlock failure

    Recent fw update orchestration introduced a new version of
    host data model (Host_v7) without updating the host model
    used by the host get list version which broke patch
    orchestration for reboot required patches.

    This update increments the required host data model for get
    host list from Host_v6 to Host_v7

    Change-Id: I0b25d45e1c61dbbc91ca9fcec77472971ac836e0
    Closes-Bug: 1883176
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Yosief Gebremariam (ygebrema) wrote :

Verified on a 2 + 3 system (wcp-71-75) installed with build 2020-06-27_00-41-42.
An RR patch was applied successfully to all nodes via patch orchestration.

tags: removed: stx.4.0 stx.nfv stx.retestneeded
Ghada Khalil (gkhalil)
tags: added: stx.4.0 stx.nfv