Patch install failure on controller-1 was not detected by patch orchestration; the strategy was showing the wrong state

Bug #1842952 reported by Anujeyan Manokeran
Affects     Status        Importance  Assigned to  Milestone
StarlingX   Fix Released  Low         Al Bailey

Bug Description

Brief Description
-----------------

Patch orchestration does not show the correct state of the patch install on the host after a 2019-09-04_00-10-00_INSVC_POSTINSTALL_FAILURE patch failure. sw-patch query-hosts showed controller-1 as install-failed, but the patch strategy reported success. See the output below from sw-patch query-hosts and the patch strategy.

sudo sw-patch query-hosts
Password:
  Hostname      IP Address       Patch Current  Reboot Required  Release  State
  ============  ===============  =============  ===============  =======  ==============
  compute-0     192.168.204.166  No             No               19.09    idle
  compute-1     192.168.204.172  No             No               19.09    idle
  controller-0  192.168.204.3    No             No               19.09    idle
  controller-1  192.168.204.4    Failed         No               19.09    install-failed

sw-manager patch-strategy show
Strategy Patch Strategy:
  strategy-uuid: 2be9b682-f5ab-4212-a4c6-689385b3519e
  controller-apply-type: serial
  storage-apply-type: serial
  worker-apply-type: parallel
  max-parallel-worker-hosts: 2
  default-instance-action: stop-start
  alarm-restrictions: relaxed
  current-phase: abort
  current-phase-completion: 100%
  state: aborted
  apply-result: failed
  apply-reason: alarms from platform are present
  abort-result: success
  abort-reason:

sw-manager patch-strategy: error: too few arguments
[sysadmin@controller-0 ~(keystone_admin)]$ sw-manager patch-strategy show --details
Strategy Patch Strategy:
  strategy-uuid: 2be9b682-f5ab-4212-a4c6-689385b3519e
  controller-apply-type: serial
  storage-apply-type: serial
  worker-apply-type: parallel
  max-parallel-worker-hosts: 2
  default-instance-action: stop-start
  alarm-restrictions: relaxed
  current-phase: abort
  current-phase-completion: 100%
  state: aborted
  build-phase:
    total-stages: 1
    current-stage: 1
    stop-at-stage: 1
    timeout: 182 seconds
    completion-percentage: 100%
    start-date-time: 2019-09-05 14:46:53
    end-date-time: 2019-09-05 14:46:53
    result: success
    reason:
    stages:
        stage-id: 0
        stage-name: sw-patch-query
        total-steps: 3
        current-step: 3
        timeout: 181 seconds
        start-date-time: 2019-09-05 14:46:53
        end-date-time: 2019-09-05 14:46:53
        result: success
        reason:
        steps:
            step-id: 0
            step-name: query-alarms
            timeout: 60 seconds
            start-date-time: 2019-09-05 14:46:53
            end-date-time: 2019-09-05 14:46:53
            result: success
            reason:

            step-id: 1
            step-name: query-sw-patches
            timeout: 60 seconds
            start-date-time: 2019-09-05 14:46:53
            end-date-time: 2019-09-05 14:46:53
            result: success
            reason:

            step-id: 2
            step-name: query-sw-patch-hosts
            timeout: 60 seconds
            start-date-time: 2019-09-05 14:46:53
            end-date-time: 2019-09-05 14:46:53
            result: success
            reason:

  apply-phase:
    total-stages: 3
    current-stage: 1
    stop-at-stage: 3
    timeout: 5674 seconds
    completion-percentage: 100%
    start-date-time: 2019-09-05 14:47:04
    end-date-time: 2019-09-05 14:47:55
    result: failed
    reason: alarms from platform are present
    stages:
        stage-id: 0
        stage-name: sw-patch-controllers
        total-steps: 3
        current-step: 3
        timeout: 1891 seconds
        start-date-time: 2019-09-05 14:47:04
        end-date-time: 2019-09-05 14:47:55
        result: success
        reason:
        steps:
            step-id: 0
            step-name: query-alarms
            timeout: 60 seconds
            start-date-time: 2019-09-05 14:47:04
            end-date-time: 2019-09-05 14:47:05
            result: success
            reason:

            step-id: 1
            step-name: sw-patch-hosts
            entity-type: hosts
            entity-names: [u'controller-1']
            entity-uuids: [u'46330cc5-9ca3-44b2-a385-38b242b616c0']
            timeout: 1800 seconds
            start-date-time: 2019-09-05 14:47:05
            end-date-time: 2019-09-05 14:47:24
            result: success
            reason:

            step-id: 2
            step-name: system-stabilize
            timeout: 30 seconds
            start-date-time: 2019-09-05 14:47:24
            end-date-time: 2019-09-05 14:47:55
            result: success
            reason:

        stage-id: 1
        stage-name: sw-patch-controllers
        total-steps: 3
        current-step: 0
        timeout: 1891 seconds
        start-date-time: 2019-09-05 14:47:55
        end-date-time: 2019-09-05 14:47:55
        result: failed
        reason: alarms from platform are present
        steps:
            step-id: 0
            step-name: query-alarms
            timeout: 60 seconds
            start-date-time: 2019-09-05 14:47:55
            end-date-time: 2019-09-05 14:47:55
            result: failed
            reason: alarms from platform are present

            step-id: 1
            step-name: sw-patch-hosts
            entity-type: hosts
            entity-names: [u'controller-0']
            entity-uuids: [u'78fc303d-feb7-452a-b7fe-838f4198c6ef']
            timeout: 1800 seconds
            result: initial
            reason:

            step-id: 2
            step-name: system-stabilize
            timeout: 30 seconds
            result: initial
            reason:

        stage-id: 2
        stage-name: sw-patch-worker-hosts
        total-steps: 3
        current-step: 0
        timeout: 1891 seconds
        start-date-time:
        end-date-time:
        result: initial
        reason:
        steps:
            step-id: 0
            step-name: query-alarms
            timeout: 60 seconds
            result: initial
            reason:

            step-id: 1
            step-name: sw-patch-hosts
            entity-type: hosts
            entity-names: [u'compute-1', u'compute-0']
            entity-uuids: [u'f6f60a38-4387-4cd1-9923-f20bb8b1c501', u'cfdd32c8-ce7a-45b3-b04a-e6306d481901']
            timeout: 1800 seconds
            result: initial
            reason:

            step-id: 2
            step-name: system-stabilize
            timeout: 30 seconds
            result: initial

Severity
--------
Major
Steps to Reproduce
------------------
1. Upload the in-service post-install failure patch
2. Apply the patch
3. Create a patch strategy
4. Apply the patch strategy and monitor for failures
5. As per the description, the patch install failed on controller-1 but patch orchestration did not detect it

System Configuration
--------------------
Regular system

Expected Behavior
------------------
A patch install failure on a host should be detected by patch orchestration, and orchestration should not proceed to the next hosts.
Actual Behavior
----------------
As per the description, the host install failure was not detected.

Reproducibility
---------------
100% reproducible.
Load
----
Build date: 2019-09-04_00-10-00
Last Pass
---------
Not available
Timestamp/Logs
--------------
2019-09-05 14:46:53
Test Activity
-------------
Regression test

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - negative test scenario; not likely for typical use-cases

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Al Bailey (albailey1974)
tags: added: stx.3.0 stx.update
description: updated
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Al Bailey (albailey1974) wrote :

The sw-patch query-hosts output shows this for controller-1:
    {
        "allow_insvc_patching": true,
        "hostname": "controller-1",
        "installed": {},
        "interim_state": false,
        "ip": "192.168.204.4",
        "missing_pkgs": [],
        "nodetype": "controller",
        "patch_current": true,
        "patch_failed": true,
        "requires_reboot": false,
        "secs_since_ack": 13,
        "stale_details": false,
        "state": "install-failed",
        "subfunctions": [
            "controller"
        ],
        "sw_version": "19.09",
        "to_remove": []
    }

The patch_current and patch_failed states are correctly extracted here:
https://opendev.org/starlingx/nfv/src/branch/master/nfv/nfv-plugins/nfv_plugins/nfvi_plugins/nfvi_sw_mgmt_api.py#L145

and the strategy evaluates them here (first patch_current, then patch_failed):
https://opendev.org/starlingx/nfv/src/branch/master/nfv/nfv-vim/nfv_vim/strategy/_strategy_steps.py#L515

So presumably, the order of those two checks should be reversed.
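
For illustration only, a minimal sketch of that ordering problem (not the actual nfv-vim code; the field names come from the sw-patch query-hosts output above, and the return values are illustrative):

    # Hypothetical sketch: patch_current is evaluated before patch_failed,
    # so a host that is both patch-current and patch-failed (the in-service
    # post-install failure case) is reported as successful.
    def host_patch_result(host):
        if host["patch_current"]:
            return "success"   # reached first, masking the failure
        if host["patch_failed"]:
            return "failed"
        return "in-progress"

    controller_1 = {"patch_current": True, "patch_failed": True}
    print(host_patch_result(controller_1))  # prints "success", hiding the failure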

Revision history for this message
Frank Miller (sensfan22) wrote :

As this is a minor issue, suggest changing the priority of this to Low.

Changed in starlingx:
importance: Medium → Low
tags: removed: stx.3.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/nfv/+/786952

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/786952
Committed: https://opendev.org/starlingx/nfv/commit/c2f818c959c5709691f0d5c832e8a1777d287684
Submitter: "Zuul (22348)"
Branch: master

commit c2f818c959c5709691f0d5c832e8a1777d287684
Author: albailey <email address hidden>
Date: Mon Apr 19 10:25:03 2021 -0500

    Handle patch install failure scenario in patch orchestration

    When a patch installs an RPM but there is a post-installation failure
    caused by a restart script error, the host is considered both
    patch-current and patch-failed.

    However, VIM patch orchestration was treating this as successful.
    The VIM logic now checks the patch-failed flag before the patch-current
    check, so this type of patch orchestration failure is properly
    characterized.

    Closes-Bug: 1842952
    Signed-off-by: albailey <email address hidden>
    Change-Id: I2f8a784be4702537abff7996156344a3b558aefe
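
For illustration, a sketch of the reordered check the commit describes (again, not the literal nfv-vim code; only the flag names come from the query-hosts output):

    # Hypothetical sketch of the fix: evaluate patch_failed before
    # patch_current, so a host with both flags set is reported as failed.
    def host_patch_result(host):
        if host["patch_failed"]:
            return "failed"    # failure now takes precedence
        if host["patch_current"]:
            return "success"
        return "in-progress"

    controller_1 = {"patch_current": True, "patch_failed": True}
    print(host_patch_result(controller_1))  # prints "failed", as expected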

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/792239

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nfv (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/792239

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/796295

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/796327

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nfv (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/796295

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/796327
Committed: https://opendev.org/starlingx/nfv/commit/96fa4281d73e701e58388228c8e8e85491785c38
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 73c683d5337beff6062b40f011f3b775f3c70107
Author: Eric MacDonald <email address hidden>
Date: Fri May 21 17:25:38 2021 -0400

    Update fw-update-strategy steps to load wait_time from_dict

    The sw-manager fw-update-strategy feature is seen
    to fail in a traceback.

    The __wait_time member of the FwUpdateHostsStep and
    FwUpdateAbortHostsStep objects are not de-serialized
    from the DB using the ‘from_dict’ methods. This means
    it does not run the ‘init’ method for those classes,
    but instead attempts to re-constitute the object
    directly which can lead to an exception\traceback.

    This update adds the _wait_time member to each of these
    fw-update-strategy class objects' 'from_dict' function.

    This update also removes another object member, this one
    currently unused, that would also not be de-serialized
    if it were to be put to use as is in the future.

    Test Plan:

    PASS: Verify end-to-end orchestrated fw update (x2)

    Closes-Bug: 1929251
    Change-Id: I4540d1712f4dfee74e592c4f3ebce9c7cc913ab2
    Signed-off-by: Eric MacDonald <email address hidden>
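
Purely as an illustration of the pattern this commit describes (the class and field names below are hypothetical, not the actual nfv-vim definitions): a member that from_dict does not restore is simply absent after the object is rebuilt from the database.

    import json

    class HostsStep(object):
        """Hypothetical strategy step standing in for FwUpdateHostsStep."""

        def __init__(self, hosts, wait_time=0):
            self._hosts = hosts
            self._wait_time = wait_time

        def as_dict(self):
            return {'hosts': self._hosts, 'wait_time': self._wait_time}

        @classmethod
        def from_dict(cls, data):
            step = cls.__new__(cls)   # __init__ is not run on this path
            step._hosts = data['hosts']
            # Without the next line, any later use of _wait_time raises
            # AttributeError -- the traceback the commit refers to.
            step._wait_time = data.get('wait_time', 0)
            return step

    step = HostsStep.from_dict(json.loads('{"hosts": ["controller-0"]}'))
    print(step._wait_time)  # 0, instead of an AttributeError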

commit 5ff24cf13f9d8cacab9ec15ff193fc8c819d31f4
Author: albailey <email address hidden>
Date: Fri May 21 17:51:38 2021 -0500

    Specify the nodeset for zuul jobs

    The py2.7 jobs need to specify xenial.
    Changed py37 to py36 and specified bionic.

    The un-specified python3 jobs work fine on either
    focal or bionic.

    Zuul is not set up to trigger off code changes in this repo,
    so no source code changes are required to trigger the Zuul
    jobs.

    Partial-Bug: 1928978
    Signed-off-by: albailey <email address hidden>
    Change-Id: Iab9c8727a0f16fa7ff02c20ca3bec5622abe7bd7

commit 98d66c7f3bc46e1a990907db1c8f498f9841c885
Author: albailey <email address hidden>
Date: Thu May 6 12:03:15 2021 -0500

    Fix swact issue when deserializing an old patch strategy

    If a patch strategy in a previous release is de-serialized
    in the vim running a load that contains this commit
    https://review.opendev.org/c/starlingx/nfv/+/780310

    the vim would fail to start up due to key errors, as it
    expected fields that did not exist in the previous release.

    Closes-Bug: 1927526
    Signed-off-by: albailey <email address hidden>
    Change-Id: Ia72463feb50f7d6a2491242ec865f7c854c75419

commit e5856549e51f10ae6818ec1d0ec43568225e9bd9
Author: albailey <email address hidden>
Date: Thu May 6 12:46:29 2021 -0500

    Increase the patching apply_patch REST API timeout

    During a kubernetes upgrade orchestration, the kubernetes
    patch needs to be applied. The default timeout was 20 seconds
    but a lab took 24 seconds.

    This update increases the timeout for that API call.

    Closes-Bug: 1927532
    Signed-off-by: albailey <email address hidden>
    Change-Id: I63a6c5616f6abf7a5b6879e5ebd458a8ecc52ba7

commit 4ffec1...

tags: added: in-f-centos8
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded