Patch strategy failed to apply due to timeout before nodes are properly restarted

Bug #1907851 reported by Thiago Paiva Brito
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Thiago Paiva Brito

Bug Description

Brief Description
-----------------
RR patch orchestration failed because a platform alarm for the hypervisor did not clear in time.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
Launch 40 VMs
Start RR patching using the following configuration:
- Controller Apply Type: serial
- Storage Apply Type: serial
- Worker Apply Type: parallel
- Maximum Parallel Worker Hosts: 2
- Default Instance Action: migrate
- Alarm Restrictions: relaxed

Expected Behavior
------------------
Expected patch orchestration to be fully applied to all nodes

Actual Behavior
----------------
Patch orchestration failed

Reproducibility
---------------
100% Reproducible (tried 4 times)

System Configuration
--------------------
Dedicated Storage with 8 worker nodes and stx-openstack

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Last Pass
---------
Did this test scenario pass previously? If so, please indicate the load/pull time info of the last pass.
Use this section to also indicate if this is a new test scenario.

Timestamp/Logs
--------------
2020-11-25 21:02:46 patch orchestration failed

2020-11-25T21:01:58.000 controller-1 fmManager: info { "event_log_id" : "270.102", "reason_text" : "Host compute-1 compute services enabled", "entity_instance_id" : "host=compute-1.services=compute", "severity" : "critical", "state" : "msg", "timestamp" : "2020-11-25 21:01:58.733857" }
2020-11-25T21:01:58.000 controller-1 fmManager: info { "event_log_id" : "270.001", "reason_text" : "Host compute-1 compute services failure", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-pv-0.host=compute-1.services=compute", "severity" : "critical", "state" : "clear", "timestamp" : "2020-11-25 21:01:58.847231" }
2020-11-25T21:02:02.000 controller-1 fmManager: info { "event_log_id" : "275.001", "reason_text" : "Host compute-1 hypervisor is now unlocked-enabled", "entity_instance_id" : "host=compute-1.hypervisor=69c5d65f-9419-43a5-998e-47b10d6b5328", "severity" : "critical", "state" : "msg", "timestamp" : "2020-11-25 21:02:02.669275" }
2020-11-25T21:02:33.000 controller-1 fmManager: info { "event_log_id" : "275.001", "reason_text" : "Host compute-0 hypervisor is now unlocked-enabled", "entity_instance_id" : "host=compute-0.hypervisor=3528172e-eff8-49f6-9f8d-5cc1d43ee18b", "severity" : "critical", "state" : "msg", "timestamp" : "2020-11-25 21:02:33.485982" }
2020-11-25T21:02:46.000 controller-1 fmManager: info { "event_log_id" : "900.101", "reason_text" : "Software patch auto-apply inprogress", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-pv-0.orchestration=sw-patch", "severity" : "major", "state" : "clear", "timestamp" : "2020-11-25 21:02:46.669574" }
2020-11-25T21:02:46.000 controller-1 fmManager: info { "event_log_id" : "900.115", "reason_text" : "Software patch auto-apply failed, reason = alarms from platform are present", "entity_instance_id" : "orchestration=sw-patch", "severity" : "critical", "state" : "msg", "timestamp" : "2020-11-25 21:02:46.505503" }
2020-11-25T21:02:46.000 controller-1 fmManager: info { "event_log_id" : "900.103", "reason_text" : "Software patch auto-apply failed", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-pv-0.orchestration=sw-patch", "severity" : "critical", "state" : "set", "timestamp" : "2020-11-25 21:02:46.506976" }
2020-11-25T21:02:46.000 controller-1 fmManager: info { "event_log_id" : "900.121", "reason_text" : "Software patch auto-apply aborted", "entity_instance_id" : "orchestration=sw-patch", "severity" : "critical", "state" : "msg", "timestamp" : "2020-11-25 21:02:46.549657" }
2020-11-25T21:03:06.000 controller-1 fmManager: info { "event_log_id" : "270.001", "reason_text" : "Host compute-0 compute services failure", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-pv-0.host=compute-0.services=compute", "severity" : "critical", "state" : "clear", "timestamp" : "2020-11-25 21:03:06.684012" }
2020-11-25T21:03:06.000 controller-1 fmManager: info { "event_log_id" : "270.102", "reason_text" : "Host compute-0 compute services enabled", "entity_instance_id" : "host=compute-0.services=compute", "severity" : "critical", "state" : "msg", "timestamp" : "2020-11-25 21:03:06.642849" }

Test Activity
-------------
System Test

Workaround
----------
Recreate and reapply the strategy several times until all hosts are in the Applied state

Changed in starlingx:
assignee: nobody → Thiago Paiva Brito (outbrito)
Revision history for this message
Thiago Paiva Brito (outbrito) wrote :

Investigating this problem, I found that it happens because, when patching the computes in batches of 2, the process moves to the next stage of the strategy while the VIM "hypervisor disabled" alarm is still active for a compute that is still restarting. In the next stage, the first step, QueryAlarms, times out after just one minute, which does not leave enough time for the compute to finish coming up with the nova-compute pod. The compute reaches the "hypervisor enabled" state 15 to 20 seconds after the strategy fails. This behavior was verified in the logs and reproduced at least 4 times on the described setup.
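
To make the race concrete, here is a minimal sketch of the failure mode (hypothetical names and poll interval; this is not the actual nfv-vim implementation): a step that polls for platform alarms gives up after a fixed 60 seconds, while in this bug the hypervisor alarm only clears 15-20 seconds later.

    import time

    QUERY_ALARMS_TIMEOUT = 60  # seconds; too short for a compute still restarting

    def query_alarms_step(get_active_alarms, timeout=QUERY_ALARMS_TIMEOUT):
        """Fail the stage if platform alarms are still present at timeout."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if not get_active_alarms():
                return True  # alarms cleared: safe to continue the strategy
            time.sleep(5)  # poll interval (assumed)
        # In this bug, the hypervisor-disabled alarm clears 15-20 s after
        # this deadline, so the whole strategy aborts just short of success.
        return False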

Increasing the timeout of QueryAlarms would be a quick fix, but I think we should change the worker apply stages to add a WaitAlarmsClearStep for computes as well, since today the strategy waits only when applying to workers that are also on the OpenStack control plane (Simplex and Duplex); see the sketch below.
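
A hedged sketch of that idea (the stub class stands in for the real nfv-vim strategy step; the timeout value and alarm id are assumptions, not the project's values):

    class WaitAlarmsClearStep(object):
        """Stand-in for the real strategy step: holds the stage open
        until platform alarms clear or the timeout expires."""
        def __init__(self, timeout_in_secs, ignore_alarms=()):
            self.timeout_in_secs = timeout_in_secs
            self.ignore_alarms = tuple(ignore_alarms)

    def add_worker_apply_steps(stage_steps, on_openstack_control_plane):
        # ... lock / sw-patch / unlock steps are appended before this ...
        # Today the wait step is gated on the control plane, roughly:
        #     if on_openstack_control_plane:
        #         stage_steps.append(WaitAlarmsClearStep(...))
        # Proposed: always wait, so a compute that is still restarting
        # cannot trip the next stage's one-minute alarm query.
        stage_steps.append(
            WaitAlarmsClearStep(timeout_in_secs=600,          # assumed value
                                ignore_alarms=("900.001",)))  # example alarm id
        return stage_steps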

description: updated
Ghada Khalil (gkhalil)
tags: added: stx.nfv
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/792239

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nfv (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/792239

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/796295

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/796327

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nfv (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/796295

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/796327
Committed: https://opendev.org/starlingx/nfv/commit/96fa4281d73e701e58388228c8e8e85491785c38
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 73c683d5337beff6062b40f011f3b775f3c70107
Author: Eric MacDonald <email address hidden>
Date: Fri May 21 17:25:38 2021 -0400

    Update fw-update-strategy steps to load wait_time from_dict

    The sw-manager fw-update-strategy feature is seen
    to fail in a traceback.

    The __wait_time member of the FwUpdateHostsStep and
    FwUpdateAbortHostsStep objects is not de-serialized
    from the DB using the 'from_dict' methods. This means
    it does not run the '__init__' method for those classes,
    but instead attempts to re-constitute the object
    directly, which can lead to an exception/traceback.

    This update adds the _wait_time member to each of these
    fw-update-strategy class objects' 'from_dict' function.

    This update also removes another object member, this one
    currently unused, that would also not be de-serialized
    if it were to be put to use as is in the future.

    Test Plan:

    PASS: Verify end-to-end orchestrated fw update (x2)

    Closes-Bug: 1929251
    Change-Id: I4540d1712f4dfee74e592c4f3ebce9c7cc913ab2
    Signed-off-by: Eric MacDonald <email address hidden>

commit 5ff24cf13f9d8cacab9ec15ff193fc8c819d31f4
Author: albailey <email address hidden>
Date: Fri May 21 17:51:38 2021 -0500

    Specify the nodeset for zuul jobs

    The py2.7 jobs need to specify xenial
    Changed py37 to py36 and specify bionic.

    The un-specified python3 jobs work fine on either
    focal or bionic.

    zuul is not set up to trigger off code changes in this repo,
    so no source code changes are required to trigger the zuul
    jobs.

    Partial-Bug: 1928978
    Signed-off-by: albailey <email address hidden>
    Change-Id: Iab9c8727a0f16fa7ff02c20ca3bec5622abe7bd7

commit 98d66c7f3bc46e1a990907db1c8f498f9841c885
Author: albailey <email address hidden>
Date: Thu May 6 12:03:15 2021 -0500

    Fix swact issue when deserializing an old patch strategy

    If a patch strategy in a previous release is de-serialized
    in the vim running a load that contains this commit
    https://review.opendev.org/c/starlingx/nfv/+/780310

    the vim would fail to startup due to key errors as it
    expected fields that did not exist in the previous release.

    Closes-Bug: 1927526
    Signed-off-by: albailey <email address hidden>
    Change-Id: Ia72463feb50f7d6a2491242ec865f7c854c75419

commit e5856549e51f10ae6818ec1d0ec43568225e9bd9
Author: albailey <email address hidden>
Date: Thu May 6 12:46:29 2021 -0500

    Increase the patching apply_patch REST API timeout

    During a kubernetes upgrade orchestration, the kubernetes
    patch needs to be applied. The default timeout was 20 seconds
    but a lab took 24 seconds.

    This update increases the timeout for that API call.

    Closes-Bug: 1927532
    Signed-off-by: albailey <email address hidden>
    Change-Id: I63a6c5616f6abf7a5b6879e5ebd458a8ecc52ba7

commit 4ffec1...

tags: added: in-f-centos8