Patch strategy failed to apply due to timeout before nodes are properly restarted
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Thiago Paiva Brito |
Bug Description
Brief Description
-----------------
RR patch orchestration failed because a platform alarm for a hypervisor was not cleared in time.
Severity
--------
Major: System/Feature is usable but degraded
Steps to Reproduce
------------------
Launch 40 VMs
Start RR patching using the following configuration:
- Controller Apply Type: serial
- Storage Apply Type: serial
- Worker Apply Type: parallel
- Maximum Parallel Worker Hosts: 2
- Default Instance Action: migrate
- Alarm Restrictions: relaxed
Expected Behavior
------------------
Expected patch orchestration to be fully applied to all nodes
Actual Behavior
----------------
Patch orchestration failed
Reproducibility
---------------
100% Reproducible (tried 4 times)
System Configuration
--------------------
Dedicated Storage with 8 worker nodes and stx-openstack
Branch/Pull Time/Commit
-----------------------
(not provided)
Last Pass
---------
(not provided)
Timestamp/Logs
--------------
2020-11-25 21:02:46 patch orchestration failed
2020-11- { "event_log_id" : "270.102", "reason_text" : "Host compute-1 compute services enabled", "entity_
2020-11- { "event_log_id" : "270.001", "reason_text" : "Host compute-1 compute services failure", "entity_
2020-11- { "event_log_id" : "275.001", "reason_text" : "Host compute-1 hypervisor is now unlocked-enabled", "entity_
2020-11- { "event_log_id" : "275.001", "reason_text" : "Host compute-0 hypervisor is now unlocked-enabled", "entity_
2020-11- { "event_log_id" : "900.101", "reason_text" : "Software patch auto-apply inprogress", "entity_
2020-11- { "event_log_id" : "900.115", "reason_text" : "Software patch auto-apply failed, reason = alarms from platform are present", "entity_
2020-11- { "event_log_id" : "900.103", "reason_text" : "Software patch auto-apply failed", "entity_
2020-11- { "event_log_id" : "900.121", "reason_text" : "Software patch auto-apply aborted", "entity_
2020-11- { "event_log_id" : "270.001", "reason_text" : "Host compute-0 compute services failure", "entity_
2020-11- { "event_log_id" : "270.102", "reason_text" : "Host compute-0 compute services enabled", "entity_
Test Activity
-------------
System Test
Workaround
----------
Recreate and reapply the strategy several times until all servers are in the Applied state
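The workaround amounts to a retry loop around the strategy lifecycle. A minimal sketch, where `run_strategy` is a hypothetical stand-in for the real delete/recreate/apply cycle (not a real sw-manager API):

```python
# Toy sketch of the workaround: keep recreating and reapplying the
# strategy until it reaches the Applied state. `run_strategy` is a
# hypothetical stand-in for the real sw-manager delete/create/apply
# cycle, which in this report flakily failed on a timing race.

def run_strategy(attempt: int) -> str:
    """Stand-in: fails until an attempt happens to avoid the race."""
    return "applied" if attempt >= 3 else "apply-failed"

def apply_until_applied(max_attempts: int = 10) -> int:
    """Retry the full strategy cycle; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        if run_strategy(attempt) == "applied":
            return attempt
        # on failure: delete the failed strategy, recreate it, try again
    raise RuntimeError("strategy never reached Applied state")

assert apply_until_applied() == 3
```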
Changed in starlingx:
assignee: nobody → Thiago Paiva Brito (outbrito)
tags: added: stx.nfv

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0

Changed in starlingx:
status: Triaged → Fix Released
Investigating this problem, I found that it happens because, when upgrading the computes in batches of 2, the process moves to the next phase of the strategy while the VIM alarm about the "hypervisor disabled" state is still active, since one compute is still restarting. In the next phase, the first step, QueryAlarms, times out after just one minute, which does not leave the compute enough time to finish coming up with the nova-compute pod. 15 to 20 seconds after the strategy fails, the compute transitions to the "hypervisor enabled" state. This behavior was verified in the logs and reproduced at least 4 times on the described setup.
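The timing race described above can be illustrated with a toy simulation (all names and numbers here are illustrative, not the actual nfv-vim code or its real timeouts): a step that gives up after 60 seconds fails even though the alarm clears only some 15 to 20 seconds later.

```python
# Toy simulation of the race described above: the alarm raised while a
# compute restarts clears only ~75 s into the next phase, but the
# alarm-query step gives up after 60 s. All names and timings are
# illustrative stand-ins, not the real nfv-vim implementation.

def alarms_at(t: float) -> list[str]:
    """Platform alarms still present t seconds into the next phase."""
    ALARM_CLEAR_TIME = 75.0  # hypervisor re-enables ~15-20 s after the failure
    return ["hypervisor disabled"] if t < ALARM_CLEAR_TIME else []

def wait_for_alarms_clear(timeout: float, poll_interval: float = 5.0) -> bool:
    """Return True if alarms clear within `timeout` simulated seconds."""
    t = 0.0
    while t <= timeout:
        if not alarms_at(t):
            return True
        t += poll_interval
    return False  # the strategy step fails here

# With the one-minute timeout the step fails...
assert wait_for_alarms_clear(timeout=60.0) is False
# ...while a modestly longer wait succeeds.
assert wait_for_alarms_clear(timeout=120.0) is True
```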
Increasing the QueryAlarms timeout would be a quick fix, but I think we should instead change the worker apply stages to add a WaitAlarmsClearStep for computes as well, since the strategy currently waits only when applying to workers that are also part of the OpenStack control plane (Simplex and Duplex).
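The proposed change can be sketched as follows. Step names echo nfv-vim's strategy steps, but the classes and timeouts here are simplified stand-ins under my own assumptions, not the real implementation:

```python
# Sketch of the proposed fix: make every worker apply stage end with a
# wait-for-alarms-clear step, instead of adding it only for workers that
# also host the OpenStack control plane (Simplex/Duplex). Simplified
# stand-in code; names and timeouts are illustrative.

class Step:
    def __init__(self, name: str, timeout_s: int):
        self.name = name
        self.timeout_s = timeout_s

def build_worker_stage(hosts: list[str]) -> list[Step]:
    """Ordered steps for applying a patch to one batch of worker hosts."""
    return [
        Step("query-alarms", 60),     # first step; the 1-minute timeout that raced
        Step("lock-hosts", 900),
        Step("upgrade-hosts", 1800),
        Step("unlock-hosts", 1800),
        # Proposed change: wait here unconditionally, so the next stage's
        # query-alarms does not run while a hypervisor is still coming up.
        Step("wait-alarms-clear", 600),
    ]

names = [s.name for s in build_worker_stage(["compute-0", "compute-1"])]
assert names[-1] == "wait-alarms-clear"
```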