Patch Orchestration fails with in-service patches

Bug #1848541 reported by David Sullivan
Affects     Status         Importance   Assigned to      Milestone
StarlingX   Fix Released   Medium       Eric MacDonald

Bug Description

Brief Description
-----------------
During an in-service patch, some worker nodes raised platform CPU usage alarms that lasted more than 30 seconds. These alarms caused patch orchestration to fail.

Severity
--------
Major

Steps to Reproduce
------------------
On a 2 + 20 system, launch 600 pods
Perform patch orchestration with an in-service patch
Apply the patch to 10 worker nodes in parallel

Expected Behavior
------------------
Patch orchestration completes

Actual Behavior
----------------
Patch orchestration fails due to platform CPU alarms

Reproducibility
---------------
Reproducible

System Configuration
--------------------
2 + 20

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Last Pass
---------
Passed April 29th.

Timestamp/Logs
--------------
2019-10-09T18:42:30.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.07%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-35-60.host=compute-4", "severity" : "critical", "state" : "clear", "timestamp" : "2019-10-09 18:42:30.234560" }
2019-10-09T18:42:29.151 controller-0 VIM_Thread[1100113] INFO _strategy.py.382 Apply Complete Callback, result=failed, reason=alarms from platform are present.

Test Activity
-------------
System Test

Frank Miller (sensfan22) wrote :

This issue will be addressed by a fix for this LP: https://bugs.launchpad.net/starlingx/+bug/1848580

Ghada Khalil (gkhalil) wrote :

The proposal is to mark the CPU alarm as non-management-affecting, so that patch orchestration ignores it.
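As a rough illustration of that proposal, here is a minimal sketch of an orchestration check that only blocks on management-affecting alarms. The `Alarm` class, the `mgmt_affecting` field, and the `can_proceed` helper are hypothetical names for this example and are not taken from the VIM code. Alarm ID 100.101 comes from the log above; 999.999 is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alarm:
    alarm_id: str
    severity: str
    mgmt_affecting: bool  # hypothetical flag: does this alarm block orchestration?


def blocking_alarms(alarms: List[Alarm]) -> List[Alarm]:
    """Return only the alarms that should stop orchestration."""
    return [a for a in alarms if a.mgmt_affecting]


def can_proceed(alarms: List[Alarm]) -> bool:
    """Orchestration may continue when no blocking alarm is present."""
    return not blocking_alarms(alarms)


# CPU alarm marked as non-management-affecting: orchestration would ignore it.
cpu_alarm = Alarm("100.101", "critical", mgmt_affecting=False)
# Invented alarm that stays management-affecting: orchestration would stop.
other_alarm = Alarm("999.999", "major", mgmt_affecting=True)
```

With only `cpu_alarm` present, `can_proceed` returns `True`; adding `other_alarm` makes it return `False`.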

tags: added: stx.3.0 stx.metal
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
Ghada Khalil (gkhalil) wrote :

stx.3.0 / medium priority - affects patch orchestration

Changed in starlingx:
assignee: nobody → Kevin Smith (kevin.smith.wrs)
Bart Wensley (bartwensley) wrote :

I disagree with having patch orchestration ignore the CPU alarms. If there happened to be a (bad) patch that caused run-away CPU usage in some configurations, we would want patch orchestration to stop after this condition was detected on the first set of hosts. Otherwise, patch orchestration would continue and hose the entire system.

The right thing to do here is to update the CPU alarm detection logic to be more tolerant.
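One way to picture "more tolerant" detection is to raise the alarm only after several consecutive readings exceed the threshold, so that a short spike during patching does not trip it. The sketch below assumes that approach; it is not the fix that was merged for bug 1848580. The class name, the `required_samples` parameter, and its default of 3 are hypothetical. The 95% threshold matches the log above.

```python
class DebouncedCpuAlarm:
    """Raise a CPU alarm only after N consecutive over-threshold samples."""

    def __init__(self, threshold: float = 95.0, required_samples: int = 3):
        self.threshold = threshold
        self.required_samples = required_samples  # hypothetical tuning knob
        self._streak = 0       # consecutive samples above the threshold
        self.raised = False

    def update(self, usage_percent: float) -> bool:
        """Feed one CPU usage sample; return whether the alarm is raised."""
        if usage_percent > self.threshold:
            self._streak += 1
        else:
            self._streak = 0   # any normal reading resets the count
        self.raised = self._streak >= self.required_samples
        return self.raised


alarm = DebouncedCpuAlarm(threshold=95.0, required_samples=3)
# One spike, a normal reading, three sustained spikes, then recovery.
results = [alarm.update(u) for u in (100.07, 50.0, 100.0, 100.0, 100.0, 10.0)]
```

A runaway patch that keeps CPU usage pinned would still raise the alarm after the third sample, so orchestration would still stop on the first set of hosts.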

Ghada Khalil (gkhalil) wrote :

Based on Bart's input, going back to the option where we keep the patch orchestration logic as is. Making the CPU alarm more tolerant is handled by: https://bugs.launchpad.net/starlingx/+bug/1848580

Marking as duplicate

Changed in starlingx:
assignee: Kevin Smith (kevin.smith.wrs) → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

Duplicate bug is fixed by:
https://review.opendev.org/#/c/690791/
https://review.opendev.org/#/c/690794/

Merged on 2019-10-25
Marking as Fix Released

Changed in starlingx:
status: Triaged → Fix Released