StarlingX

Patch Orchestration fails with in-service patches

Bug #1848541 reported by David Sullivan on 2019-10-17

This bug report is a duplicate of: Bug #1848580: Platform cpu alarm triggered for too short a duration.

This bug affects 1 person

Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Eric MacDonald

Bug Description

Brief Description
-----------------
During an in-service patch, some worker nodes experienced CPU usage alarms for more than 30s. This caused patch orchestration to fail.

Severity
--------
Major

Steps to Reproduce
------------------
On a 2 + 20 system, launch 600 pods
Perform patch orchestration with an in-service patch
Patch 10 worker nodes in parallel

Expected Behavior
------------------
Patch orchestration completes

Actual Behavior
----------------
Patch orchestration fails due to platform CPU alarms

Reproducibility
---------------
Reproducible

System Configuration
--------------------
2 + 20

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Last Pass
---------
Passed April 29th.

Timestamp/Logs
--------------
2019-10-09T18:42:30.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.07%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-35-60.host=compute-4", "severity" : "critical", "state" : "clear", "timestamp" : "2019-10-09 18:42:30.234560" }
2019-10-09T18:42:29.151 controller-0 VIM_Thread[1100113] INFO _strategy.py.382 Apply Complete Callback, result=failed, reason=alarms from platform are present.
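When triaging logs like the ones above, it can help to pull the JSON payload out of the fmManager line and read its fields directly. The following is a minimal sketch, assuming the payload always starts at the first `{` on the line and that `entity_instance_id` is a dot-separated list of `key=value` pairs, as in the quoted entry. The helper names `parse_fm_event` and `host_of` are hypothetical and not part of StarlingX.

```python
import json

# The fmManager log line quoted in this report, copied verbatim.
LOG_LINE = (
    '2019-10-09T18:42:30.000 controller-0 fmManager: info '
    '{ "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; '
    'threshold 95.00%, actual 100.07%", "entity_instance_id" : '
    '"region=RegionOne.system=yow-cgcs-wildcat-35-60.host=compute-4", '
    '"severity" : "critical", "state" : "clear", '
    '"timestamp" : "2019-10-09 18:42:30.234560" }'
)


def parse_fm_event(line):
    """Return the JSON payload of an fmManager log line as a dict.

    Assumption: the payload begins at the first '{' and runs to end of line.
    """
    start = line.index("{")
    return json.loads(line[start:])


def host_of(event):
    """Extract the host name from the dotted entity_instance_id.

    Assumption: no value in the id contains a '.', which holds for the
    quoted entry but may not hold for every deployment.
    """
    pairs = event["entity_instance_id"].split(".")
    fields = dict(pair.split("=", 1) for pair in pairs)
    return fields["host"]


event = parse_fm_event(LOG_LINE)
print(event["event_log_id"], event["state"], host_of(event))
```

For the quoted line, the parsed event has `state` set to `clear` and is logged about one second after the VIM entry that records the failed apply.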

Test Activity
-------------
System Test

Frank Miller (sensfan22) wrote on 2019-10-18:

#1

This issue will be addressed by a fix for this LP: https://bugs.launchpad.net/starlingx/+bug/1848580

Ghada Khalil (gkhalil) wrote on 2019-10-18:

#2

The proposal is for the CPU alarm to be marked as non-management affecting. This will result in patch orchestration ignoring it.
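The proposal in this comment could be sketched roughly as follows. This is only an illustration of the idea, not the orchestration code: the `Alarm` class, the `mgmt_affecting` field name, and the `can_continue` helper are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Hypothetical, simplified alarm record used only for this sketch."""
    alarm_id: str
    severity: str
    mgmt_affecting: bool  # assumed flag: does this alarm block management operations?


def blocking_alarms(alarms):
    """Return only the alarms that should stop an orchestration step."""
    return [alarm for alarm in alarms if alarm.mgmt_affecting]


def can_continue(alarms):
    """Orchestration proceeds when no management-affecting alarm is present."""
    return not blocking_alarms(alarms)


# The platform CPU alarm, marked as non-management affecting per the proposal.
cpu_alarm = Alarm("100.101", "critical", mgmt_affecting=False)
# A made-up alarm that still blocks orchestration.
other_alarm = Alarm("example.1", "major", mgmt_affecting=True)

print(can_continue([cpu_alarm]))               # CPU alarm alone is ignored
print(can_continue([cpu_alarm, other_alarm]))  # a blocking alarm still stops it
```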

tags:	added: stx.3.0 stx.metal
Changed in starlingx:
status:	New → Triaged
importance:	Undecided → Medium

Ghada Khalil (gkhalil) wrote on 2019-10-18:

#3

stx.3.0 / medium priority - affects patch orchestration

Changed in starlingx:
assignee:	nobody → Kevin Smith (kevin.smith.wrs)

Bart Wensley (bartwensley) wrote on 2019-10-18:

#4

I disagree with having patch orchestration ignore the CPU alarms. If there happened to be a (bad) patch that caused run-away CPU usage in some configurations, we would want patch orchestration to stop after this condition was detected on the first set of hosts. Otherwise, patch orchestration would continue and hose the entire system.

The right thing to do here is to update the CPU alarm detection logic to be more tolerant.
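One generic way to make threshold detection more tolerant is to debounce it: raise the alarm only after several consecutive samples exceed the threshold. The sketch below illustrates that idea only. It is not the fix that was merged for the duplicate bug, and the class name, the sample values, and the choice of three consecutive samples are all assumptions.

```python
class DebouncedThresholdAlarm:
    """Raise only after `required_samples` consecutive readings over threshold."""

    def __init__(self, threshold, required_samples):
        self.threshold = threshold
        self.required_samples = required_samples
        self._over_count = 0  # consecutive over-threshold readings so far
        self.raised = False

    def update(self, value):
        """Feed one reading and return whether the alarm is currently raised."""
        if value > self.threshold:
            self._over_count += 1
        else:
            # Any reading at or below the threshold resets the streak.
            self._over_count = 0
        self.raised = self._over_count >= self.required_samples
        return self.raised


alarm = DebouncedThresholdAlarm(threshold=95.0, required_samples=3)

# A short spike followed by sustained high usage (made-up readings, in percent).
readings = [100.07, 97.0, 90.0, 99.0, 99.0, 99.0]
print([alarm.update(value) for value in readings])
```

With this shape, a brief spike during patching does not raise the alarm, while sustained run-away CPU usage still does, so orchestration would still stop on the first set of hosts in that case.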

Ghada Khalil (gkhalil) wrote on 2019-10-18:

#5

Based on Bart's input, going back to the option where we keep the patch orchestration logic as is. Making the CPU alarm more tolerant is handled by: https://bugs.launchpad.net/starlingx/+bug/1848580

Marking as duplicate

Changed in starlingx:
assignee:	Kevin Smith (kevin.smith.wrs) → Eric MacDonald (rocksolidmtce)

Ghada Khalil (gkhalil) wrote on 2019-10-30:

#6

Duplicate bug is fixed by:
https://review.opendev.org/#/c/690791/
https://review.opendev.org/#/c/690794/

Merged on 2019-10-25
Marking as Fix Released

Changed in starlingx:
status:	Triaged → Fix Released
