Distributed Cloud: Patch Orchestration Loop

Bug #1788882 reported by Kristine Bujold
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Kristine Bujold

Bug Description

Brief Description
-----------------
2 System Controllers
2 subclouds, “One node system” and “Two node system”

While preparing for Distributed Cloud patch orchestration, I removed a patch and started patch orchestration again, but this time I did not notice that I had selected 'migration' as the default instance action.

This caused the “One node system” subcloud to fail to apply/remove the patch (even though I had no instances running) and, for some reason, caused the “Two node system” subcloud to loop indefinitely on the patch orchestration action: about 10 hours and counting.

I see this message looping in dcmanager.log on the active system controller:

2018-07-17 08:08:25.027 176949 ERROR dcmanager.manager.sw_update_manager self.apply(sw_update_strategy)
2018-07-17 08:08:25.027 176949 ERROR dcmanager.manager.sw_update_manager File "/usr/lib/python2.7/site-packages/dcmanager/manager/sw_update_manager.py", line 489, in apply
2018-07-17 08:08:25.027 176949 ERROR dcmanager.manager.sw_update_manager LOG.debug("Working on stage %d" % current_stage)
2018-07-17 08:08:25.027 176949 ERROR dcmanager.manager.sw_update_manager TypeError: %d format: a number is required, not NoneType
2018-07-17 08:08:25.027 176949 ERROR dcmanager.manager.sw_update_manager
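
The TypeError in the traceback above comes from formatting None with %d. The following is a minimal sketch of that failure and of one way to make the log message tolerant of a missing stage. The helper names (eager_format_fails, describe_stage) are hypothetical and are not taken from dcmanager; this is not necessarily how the released fix works.

```python
def eager_format_fails(current_stage):
    """Reproduce the eager %d formatting seen in the traceback.

    Returns True if formatting raises TypeError, False otherwise.
    """
    try:
        "Working on stage %d" % current_stage
    except TypeError:
        return True
    return False


def describe_stage(current_stage):
    """Build the same message with %s, which accepts None as well as ints."""
    # Wrapping the value in a one-element tuple keeps the formatting
    # unambiguous regardless of the argument's type.
    return "Working on stage %s" % (current_stage,)
```

With current_stage set to None, the %d form raises, while the %s form yields a message that can still be logged, which makes the unexpected value visible rather than fatal.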

Steps to Reproduce
------------------
The exact scenario may be difficult to reproduce, but it is probably possible by getting a patch remove/add to fail on one subcloud in a multiple-subcloud scenario with the stop-on-failure flag set in the patch strategy.

Alternatively, one could create an error scenario by manually modifying the dcmanager database.

Expected Behavior
------------------
The overall strategy should fail, rather than a specific step getting stuck in an exception loop.
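
The expected behavior can be sketched as a worker pass that turns any unexpected exception into a failed strategy, so later passes have nothing left to retry. This is only an illustration under stated assumptions: the Strategy class, the state names, and the apply/run_once/run functions are all hypothetical stand-ins and do not reflect the actual dcmanager code or its fix.

```python
STATE_APPLYING = "applying"  # hypothetical state names
STATE_FAILED = "failed"


class Strategy(object):
    """Hypothetical stand-in for a software update strategy record."""

    def __init__(self, current_stage=None):
        self.state = STATE_APPLYING
        self.current_stage = current_stage  # None mimics the reported case


def apply(strategy):
    # Stand-in for the real apply step; raises TypeError when the stage is None.
    return "Working on stage %d" % strategy.current_stage


def run_once(strategy):
    """Run one pass. Returns True only if the pass applied cleanly."""
    if strategy.state != STATE_APPLYING:
        return False  # strategy already finished or failed: nothing to do
    try:
        apply(strategy)
    except Exception:
        # Fail the whole strategy instead of hitting the same error again
        # on every subsequent pass.
        strategy.state = STATE_FAILED
        return False
    return True


def run(strategy, max_passes=100):
    """Repeat passes until one fails or max_passes is reached."""
    passes = 0
    while passes < max_passes and run_once(strategy):
        passes += 1
    return passes
```

In this sketch the first failing pass moves the strategy to the failed state, and every later pass returns immediately, which matches the expectation that the strategy as a whole fails rather than a step looping on the same exception.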

Actual Behavior
----------------
The strategy step fails and gets stuck in an exception loop. The system is not recoverable without manual intervention.

Reproducibility
---------------
Need to hit a specific failure scenario. Probably hard to reproduce.

System Configuration
--------------------
2 System Controllers
2 subclouds, “One node system” and “Two node system”

Branch/Pull Time/Commit
-----------------------
master

Timestamp/Logs
--------------
2018-07-17 08:08:25.027

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Kristine Bujold (kbujold)
description: updated
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.2018.10 stx.distcloud
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
importance: High → Medium
Kristine Bujold (kbujold) wrote :
Changed in starlingx:
status: Triaged → In Progress
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.1.0
removed: stx.2018.10