StarlingX

Distributed Cloud: Patch Orchestration Loop

Bug #1788882 reported by Kristine Bujold on 2018-08-24

This bug affects 1 person

Affects      Status         Importance   Assigned to       Milestone
StarlingX    Fix Released   Medium       Kristine Bujold

Bug Description

Brief Description
-----------------
2 System Controllers
2 subclouds, “One node system” and “Two node system”

While preparing for a Distributed Cloud patch orchestration, I removed a patch and started patch orchestration again, but this time I did not notice that I had selected 'migration' as the default instance action.

This caused the “One node system” subcloud to fail to apply/remove the patch (even though I had no instances running) and, for some reason, caused the “Two node system” subcloud to loop indefinitely on the patch orchestration action: about 10 hours and counting.

I see this message repeating in dcmanager.log on the active system controller:

2018-07-17 08:08:25.027 176949 ERROR dcmanager.manager.sw_update_manager self.apply(sw_update_strategy)
2018-07-17 08:08:25.027 176949 ERROR dcmanager.manager.sw_update_manager File "/usr/lib/python2.7/site-packages/dcmanager/manager/sw_update_manager.py", line 489, in apply
2018-07-17 08:08:25.027 176949 ERROR dcmanager.manager.sw_update_manager LOG.debug("Working on stage %d" % current_stage)
2018-07-17 08:08:25.027 176949 ERROR dcmanager.manager.sw_update_manager TypeError: %d format: a number is required, not NoneType
2018-07-17 08:08:25.027 176949 ERROR dcmanager.manager.sw_update_manager
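The TypeError in the traceback can be reproduced in isolation. The following is a minimal sketch, not the dcmanager code and not necessarily the fix that was merged: the function names are hypothetical, and only the format string is taken from the log. It shows that formatting None with %d raises before the logger is ever called, while %s tolerates None.

```python
def describe_stage_eager(current_stage):
    # Same formatting as the logged line: the % operator runs before any
    # logging call sees the message, so None raises TypeError regardless
    # of the configured log level.
    return "Working on stage %d" % current_stage


def describe_stage_safe(current_stage):
    # %s accepts any object, including None, so the message always renders.
    return "Working on stage %s" % current_stage


def reproduce():
    # Returns the TypeError text when the eager variant is given None,
    # or None if no exception was raised.
    try:
        describe_stage_eager(None)
    except TypeError as exc:
        return str(exc)
    return None
```

Switching to %s only avoids the crash in the log statement; whether a missing current stage should be treated as an error in its own right is a separate question.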

Steps to Reproduce
------------------
The exact scenario may be difficult to reproduce, but it should be possible by forcing a patch remove/add to fail on one subcloud in a multiple-subcloud setup with the stop-on-failure flag set in the patch strategy.

Alternatively, one could create an error scenario by manually modifying the dcmanager database.

Expected Behavior
------------------
The overall strategy should fail, rather than a specific step getting stuck in an exception loop.

Actual Behavior
----------------
The strategy step fails and gets stuck in an exception loop. The system is not recoverable without manual intervention.
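The difference between the expected and actual behavior can be illustrated with a small sketch. Everything here is hypothetical (the Strategy class, the state names, apply_once and run_audit_loop are stand-ins, not dcmanager APIs); it only assumes that the strategy is processed by a repeating pass and that an unexpected exception in a pass should mark the whole strategy as failed instead of being retried forever.

```python
APPLYING = "applying"
COMPLETE = "complete"
FAILED = "failed"


class Strategy(object):
    """Hypothetical stand-in for a stored update strategy."""

    def __init__(self, current_stage):
        self.state = APPLYING
        self.current_stage = current_stage


def apply_once(strategy):
    """One pass of a hypothetical apply routine; raises on unexpected state."""
    if strategy.current_stage is None:
        raise ValueError("no stage is in progress")
    strategy.state = COMPLETE


def run_audit_loop(strategy, max_passes=5):
    """Run apply passes while the strategy is still applying.

    Returns the number of passes executed. max_passes only bounds the
    sketch so it always terminates.
    """
    passes = 0
    while strategy.state == APPLYING and passes < max_passes:
        passes += 1
        try:
            apply_once(strategy)
        except Exception:
            # Fail the whole strategy so the next pass does not hit the
            # same exception again.
            strategy.state = FAILED
    return passes
```

With a strategy whose current stage is None, the loop runs one pass and leaves the strategy in the failed state. Without the except branch changing the state, the same exception would be raised on every pass, which matches the looping described above.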

Reproducibility
---------------
Need to hit a specific failure scenario. Probably hard to reproduce.

System Configuration
--------------------
2 System Controllers
2 subclouds, “One node system” and “Two node system”

Branch/Pull Time/Commit
-----------------------
master

Timestamp/Logs
--------------
2018-07-17 08:08:25.027

See original description

Tags:

Ghada Khalil (gkhalil) on 2018-08-24

Changed in starlingx:
assignee:	nobody → Kristine Bujold (kbujold)

Kristine Bujold (kbujold) on 2018-08-24

description:

updated

Ghada Khalil (gkhalil) on 2018-08-24

Changed in starlingx:
status:	New → Triaged

Kristine Bujold (kbujold) on 2018-08-24

description:

updated

Kristine Bujold (kbujold) on 2018-08-24

description:

updated

Ghada Khalil (gkhalil) on 2018-08-24

tags:

added: stx.2018.10 stx.distcloud

Ghada Khalil (gkhalil) on 2018-08-24

Changed in starlingx:
importance:	Undecided → High
importance:	High → Medium


Kristine Bujold (kbujold) wrote on 2018-08-27:

Merged into starlingx-staging
https://github.com/starlingx-staging/stx-distcloud/pull/2

Changed in starlingx:
status:	Triaged → In Progress
status:	In Progress → Fix Released

Ken Young (kenyis) on 2019-04-06

tags:

added: stx.1.0
removed: stx.2018.10
