Distributed Cloud: Patch Orchestration Loop
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Kristine Bujold |
Bug Description
Brief Description
-----------------
2 System Controllers
2 subclouds, “One node system” and “Two node system”
While preparing for a Distributed Cloud Patch Orchestration I removed a patch and started patch orchestration again, but this time I didn't notice that I had selected 'migration' as the default instance action.
This caused the “One node system” subcloud to fail to apply/remove the patch (even though I didn't have any instances running) and, for some reason, caused my “Two node system” subcloud to loop indefinitely on the patch orchestration action: about 10 hours and counting.
I see this message looping in dcmanager.log on the active system controller:
2018-07-17 08:08:25.027 176949 ERROR dcmanager.
2018-07-17 08:08:25.027 176949 ERROR dcmanager.
2018-07-17 08:08:25.027 176949 ERROR dcmanager.
2018-07-17 08:08:25.027 176949 ERROR dcmanager.
2018-07-17 08:08:25.027 176949 ERROR dcmanager.
Steps to Reproduce
------------------
It may be difficult to reproduce the exact scenario, but it is probably possible by getting a patch remove/add to fail on one subcloud in a multiple-subcloud scenario with the stop-on-failure flag set in the patch strategy.
Alternatively, one could create an error scenario by manually modifying the dcmanager database.
Expected Behavior
------------------
The overall strategy should fail, rather than a specific step getting stuck in an exception loop.
Actual Behavior
----------------
The strategy step fails and is stuck in an exception loop. The system is not recoverable without manual intervention.
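To make the difference between the expected and actual behavior concrete, here is a minimal sketch of an orchestration loop that fails the overall strategy instead of looping on a step that raised. It is not the dcmanager implementation: the `Strategy`, `Step`, `State`, and `run_strategy` names are hypothetical and chosen only for illustration, and the only concept taken from this report is the stop-on-failure flag.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class State(Enum):
    INITIAL = "initial"
    COMPLETE = "complete"
    FAILED = "failed"
    ABORTED = "aborted"


@dataclass
class Step:
    # Hypothetical model of one per-subcloud strategy step.
    subcloud: str
    action: Callable[[], None]  # stands in for the patch apply/remove call
    state: State = State.INITIAL
    details: str = ""


@dataclass
class Strategy:
    # Hypothetical model of the overall patch strategy.
    steps: List[Step]
    stop_on_failure: bool = True
    state: State = State.INITIAL


def run_strategy(strategy: Strategy) -> Strategy:
    """Run each step at most once; a step that raises is never retried."""
    failed = False
    for step in strategy.steps:
        if failed and strategy.stop_on_failure:
            # An earlier step failed: skip the remaining steps.
            step.state = State.ABORTED
            continue
        try:
            step.action()
            step.state = State.COMPLETE
        except Exception as exc:
            # Record the error on the step and move on instead of looping.
            step.state = State.FAILED
            step.details = str(exc)
            failed = True
    # Any failed step fails the strategy as a whole.
    strategy.state = State.FAILED if failed else State.COMPLETE
    return strategy
```

In this sketch, an exception in one subcloud's step is caught once, recorded on that step, and propagated to the strategy state, so the orchestration ends in a terminal failed state that an operator can delete and recreate.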
Reproducibility
---------------
Need to hit a specific failure scenario. Probably hard to reproduce.
System Configuration
-------
2 System Controllers
2 subclouds, “One node system” and “Two node system”
Branch/Pull Time/Commit
-------
master
Timestamp/Logs
--------------
2018-07-17 08:08:25.027
Changed in starlingx: | |
assignee: | nobody → Kristine Bujold (kbujold) |
description: | updated |
Changed in starlingx: | |
status: | New → Triaged |
description: | updated |
description: | updated |
tags: | added: stx.2018.10 stx.distcloud |
Changed in starlingx: | |
importance: | Undecided → High |
importance: | High → Medium |
tags: | added: stx.1.0 removed: stx.2018.10 |
Merged into starlingx-staging: https://github.com/starlingx-staging/stx-distcloud/pull/2