Stop deploy task fails if any node is inaccessible
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Released | High | Vladimir Sharshov |
Bug Description
1. Start deployment of any env (I used Neutron VLAN in HA, only 3 controllers, no computes).
2. Turn off one node when provisioning is finished.
3. During the deployment of the other nodes, click the Stop Deployment button.
You will observe debug messages in the Astute log:
2014-02-23 20:09:11 DEBUG [10800] Retry #1 to run mcollective agent on nodes: '1'
After 5 retries it fails with:
2014-02-23 20:12:42 ERR [10800] Error running RPC method stop_deploy_task: 3142e9b5-, trace: ["/opt/
2014-02-23 20:12:42 ERR [10800] MCollective agents '1' didn't respond within the allotted time.
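This is not Astute's actual code, but a minimal Ruby sketch of the retry loop implied by the log above (method and constant names are illustrative): the stop task keeps retrying the MCollective agent on the full node set, and if any single node never answers, the whole RPC method errors out.

```ruby
# Hypothetical sketch of the observed behaviour: names are illustrative,
# not Astute's real API.
RETRIES = 5

# Stand-in for an MCollective agent call: returns the subset of the
# requested nodes that actually answered.
def run_agent(nodes, responding_nodes)
  nodes & responding_nodes
end

def stop_on_nodes(nodes, responding_nodes)
  pending = nodes.dup
  RETRIES.times do |attempt|
    answered = run_agent(pending, responding_nodes)
    pending -= answered
    return :ok if pending.empty?
    puts "Retry ##{attempt + 1} to run mcollective agent on nodes: '#{pending.join(',')}'"
  end
  # After the retries are exhausted, the whole stop task fails, even
  # though the unreachable node is exactly what we want to recover from.
  raise "MCollective agents '#{pending.join(',')}' didn't respond within the allotted time."
end
```

With node '1' powered off, `stop_on_nodes(['1', '2', '3'], ['2', '3'])` burns all five retries on node '1' and raises, which matches the error above.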
The expected behavior is that we can stop deployment or provisioning even if there is a disaster with one or a few nodes; that is actually the basic use case of the feature: you stop the deployment, bring the failing node back up, and continue the deployment.
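The expected behaviour described above could be sketched like this (again purely illustrative Ruby, not Astute's implementation): send the stop RPC only to the nodes that are reachable, and report the offline ones instead of failing the whole task.

```ruby
# Hypothetical sketch of the expected behaviour: stop deployment on the
# reachable nodes and merely report the offline ones.
def stop_deploy(nodes, online_nodes)
  reachable, offline = nodes.partition { |n| online_nodes.include?(n) }
  # The stop RPC would be sent only to `reachable` here (stubbed out).
  { stopped: reachable, skipped: offline }
end
```

For example, with node '1' down, `stop_deploy(%w[1 2 3], %w[2 3])` would stop nodes '2' and '3' and skip node '1', leaving the operator free to bring it back and continue.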
While reading the logs I also noticed a few pretty strange things that should be reported as separate bugs:
* Some log lines are duplicated: one copy is a plain Ruby hash, the other is its JSON representation. Low-priority issue.
* The stop deployment task, it seems, did not actually break the initial deployment task: the log showed a continuing attempt to run Puppet on one of the nodes. Puppet had previously been killed on that node and the deployment failed because of a leftover lockfile. Do we really kill that thread (actually a fork, so I don't know how we approach it at the moment)?
Env: {"build_id": "2014-02-
Changed in fuel:
status: Confirmed → In Progress
tags: added: in progress
Related fix proposed to branch: master
Review: https://review.openstack.org/76095