Stop deploy task fails if any node is inaccessible

Bug #1283812 reported by Mike Scherbakov
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Fix Released
Importance: High
Assigned to: Vladimir Sharshov
Milestone: (not listed)

Bug Description

1. Start deployment of any environment (I used Neutron VLAN in HA mode, with only 3 controllers and no computes).
2. Turn off one node once provisioning has finished.
3. While the other nodes are still deploying, click the Stop Deployment button.

You will see debug messages like this in the Astute log:
2014-02-23 20:09:11 DEBUG [10800] Retry #1 to run mcollective agent on nodes: '1'
After 5 retries it fails with:
2014-02-23 20:12:42 ERR [10800] Error running RPC method stop_deploy_task: 3142e9b5-c981-48c7-856b-db3752e911c9: MCollective agents '1' didn't respond within the allotted time.
, trace: ["/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/astute-0.0.2/lib/astute/mclient.rb:113:in `check_results_with_retries'", "/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/astute-0.0.2/lib/astute/mclient.rb:61:in `method_missing'", "/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/astute-0.0.2/lib/astute/orchestrator.rb:162:in `stop_puppet_deploy'", "/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/dispatcher.rb:187:in `stop_current_task'", "/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/dispatcher.rb:156:in `stop_deploy_task'", "/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:132:in `dispatch_message'", "/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:85:in `block in dispatch'", "/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:83:in `each'", "/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:83:in `each_with_index'", "/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:83:in `dispatch'", "/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:78:in `block in perform_service_job'"]
2014-02-23 20:12:42 ERR [10800] MCollective agents '1' didn't respond within the allotted time.

The expected behavior is that deployment or provisioning can be stopped even if one of the nodes has failed; this is the basic use case of the feature. You stop the deployment, bring the failed node back up, and then continue the deployment.
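To make the expected behavior concrete, here is a minimal Ruby sketch of a stop request that tolerates unresponsive nodes. It is only an illustration under stated assumptions: `FakeRpcClient`, `stop_puppet` and `stop_deploy` are hypothetical names invented for this example and are not the Astute or MCollective API. The idea is that nodes which never answer are reported as inaccessible instead of failing the whole stop task.

```ruby
# Hypothetical stand-in for an RPC client; NOT the real MCollective/Astute API.
class FakeRpcClient
  def initialize(reachable_ids)
    @reachable_ids = reachable_ids
  end

  # Returns the subset of the requested node ids that answered.
  def stop_puppet(node_ids)
    node_ids & @reachable_ids
  end
end

# Ask every node to stop, retrying only the nodes that have not answered yet.
# Nodes that stay silent after the last retry are reported, not raised as an error.
def stop_deploy(client, node_ids, retries: 5)
  pending = node_ids.dup
  stopped = []
  (retries + 1).times do
    break if pending.empty?
    answered = client.stop_puppet(pending)
    stopped.concat(answered)
    pending -= answered
  end
  { stopped: stopped, inaccessible: pending }
end

# Node '1' is powered off, nodes '2' and '3' respond.
result = stop_deploy(FakeRpcClient.new(%w[2 3]), %w[1 2 3])
puts result.inspect
```

With a result like this, the caller could mark node '1' as offline and still consider the stop request itself successful.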

I also noticed a few strange things in the logs, which need to be reported as separate bugs:
* Some log lines are duplicated: one copy is a plain Ruby hash, the other is the same data in JSON format. This is a low-priority issue.
* The stop deployment task apparently did not interrupt the initial deployment task: the log kept showing attempts to run Puppet on one of the nodes. Puppet had previously been killed on that node, and the deployment failed because a lockfile was present. Do we really kill that thread? (Actually, it is a fork, so I don't know how we handle it at the moment.)

Env: {"build_id": "2014-02-23_01-17-30", "mirantis": "yes", "build_number": "180", "nailgun_sha": "f786786894acc331a4b53b31f33e373ef95ccdfc", "ostf_sha": "b8f16a0288cbf39e11e0b4a41a3f63e6b87dcc4b", "fuelmain_sha": "421f2aaa7b1494e899368908b295acc5fe7e012f", "astute_sha": "3d43abeefb60677ce6cae83d31ebbba1ff3cdbe2", "release": "4.1", "fuellib_sha": "e3ea44c3b607f37401a268a91956c9d222a81bab"}

Tags: in progress
Mike Scherbakov (mihgen) wrote :
Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/76095

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/76098

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/76095
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=ca8e2b714b60125bf9d1d7c671f8761ea8d4be39
Submitter: Jenkins
Branch: master

commit ca8e2b714b60125bf9d1d7c671f8761ea8d4be39
Author: Vladimir Sharshov <email address hidden>
Date: Tue Feb 25 09:53:56 2014 +0400

    Ignore inaccessible nodes when try to stop a deploy

    Change-Id: I26854c24acffa1f56d6f5fb9c361ff77d000617e
    Related-Bug: #1283812

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/76098
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=511153a10a8e1d5bbc0bbfd9078eebed04bb22a1
Submitter: Jenkins
Branch: master

commit 511153a10a8e1d5bbc0bbfd9078eebed04bb22a1
Author: Vladimir Sharshov <email address hidden>
Date: Mon Feb 24 13:57:23 2014 +0400

    New way to stop a main thread

    Use the kill instead of raise a custom exception.
    For some reason mcollective capture all exceptions
    if one of node becames inaccessible.

    Bug 1282065 closes because the problem condition was deleted.

    Change-Id: Ia7b9ef9734883a470bea592c398359f75b807d45
    Closes-Bug: #1283812
    Closes-Bug: #1282065
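
The difference between the two stop mechanisms named in the commit message above can be illustrated with plain Ruby threads. This is a sketch, not the actual Astute or MCollective code: the worker below simply assumes a retry loop with a broad `rescue Exception`, which is one way "capturing all exceptions" could look. `StopRequested` and `start_worker` are hypothetical names. An exception delivered with `Thread#raise` is swallowed by such a rescue, whereas `Thread#kill` is not an exception and cannot be rescued.

```ruby
# Hypothetical custom exception used to request a stop.
class StopRequested < StandardError; end

# Worker that mimics a retry loop with a broad rescue (an assumption made
# for this example): any exception raised during the blocking call is swallowed.
def start_worker
  Thread.new do
    loop do
      begin
        sleep 10 # stands in for a long, blocking RPC call
      rescue Exception
        # Swallowed: the loop simply starts the next attempt.
      end
    end
  end
end

# 1. Try to stop the worker by raising a custom exception inside it.
raised = start_worker
sleep 0.1                          # let the worker enter the blocking call
raised.raise(StopRequested, "stop")
sleep 0.2
survived_raise = raised.alive?     # the rescue swallowed the exception

# 2. Stop the worker with kill instead.
killed = start_worker
sleep 0.1
killed.kill
killed.join(1)
stopped_by_kill = !killed.alive?   # kill bypasses rescue clauses

raised.kill                        # clean up the first worker
puts "survived raise: #{survived_raise}, stopped by kill: #{stopped_by_kill}"
```

Note that `Thread#kill` still runs `ensure` blocks, so cleanup code placed there is executed, but nothing in a `rescue` clause gets a chance to keep the thread alive.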

Changed in fuel:
status: In Progress → Fix Committed
tags: added: in progress
Andrey Sledzinskiy (asledzinskiy) wrote :

Verified on ISO #211

Revision:baa8bb07393698f1186cb67bb65f1b93907c59bd
origin/master

Changed in fuel:
status: Fix Committed → Fix Released