Fuel for OpenStack

Stop deploy tasks fails if any node was inaccessible

Bug #1283812 reported by Mike Scherbakov on 2014-02-23

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Released	High	Vladimir Sharshov	Fuel for OpenStack 4.1

Bug Description

1. Start deployment of any environment (I used Neutron VLAN in HA mode with 3 controllers and no computes).
2. Turn off one node once provisioning has finished.
3. While the other nodes are being deployed, click the Stop Deployment button.

You will observe debug messages in Astute log:
2014-02-23 20:09:11 DEBUG [10800] Retry #1 to run mcollective agent on nodes: '1'
After 5 retries it fails with:
2014-02-23 20:12:42 ERR [10800] Error running RPC method stop_deploy_task: 3142e9b5-c981-48c7-856b-db3752e911c9: MCollective agents '1' didn't respond within the allotted time.
, trace: ["/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/astute-0.0.2/lib/astute/mclient.rb:113:in `check_results_with_retries'",
"/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/astute-0.0.2/lib/astute/mclient.rb:61:in `method_missing'",
"/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/astute-0.0.2/lib/astute/orchestrator.rb:162:in `stop_puppet_deploy'",
"/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/dispatcher.rb:187:in `stop_current_task'",
"/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/dispatcher.rb:156:in `stop_deploy_task'",
"/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:132:in `dispatch_message'",
"/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:85:in `block in dispatch'",
"/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:83:in `each'",
"/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:83:in `each_with_index'",
"/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:83:in `dispatch'",
"/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/naily-0.1.0/lib/naily/server.rb:78:in `block in perform_service_job'"]
2014-02-23 20:12:42 ERR [10800] MCollective agents '1' didn't respond within the allotted time.

Expected behavior: stopping deployment or provisioning should work even when one or more nodes have failed. This is the basic use case of the feature: you stop the deployment, bring the failing node back up, and continue the deployment.
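The difference between the observed and the expected behavior can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: `FakeRpcClient`, `NodesUnresponsive` and `run_with_retries` are hypothetical names, not the Astute or MCollective API, and the retry count of 5 is taken from the log above.

```ruby
# Hypothetical stand-in for an RPC client; not the real Astute/MCollective API.
class FakeRpcClient
  def initialize(reachable)
    @reachable = reachable
  end

  # Returns the subset of node ids that answered the request.
  def call(_action, node_ids)
    node_ids.select { |id| @reachable.include?(id) }
  end
end

# Hypothetical error type for nodes that never answered.
class NodesUnresponsive < StandardError; end

# Runs `action` on all nodes and retries only the nodes that stayed silent.
# ignore_unresponsive: false raises once the retries are used up (observed behavior);
# ignore_unresponsive: true reports the silent nodes instead (expected behavior).
def run_with_retries(client, action, node_ids, retries: 5, ignore_unresponsive: false)
  pending = node_ids.dup
  attempt = 0
  loop do
    pending -= client.call(action, pending)
    break if pending.empty? || attempt >= retries
    attempt += 1
  end
  if !pending.empty? && !ignore_unresponsive
    raise NodesUnresponsive, "agents '#{pending.join(',')}' didn't respond within the allotted time"
  end
  { responded: node_ids - pending, inaccessible: pending }
end
```

With node '1' powered off, the strict call raises after the last retry, while the tolerant call returns node '1' under `:inaccessible` so the stop can still complete on the remaining nodes.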

While looking through the logs I also noticed a few odd things, which should be reported as separate bugs:
* Some log lines are duplicated: one copy is a plain Ruby hash, the other is its JSON form. Low priority.
* The stop deployment task does not appear to have interrupted the initial deployment task: the log continued with an attempt to run Puppet on one of the nodes. Puppet had already been killed on that node, and the deployment failed because a lockfile was left behind. Do we really kill that thread? (It is actually a fork, so I don't know how we handle it at the moment.)

Env: {"build_id": "2014-02-23_01-17-30", "mirantis": "yes", "build_number": "180", "nailgun_sha": "f786786894acc331a4b53b31f33e373ef95ccdfc", "ostf_sha": "b8f16a0288cbf39e11e0b4a41a3f63e6b87dcc4b", "fuelmain_sha": "421f2aaa7b1494e899368908b295acc5fe7e012f", "astute_sha": "3d43abeefb60677ce6cae83d31ebbba1ff3cdbe2", "release": "4.1", "fuellib_sha": "e3ea44c3b607f37401a268a91956c9d222a81bab"}

Revision history for this message

Mike Scherbakov (mihgen) wrote on 2014-02-23:

fuel-snapshot-2014-02-23_20-15-25.tgz (2.4 MiB, application/x-tar)

Vladimir Sharshov (vsharshov) on 2014-02-24

Changed in fuel:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2014-02-25: Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/76095

OpenStack Infra (hudson-openstack) wrote on 2014-02-25: Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/76098

OpenStack Infra (hudson-openstack) wrote on 2014-02-25: Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/76095
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=ca8e2b714b60125bf9d1d7c671f8761ea8d4be39
Submitter: Jenkins
Branch: master

commit ca8e2b714b60125bf9d1d7c671f8761ea8d4be39
Author: Vladimir Sharshov <email address hidden>
Date: Tue Feb 25 09:53:56 2014 +0400

Ignore inaccessible nodes when try to stop a deploy

Change-Id: I26854c24acffa1f56d6f5fb9c361ff77d000617e
Related-Bug: #1283812

OpenStack Infra (hudson-openstack) wrote on 2014-02-25: Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/76098
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=511153a10a8e1d5bbc0bbfd9078eebed04bb22a1
Submitter: Jenkins
Branch: master

commit 511153a10a8e1d5bbc0bbfd9078eebed04bb22a1
Author: Vladimir Sharshov <email address hidden>
Date: Mon Feb 24 13:57:23 2014 +0400

New way to stop a main thread

Use the kill instead of raise a custom exception.
For some reason mcollective capture all exceptions
if one of node becames inaccessible.

Bug 1282065 closes because the problem condition was deleted.

Change-Id: Ia7b9ef9734883a470bea592c398359f75b807d45
Closes-Bug: #1283812
Closes-Bug: #1282065
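
The reasoning in this commit message can be illustrated with a small Ruby sketch. It is not the code from the fuel-web change: `StopDeploy` and `demo_stop` are hypothetical names, and the worker's blanket `rescue Exception` only imitates the behavior the commit message attributes to mcollective. An exception injected with `Thread#raise` is swallowed by such a handler, whereas `Thread#kill` cannot be rescued.

```ruby
# Hypothetical custom exception used to request a stop.
class StopDeploy < StandardError; end

# Starts a worker that swallows every exception, then tries to stop it
# first with Thread#raise and then with Thread#kill.
def demo_stop
  swallowed = []
  worker = Thread.new do
    begin
      sleep                  # blocks forever; stands in for a blocking RPC call
    rescue Exception => e    # blanket handler: the injected exception ends up here
      swallowed << e.class.name
      retry                  # the worker goes straight back to waiting
    end
  end

  sleep 0.1                  # give the worker time to reach the blocking call
  worker.raise(StopDeploy, 'stop requested')
  sleep 0.1
  alive_after_raise = worker.alive?

  worker.kill                # not rescuable; only ensure blocks would run
  worker.join(1)
  {
    swallowed: swallowed.dup,
    alive_after_raise: alive_after_raise,
    alive_after_kill: worker.alive?
  }
end
```

Calling `demo_stop` shows the worker still alive after the injected exception and terminated only after `kill`, which matches the motivation given for the change.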

Changed in fuel:
status:	In Progress → Fix Committed

Andrey Sledzinskiy (asledzinskiy) on 2014-02-26

tags:	added: in progress

Andrey Sledzinskiy (asledzinskiy) wrote on 2014-02-26:

Verified on ISO #211

Revision:baa8bb07393698f1186cb67bb65f1b93907c59bd
origin/master

Changed in fuel:
status:	Fix Committed → Fix Released
