Successful deployment marked as failed by Astute

Bug #1356954 reported by Vladimir Kuklin
This bug affects 7 people
Affects              Status         Importance  Assigned to          Milestone
Fuel for OpenStack   Fix Committed  Critical    Vladimir Sharshov
5.1.x                Invalid        High        Andrey Sledzinskiy

Bug Description

Deployed a nova HA VLAN Ubuntu cluster. The successful deployment is marked as failed by Astute, although all nodes are in the ready state and there are no errors in the puppet logs.

Tags: astute
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Did you check the Astute logs? If Astute failed to upload the Cirros image, it would fail the deployment.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Node 3 with the controller role failed because Astute lost access to this node via MCollective.

2014-08-14T15:06:09 debug: [373] Retry #1 to run mcollective agent on nodes: '3'
2014-08-14T15:06:51 debug: [373] Retry #2 to run mcollective agent on nodes: '3'
2014-08-14T15:07:34 debug: [373] Retry #3 to run mcollective agent on nodes: '3'
2014-08-14T15:08:17 debug: [373] Retry #4 to run mcollective agent on nodes: '3'
2014-08-14T15:08:59 debug: [373] Retry #5 to run mcollective agent on nodes: '3'

2014-08-14T15:09:41 info: [373] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"8240d9a6-bff8-4243-bc3b-d6d016e606fd", "nodes"=>[{"uid"=>"3", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}}

2014-08-14T15:23:54 warning: [373] Disabling the upload of disk image because deploy ended with an error
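
For context, this is roughly the shape of the retry loop behind the "Retry #N" lines above. It is a minimal sketch only: run_puppetd_on and both constants are assumptions for illustration, not the actual fuel-astute code.

    # Minimal sketch of the retry behaviour seen in the log above.
    # run_puppetd_on and both constants are illustrative assumptions,
    # not the actual fuel-astute implementation.
    MC_RETRIES        = 5   # five "Retry #N" lines appear in the log
    MC_RETRY_INTERVAL = 40  # seconds, roughly the spacing between retries

    # Stub standing in for the real MCollective 'puppetd' call;
    # returns the uids of nodes that answered.
    def run_puppetd_on(uids)
      []
    end

    def run_with_retries(node_uids)
      pending = node_uids.dup
      MC_RETRIES.times do |i|
        break if pending.empty?
        puts "Retry ##{i + 1} to run mcollective agent on nodes: '#{pending.join(',')}'"
        pending -= run_puppetd_on(pending)
        sleep MC_RETRY_INTERVAL
      end
      # Nodes that never answered are reported back to Nailgun as deploy errors.
      pending.map { |uid| { 'uid' => uid, 'status' => 'error', 'error_type' => 'deploy' } }
    end

With all retries exhausted for node '3', the error hash is what gets cast to Nailgun in the deploy_resp message above.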

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Mcollective log from node-3

2014-08-14T15:08:10.520658+01:00 debug: E, [2014-08-14T14:07:09.387545 #1119] ERROR -- : rabbitmq.rb:30:in `on_miscerr' Unexpected error on connection stomp://mcollective@10.108.25.2:61613: es_oldrecv: receive failed: execution expired

Revision history for this message
Dima Shulyak (dshulyak) wrote :

It looks like the MCollective client in Astute cannot re-establish the connection.

I found a related thread for this bug in MCollective itself: https://projects.puppetlabs.com/issues/20347

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Reproduced on ISO #423, the same as in the ticket. Will try to reproduce using the fresh ISO #449.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Could not reproduce on ISO #449.

Dima Shulyak (dshulyak)
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on master ISO #461, system test: http://jenkins-product.srt.mirantis.net:8080/job/fuel_master.ubuntu.bvt_2/300/

/var/log/docker-logs/remote/node-2.test.domain.local/mcollective.log:

2014-08-20T15:29:25.798212+01:00 debug: E, [2014-08-20T14:28:51.107139 #1179] ERROR -- : rabbitmq.rb:30:in `on_miscerr' Unexpected error on connection stomp://mcollective@10.108.25.2:61613: es_oldrecv: receive failed: execution expired

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The same error again:

2014-08-20T15:30:37 debug: [417] Retry #1 to run mcollective agent on nodes: '5'
2014-08-20T15:31:20 debug: [417] Retry #2 to run mcollective agent on nodes: '5'
2014-08-20T15:32:02 debug: [417] Retry #3 to run mcollective agent on nodes: '5'
2014-08-20T15:32:45 debug: [417] Retry #4 to run mcollective agent on nodes: '5'
2014-08-20T15:33:28 debug: [417] Retry #5 to run mcollective agent on nodes: '5'
2014-08-20T15:34:10 debug: [417] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"5", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}

I have asked the DevOps team to look at the node load statistics for that time window. The main idea: the host system was overloaded at that moment.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/115922

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The host node was not heavily loaded in the last case.

We still believe that this problem is caused by node overload.

Why?

If we examine the mcollective log on the problem node, we can see a timeout problem in https://github.com/stackforge/fuel-astute/blob/master/mcagents/puppetd.rb#L261

2014-08-20T15:34:14.490127+01:00 debug: E, [2014-08-20T14:34:09.497417 #1143] ERROR -- : agent.rb:112:in `handlemsg' /usr/lib/ruby/1.8/timeout.rb:64:in `puppet_pid'
2014-08-20T15:34:14.490127+01:00 debug: /usr/share/mcollective/plugins/mcollective/agent/puppetd.rb:122:in `puppet_daemon_status'
2014-08-20T15:34:14.490127+01:00 debug: /usr/share/mcollective/plugins/mcollective/agent/puppetd.rb:100:in `set_status'
2014-08-20T15:34:14.490127+01:00 debug: /usr/share/mcollective/plugins/mcollective/agent/puppetd.rb:53:in `last_run_summary_action'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/rpc/agent.rb:88:in `send'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/rpc/agent.rb:88:in `handlemsg'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:126:in `dispatch'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/1.8/timeout.rb:67:in `timeout'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:125:in `dispatch'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:121:in `initialize'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:121:in `new'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:121:in `dispatch'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:84:in `agentmsg'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:58:in `run'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:53:in `loop'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:53:in `run'
2014-08-20T15:34:14.490127+01:00 debug: /usr/sbin/mcollectived:52
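
In other words, MCollective dispatches the agent action inside a Ruby timeout, and on an overloaded node the puppet status check does not finish before that timeout fires, which is what surfaces as the handlemsg/timeout.rb trace above. A hedged sketch of that failure shape (illustrative only, with an assumed timeout value, not the actual puppetd.rb code):

    require 'timeout'

    # Illustrative sketch of the failure mode, not the actual mcollective agent code.
    AGENT_TIMEOUT = 10  # seconds the dispatcher allows an agent action (assumed value)

    def puppet_daemon_status
      # On an overloaded node, checking the puppet pidfile/lockfile and process
      # state can take far longer than the agent timeout.
      sleep 60            # stand-in for a slow status check under heavy load
      :idling
    end

    begin
      Timeout.timeout(AGENT_TIMEOUT) { puppet_daemon_status }
    rescue Timeout::Error
      # This is what the node logs as "execution expired" / handlemsg errors.
      warn 'puppetd agent action timed out'
    end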

What can we do?

- increase the MCollective retries for puppet queries (Astute);
- decrease the frequency of puppet state reports via MCollective (Astute);
- increase the Ubuntu node memory limits in CI (DevOps).

The first two changes are sketched below.
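
A minimal, hypothetical illustration of those two Astute-side tunables; the option names (mc_retries, puppet_report_interval) are assumed for the example and are not verified fuel-astute configuration keys.

    # Hypothetical illustration only: these keys are assumed names,
    # not verified fuel-astute configuration options.
    astute_overrides = {
      :mc_retries             => 10,  # more MCollective retries for puppet queries
      :puppet_report_interval => 30   # seconds between puppet status polls, i.e. fewer reports
    }
    puts astute_overrides.inspect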

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/115922
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=a50eadf804d82cbbf7af1b9744c07ecdfda2fec8
Submitter: Jenkins
Branch: master

commit a50eadf804d82cbbf7af1b9744c07ecdfda2fec8
Author: Vladimir Sharshov <email address hidden>
Date: Thu Aug 21 13:13:32 2014 +0400

    Fix problem 'MCollective agents didn't respond within the allotted time'

    - increase mcollective retries for puppet queries;
    - decrease frequency of puppet state report via mcollective.

    Change-Id: I61302e33deb68ace8e5f1d7980a099a2c300b215
    Co-Authored-By: Dima Shulyak <email address hidden>
    Closes-Bug: #1356954

Changed in fuel:
status: In Progress → Fix Committed
tags: added: astute
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

A similar problem: puppet hung and Astute marked the deployment as failed.
Cluster: Ubuntu, HA, Neutron GRE, 3 controllers, 2 compute nodes, 1 cinder

2014-10-27 02:49:36 INFO [421] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"8af7140e-4a10-49f2-89e9-f31543a3c7f3", "nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}}
2014-10-27 02:49:36 DEBUG [421] Data send by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}
2014-10-27 02:49:36 DEBUG [421] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}, {"uid"=>"3", "progress"=>0, "status"=>"deploying", "role"=>"controller"}]}
2014-10-27 02:49:36 DEBUG [421] Got progress for nodes: [{"uid"=>"3", "progress"=>0}]
2014-10-27 02:49:36 DEBUG [421] Node "2" has failed to deploy. There is no more retries for puppet run.
2014-10-27 02:49:36 DEBUG [421] Nodes statuses: {"succeed"=>[], "error"=>["2"], "running"=>["3"]}
2014-10-27 02:49:36 DEBUG [421] 8af7140e-4a10-49f2-89e9-f31543a3c7f3: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:output=>"Currently idling; last completed run 1414378176 seconds ago", :status=>"idling", :runtime=>1414378176, :changes=>nil, :enabled=>1, :events=>nil, :version=>nil, :idling=>1, :time=>{"last_run"=>0}, :stopped=>0, :running=>0, :resources=>{"total"=>0, "failed"=>1, "restarted"=>0, "out_of_sync"=>0, "changed"=>0}, :lastrun=>0}}
2014-10-27 02:49:36 DEBUG [421] 8af7140e-4a10-49f2-89e9-f31543a3c7f3: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"3", :statuscode=>0, :statusmsg=>"OK", :data=>{:idling=>0, :status=>"running", :runtime=>1414378176, :stopped=>0, :resources=>{"out_of_sync"=>0, "restarted"=>0, "failed"=>1, "total"=>0, "changed"=>0}, :lastrun=>0, :version=>nil, :output=>"Currently running; last completed run 1414378176 seconds ago", :time=>{"last_run"=>0}, :changes=>nil, :running=>1, :enabled=>1, :events=>nil}}
2014-10-27 02:49:32 WARNING [421] Following nodes have puppet hung: '2'
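
For reference, the "puppet hung" warning above is derived from the last_run_summary polls. A hedged sketch of the kind of check involved, where the field names come from the results quoted above but the classification logic itself is an assumption:

    # Hedged sketch: field names come from the last_run_summary output above,
    # the classification logic itself is an assumption.
    def puppet_looks_hung?(result)
      data = result[:data]
      data[:status] != 'running' &&            # node reports idling/stopped
        data[:time]['last_run'].to_i == 0 &&   # yet no run ever completed
        data[:resources]['failed'].to_i > 0    # and resources are left in a failed state
    end

    node2 = { :sender => '2',
              :data   => { :status    => 'idling',
                           :time      => { 'last_run' => 0 },
                           :resources => { 'failed' => 1 } } }

    puts "Following nodes have puppet hung: '#{node2[:sender]}'" if puppet_looks_hung?(node2)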

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Andrey, please do not resurrect old unrelated bugs without proper investigation. It's better to raise a new bug and later find out that it is a duplicate than to pollute old closed bugs with irrelevant comments that confuse the hell out of users looking for a solution for their problem.
