Successful deployment marked as failed by Astute

Bug #1356954 reported by Vladimir Kuklin
This bug affects 7 people
Affects              Status         Importance  Assigned to          Milestone
Fuel for OpenStack   Fix Committed  Critical    Vladimir Sharshov
5.1.x                Invalid        High        Andrey Sledzinskiy

Bug Description

Deployed a nova HA VLAN Ubuntu cluster. The successful deployment is marked as failed by Astute, although all nodes are in the ready state and there are no errors in the puppet logs.

Tags: astute
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Did you check the Astute logs? If Astute failed to upload the Cirros image, it would fail the deployment.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Node 3 with the controller role failed because Astute lost access to this node via MCollective.

2014-08-14T15:06:09 debug: [373] Retry #1 to run mcollective agent on nodes: '3'
2014-08-14T15:06:51 debug: [373] Retry #2 to run mcollective agent on nodes: '3'
2014-08-14T15:07:34 debug: [373] Retry #3 to run mcollective agent on nodes: '3'
2014-08-14T15:08:17 debug: [373] Retry #4 to run mcollective agent on nodes: '3'
2014-08-14T15:08:59 debug: [373] Retry #5 to run mcollective agent on nodes: '3'

2014-08-14T15:09:41 info: [373] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"8240d9a6-bff8-4243-bc3b-d6d016e606fd", "nodes"=>[{"uid"=>"3", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}}

2014-08-14T15:23:54 warning: [373] Disabling the upload of disk image because deploy ended with an error
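
For context, this is roughly the shape of the retry loop behind the "Retry #N" lines above. It is a minimal sketch only: run_puppetd_on and both constants are assumptions for illustration, not the actual fuel-astute code.

    # Minimal sketch of the retry behaviour seen in the log above.
    # run_puppetd_on and both constants are illustrative assumptions,
    # not the actual fuel-astute implementation.
    MC_RETRIES        = 5   # five "Retry #N" lines appear in the log
    MC_RETRY_INTERVAL = 40  # seconds, roughly the spacing between retries

    # Stub standing in for the real MCollective 'puppetd' call;
    # returns the uids of nodes that answered.
    def run_puppetd_on(uids)
      []
    end

    def run_with_retries(node_uids)
      pending = node_uids.dup
      MC_RETRIES.times do |i|
        break if pending.empty?
        puts "Retry ##{i + 1} to run mcollective agent on nodes: '#{pending.join(',')}'"
        pending -= run_puppetd_on(pending)
        sleep MC_RETRY_INTERVAL
      end
      # Nodes that never answered are reported back to Nailgun as deploy errors.
      pending.map { |uid| { 'uid' => uid, 'status' => 'error', 'error_type' => 'deploy' } }
    end

With all retries exhausted for node '3', the error hash is what gets cast to Nailgun in the deploy_resp message above.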

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Mcollective log from node-3

2014-08-14T15:08:10.520658+01:00 debug: E, [2014-08-14T14:07:09.387545 #1119] ERROR -- : rabbitmq.rb:30:in `on_miscerr' Unexpected error on connection stomp://mcollective@10.108.25.2:61613: es_oldrecv: receive failed: execution expired

Revision history for this message
Dima Shulyak (dshulyak) wrote :

It looks like the MCollective client in Astute cannot re-establish the connection.

I found a related thread for this bug in MCollective itself: https://projects.puppetlabs.com/issues/20347

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Reproduced on ISO #423, the same as in the ticket. Will try to reproduce using the fresh ISO #449.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Could not reproduce on ISO #449.

Dima Shulyak (dshulyak)
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on master ISO #461, system test: http://jenkins-product.srt.mirantis.net:8080/job/fuel_master.ubuntu.bvt_2/300/

/var/log/docker-logs/remote/node-2.test.domain.local/mcollective.log:

2014-08-20T15:29:25.798212+01:00 debug: E, [2014-08-20T14:28:51.107139 #1179] ERROR -- : rabbitmq.rb:30:in `on_miscerr' Unexpected error on connection stomp://mcollective@10.108.25.2:61613: es_oldrecv: receive failed: execution expired

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The same error again:

2014-08-20T15:30:37 debug: [417] Retry #1 to run mcollective agent on nodes: '5'
2014-08-20T15:31:20 debug: [417] Retry #2 to run mcollective agent on nodes: '5'
2014-08-20T15:32:02 debug: [417] Retry #3 to run mcollective agent on nodes: '5'
2014-08-20T15:32:45 debug: [417] Retry #4 to run mcollective agent on nodes: '5'
2014-08-20T15:33:28 debug: [417] Retry #5 to run mcollective agent on nodes: '5'
2014-08-20T15:34:10 debug: [417] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"5", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}

I have asked the DevOps team to look at the node load statistics for that time window. The main idea: the host system was overloaded at that moment.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/115922

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The host node was not heavily loaded in the last case.

We still believe that this problem is caused by node overload.

Why?

If we examine the mcollective log on the problem node, we can see a timeout problem in https://github.com/stackforge/fuel-astute/blob/master/mcagents/puppetd.rb#L261

2014-08-20T15:34:14.490127+01:00 debug: E, [2014-08-20T14:34:09.497417 #1143] ERROR -- : agent.rb:112:in `handlemsg' /usr/lib/ruby/1.8/timeout.rb:64:in `puppet_pid'
2014-08-20T15:34:14.490127+01:00 debug: /usr/share/mcollective/plugins/mcollective/agent/puppetd.rb:122:in `puppet_daemon_status'
2014-08-20T15:34:14.490127+01:00 debug: /usr/share/mcollective/plugins/mcollective/agent/puppetd.rb:100:in `set_status'
2014-08-20T15:34:14.490127+01:00 debug: /usr/share/mcollective/plugins/mcollective/agent/puppetd.rb:53:in `last_run_summary_action'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/rpc/agent.rb:88:in `send'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/rpc/agent.rb:88:in `handlemsg'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:126:in `dispatch'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/1.8/timeout.rb:67:in `timeout'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:125:in `dispatch'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:121:in `initialize'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:121:in `new'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:121:in `dispatch'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:84:in `agentmsg'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:58:in `run'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:53:in `loop'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:53:in `run'
2014-08-20T15:34:14.490127+01:00 debug: /usr/sbin/mcollectived:52
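
In other words, MCollective dispatches the agent action inside a Ruby timeout, and on an overloaded node the puppet status check does not finish before that timeout fires, which is what surfaces as the handlemsg/timeout.rb trace above. A hedged sketch of that failure shape (illustrative only, with an assumed timeout value, not the actual puppetd.rb code):

    require 'timeout'

    # Illustrative sketch of the failure mode, not the actual mcollective agent code.
    AGENT_TIMEOUT = 10  # seconds the dispatcher allows an agent action (assumed value)

    def puppet_daemon_status
      # On an overloaded node, checking the puppet pidfile/lockfile and process
      # state can take far longer than the agent timeout.
      sleep 60            # stand-in for a slow status check under heavy load
      :idling
    end

    begin
      Timeout.timeout(AGENT_TIMEOUT) { puppet_daemon_status }
    rescue Timeout::Error
      # This is what the node logs as "execution expired" / handlemsg errors.
      warn 'puppetd agent action timed out'
    end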

What can we do?

- increase the MCollective retries for puppet queries (Astute);
- decrease the frequency of puppet state reports via MCollective (Astute);
- increase the Ubuntu node memory limits in CI (DevOps).

The first two changes are sketched below.
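
A minimal, hypothetical illustration of those two Astute-side tunables; the option names (mc_retries, puppet_report_interval) are assumed for the example and are not verified fuel-astute configuration keys.

    # Hypothetical illustration only: these keys are assumed names,
    # not verified fuel-astute configuration options.
    astute_overrides = {
      :mc_retries             => 10,  # more MCollective retries for puppet queries
      :puppet_report_interval => 30   # seconds between puppet status polls, i.e. fewer reports
    }
    puts astute_overrides.inspect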

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/115922
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=a50eadf804d82cbbf7af1b9744c07ecdfda2fec8
Submitter: Jenkins
Branch: master

commit a50eadf804d82cbbf7af1b9744c07ecdfda2fec8
Author: Vladimir Sharshov <email address hidden>
Date: Thu Aug 21 13:13:32 2014 +0400

    Fix problem 'MCollective agents didn't respond within the allotted time'

    - increase mcollective retries for puppet queries;
    - decrease frequency of puppet state report via mcollective.

    Change-Id: I61302e33deb68ace8e5f1d7980a099a2c300b215
    Co-Authored-By: Dima Shulyak <email address hidden>
    Closes-Bug: #1356954

Changed in fuel:
status: In Progress → Fix Committed
tags: added: astute
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

A similar problem: puppet hung and Astute marked the deployment as failed.
Cluster: Ubuntu, HA, Neutron GRE, 3 controllers, 2 compute nodes, 1 cinder

2014-10-27 02:49:36 INFO [421] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"8af7140e-4a10-49f2-89e9-f31543a3c7f3", "nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}}
2014-10-27 02:49:36 DEBUG [421] Data send by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}
2014-10-27 02:49:36 DEBUG [421] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}, {"uid"=>"3", "progress"=>0, "status"=>"deploying", "role"=>"controller"}]}
2014-10-27 02:49:36 DEBUG [421] Got progress for nodes: [{"uid"=>"3", "progress"=>0}]
2014-10-27 02:49:36 DEBUG [421] Node "2" has failed to deploy. There is no more retries for puppet run.
2014-10-27 02:49:36 DEBUG [421] Nodes statuses: {"succeed"=>[], "error"=>["2"], "running"=>["3"]}
2014-10-27 02:49:36 DEBUG [421] 8af7140e-4a10-49f2-89e9-f31543a3c7f3: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:output=>"Currently idling; last completed run 1414378176 seconds ago", :status=>"idling", :runtime=>1414378176, :changes=>nil, :enabled=>1, :events=>nil, :version=>nil, :idling=>1, :time=>{"last_run"=>0}, :stopped=>0, :running=>0, :resources=>{"total"=>0, "failed"=>1, "restarted"=>0, "out_of_sync"=>0, "changed"=>0}, :lastrun=>0}}
2014-10-27 02:49:36 DEBUG [421] 8af7140e-4a10-49f2-89e9-f31543a3c7f3: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"3", :statuscode=>0, :statusmsg=>"OK", :data=>{:idling=>0, :status=>"running", :runtime=>1414378176, :stopped=>0, :resources=>{"out_of_sync"=>0, "restarted"=>0, "failed"=>1, "total"=>0, "changed"=>0}, :lastrun=>0, :version=>nil, :output=>"Currently running; last completed run 1414378176 seconds ago", :time=>{"last_run"=>0}, :changes=>nil, :running=>1, :enabled=>1, :events=>nil}}
2014-10-27 02:49:32 WARNING [421] Following nodes have puppet hung: '2'
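
For reference, the "puppet hung" warning above is derived from the last_run_summary polls. A hedged sketch of the kind of check involved, where the field names come from the results quoted above but the classification logic itself is an assumption:

    # Hedged sketch: field names come from the last_run_summary output above,
    # the classification logic itself is an assumption.
    def puppet_looks_hung?(result)
      data = result[:data]
      data[:status] != 'running' &&            # node reports idling/stopped
        data[:time]['last_run'].to_i == 0 &&   # yet no run ever completed
        data[:resources]['failed'].to_i > 0    # and resources are left in a failed state
    end

    node2 = { :sender => '2',
              :data   => { :status    => 'idling',
                           :time      => { 'last_run' => 0 },
                           :resources => { 'failed' => 1 } } }

    puts "Following nodes have puppet hung: '#{node2[:sender]}'" if puppet_looks_hung?(node2)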

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Andrey, please do not resurrect old unrelated bugs without proper investigation. It's better to raise a new bug and later find out that it is a duplicate than to pollute old closed bugs with irrelevant comments that confuse the hell out of users looking for a solution for their problem.
