Successful deployment marked as failed by Astute

Bug #1356954 reported by Vladimir Kuklin on 2014-08-14
This bug affects 7 people
Affects              Status    Importance   Assigned to           Milestone
Fuel for OpenStack             Critical     Vladimir Sharshov
5.1.x                          High         Andrey Sledzinskiy

Bug Description

Deployed a Nova HA VLAN Ubuntu cluster. The successful deployment is marked as failed by Astute, although all nodes are in the ready state and there are no errors in the Puppet logs.

Mike Scherbakov (mihgen) wrote :

Did you check the Astute logs? If Astute failed to upload the Cirros image, it would mark the deployment as failed.

Vladimir Sharshov (vsharshov) wrote :

Node 3 with the controller role failed because Astute lost access to it via MCollective (a sketch of this retry-then-fail flow follows the log excerpt below).

2014-08-14T15:06:09 debug: [373] Retry #1 to run mcollective agent on nodes: '3'
2014-08-14T15:06:51 debug: [373] Retry #2 to run mcollective agent on nodes: '3'
2014-08-14T15:07:34 debug: [373] Retry #3 to run mcollective agent on nodes: '3'
2014-08-14T15:08:17 debug: [373] Retry #4 to run mcollective agent on nodes: '3'
2014-08-14T15:08:59 debug: [373] Retry #5 to run mcollective agent on nodes: '3'

2014-08-14T15:09:41 info: [373] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"8240d9a6-bff8-4243-bc3b-d6d016e606fd", "nodes"=>[{"uid"=>"3", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}}

2014-08-14T15:23:54 warning: [373] Disabling the upload of disk image because deploy ended with an error
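
For context, a minimal runnable sketch of the retry-then-fail flow that produces log lines like the ones above. The method names, the stub RPC call and the reporting format are illustrative assumptions, not Astute's actual API.

require 'logger'

LOG = Logger.new($stdout)
MAX_RETRIES = 5

# Stub standing in for the MCollective RPC call; returns the uids that did not reply.
def call_mcollective_agent(agent, node_uids)
  node_uids.select { rand > 0.5 }
end

def run_puppet_on(node_uids)
  pending = call_mcollective_agent('puppetd', node_uids)
  MAX_RETRIES.times do |attempt|
    break if pending.empty?
    LOG.debug "Retry ##{attempt + 1} to run mcollective agent on nodes: '#{pending.join(',')}'"
    pending = call_mcollective_agent('puppetd', pending)
  end
  # Nodes that never replied are reported to Nailgun as deploy errors,
  # which is how an otherwise successful deployment ends up marked as failed.
  pending.map { |uid| { 'uid' => uid, 'status' => 'error', 'error_type' => 'deploy', 'role' => 'controller' } }
end

p run_puppet_on(['3'])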

Dima Shulyak (dshulyak) wrote :

Mcollective log from node-3

2014-08-14T15:08:10.520658+01:00 debug: E, [2014-08-14T14:07:09.387545 #1119] ERROR -- : rabbitmq.rb:30:in `on_miscerr' Unexpected error on connection stomp://mcollective@10.108.25.2:61613: es_oldrecv: receive failed: execution expired

Dima Shulyak (dshulyak) wrote :

It looks like the MCollective client in Astute cannot re-establish the connection.

I found a related issue for this bug in MCollective itself: https://projects.puppetlabs.com/issues/20347

Vladimir Sharshov (vsharshov) wrote :

Reproduced on ISO #423, the same as in the ticket. Will try to reproduce using fresh ISO #449.

Changed in fuel:
status: Confirmed → Incomplete
Vladimir Sharshov (vsharshov) wrote :

Could not reproduce on ISO #449.

Dima Shulyak (dshulyak) on 2014-08-20
Changed in fuel:
status: Incomplete → Confirmed
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on master ISO #461, system test: http://jenkins-product.srt.mirantis.net:8080/job/fuel_master.ubuntu.bvt_2/300/

/var/log/docker-logs/remote/node-2.test.domain.local/mcollective.log:

2014-08-20T15:29:25.798212+01:00 debug: E, [2014-08-20T14:28:51.107139 #1179] ERROR -- : rabbitmq.rb:30:in `on_miscerr' Unexpected error on connection stomp://mcollective@10.108.25.2:61613: es_oldrecv: receive failed: execution expired

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
Vladimir Sharshov (vsharshov) wrote :

The same error again.

2014-08-20T15:30:37 debug: [417] Retry #1 to run mcollective agent on nodes: '5'
2014-08-20T15:31:20 debug: [417] Retry #2 to run mcollective agent on nodes: '5'
2014-08-20T15:32:02 debug: [417] Retry #3 to run mcollective agent on nodes: '5'
2014-08-20T15:32:45 debug: [417] Retry #4 to run mcollective agent on nodes: '5'
2014-08-20T15:33:28 debug: [417] Retry #5 to run mcollective agent on nodes: '5'
2014-08-20T15:34:10 debug: [417] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"5", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}

I have asked the DevOps team to look at the node load statistics for that time. The main idea: the host system was overloaded at that moment.

Fix proposed to branch: master
Review: https://review.openstack.org/115922

Changed in fuel:
status: Confirmed → In Progress
Vladimir Sharshov (vsharshov) wrote :

The host node was not heavily loaded in the last case.

We still believe that this problem is caused by node overload.

Why?

If we look at the MCollective log on the problem node, we can see a timeout problem in https://github.com/stackforge/fuel-astute/blob/master/mcagents/puppetd.rb#L261 (a sketch of this check follows the trace below).

2014-08-20T15:34:14.490127+01:00 debug: E, [2014-08-20T14:34:09.497417 #1143] ERROR -- : agent.rb:112:in `handlemsg' /usr/lib/ruby/1.8/timeout.rb:64:in `puppet_pid'
2014-08-20T15:34:14.490127+01:00 debug: /usr/share/mcollective/plugins/mcollective/agent/puppetd.rb:122:in `puppet_daemon_status'
2014-08-20T15:34:14.490127+01:00 debug: /usr/share/mcollective/plugins/mcollective/agent/puppetd.rb:100:in `set_status'
2014-08-20T15:34:14.490127+01:00 debug: /usr/share/mcollective/plugins/mcollective/agent/puppetd.rb:53:in `last_run_summary_action'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/rpc/agent.rb:88:in `send'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/rpc/agent.rb:88:in `handlemsg'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:126:in `dispatch'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/1.8/timeout.rb:67:in `timeout'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:125:in `dispatch'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:121:in `initialize'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:121:in `new'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:121:in `dispatch'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:84:in `agentmsg'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:58:in `run'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:53:in `loop'
2014-08-20T15:34:14.490127+01:00 debug: /usr/lib/ruby/vendor_ruby/mcollective/runner.rb:53:in `run'
2014-08-20T15:34:14.490127+01:00 debug: /usr/sbin/mcollectived:52
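
For illustration, a minimal sketch of the kind of timeout-guarded status check that can produce the trace above on an overloaded node. The pidfile path, the timeout value and the helper names are assumptions, not the actual puppetd.rb code.

require 'timeout'

PIDFILE = '/var/run/puppet/agent.pid'  # assumed path, for illustration only

# The pid check is wrapped in a short timeout; on a heavily loaded node even
# this trivial check can exceed it, and the raised Timeout::Error propagates
# up through set_status/last_run_summary_action exactly as in the trace above.
def puppet_pid
  Timeout.timeout(5) do
    pid = File.read(PIDFILE).to_i
    Process.kill(0, pid)  # probe the process; raises if it is gone
    pid
  end
rescue Errno::ESRCH, Errno::ENOENT
  nil
end

def puppet_daemon_status
  puppet_pid ? :running : :stopped
end

p puppet_daemon_status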

What can we do?

- increase MCollective retries for puppet queries (Astute);
- decrease the frequency of puppet state reports via MCollective (Astute);
- increase Ubuntu node memory limits in CI (DevOps).
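
For illustration only, the first two items roughly amount to tuning values like these; the key names below are assumptions, not necessarily the identifiers used in fuel-astute (the actual change is in the review linked below).

# Purely illustrative tuning for the first two items; key names are assumptions.
mcollective_tuning = {
  mc_retries: 10,                # more MCollective retries for puppet queries
  puppet_report_interval: 5      # seconds between puppet state polls, i.e. report less often
}
p mcollective_tuning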

Reviewed: https://review.openstack.org/115922
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=a50eadf804d82cbbf7af1b9744c07ecdfda2fec8
Submitter: Jenkins
Branch: master

commit a50eadf804d82cbbf7af1b9744c07ecdfda2fec8
Author: Vladimir Sharshov <email address hidden>
Date: Thu Aug 21 13:13:32 2014 +0400

    Fix problem 'MCollective agents didn't respond within the allotted time'

    - increase mcollective retries for puppet queries;
    - decrease frequency of puppet state report via mcollective.

    Change-Id: I61302e33deb68ace8e5f1d7980a099a2c300b215
    Co-Authored-By: Dima Shulyak <email address hidden>
    Closes-Bug: #1356954

Changed in fuel:
status: In Progress → Fix Committed
tags: added: astute

A similar problem: Puppet hung and Astute marked the deployment as failed.
Cluster: Ubuntu, HA, Neutron GRE, 3 controllers, 2 compute nodes, 1 Cinder

2014-10-27 02:49:36 INFO    [421] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"8af7140e-4a10-49f2-89e9-f31543a3c7f3", "nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}}
2014-10-27 02:49:36 DEBUG   [421] Data send by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}
2014-10-27 02:49:36 DEBUG   [421] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}, {"uid"=>"3", "progress"=>0, "status"=>"deploying", "role"=>"controller"}]}
2014-10-27 02:49:36 DEBUG   [421] Got progress for nodes: [{"uid"=>"3", "progress"=>0}]
2014-10-27 02:49:36 DEBUG   [421] Node "2" has failed to deploy. There is no more retries for puppet run.
2014-10-27 02:49:36 DEBUG   [421] Nodes statuses: {"succeed"=>[], "error"=>["2"], "running"=>["3"]}
2014-10-27 02:49:36 DEBUG   [421] 8af7140e-4a10-49f2-89e9-f31543a3c7f3: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:output=>"Currently idling; last completed run 1414378176 seconds ago", :status=>"idling", :runtime=>1414378176, :changes=>nil, :enabled=>1, :events=>nil, :version=>nil, :idling=>1, :time=>{"last_run"=>0}, :stopped=>0, :running=>0, :resources=>{"total"=>0, "failed"=>1, "restarted"=>0, "out_of_sync"=>0, "changed"=>0}, :lastrun=>0}}
2014-10-27 02:49:36 DEBUG   [421] 8af7140e-4a10-49f2-89e9-f31543a3c7f3: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"3", :statuscode=>0, :statusmsg=>"OK", :data=>{:idling=>0, :status=>"running", :runtime=>1414378176, :stopped=>0, :resources=>{"out_of_sync"=>0, "restarted"=>0, "failed"=>1, "total"=>0, "changed"=>0}, :lastrun=>0, :version=>nil, :output=>"Currently running; last completed run 1414378176 seconds ago", :time=>{"last_run"=>0}, :changes=>nil, :running=>1, :enabled=>1, :events=>nil}}
2014-10-27 02:49:32 WARNING [421] Following nodes have puppet hung: '2'

Logs are attached
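
For context on the "puppet hung" warning above, a hypothetical sketch of how such detection can work from last_run_summary results; the threshold, field handling and sample data are assumptions, not Astute's actual implementation.

PUPPET_FADE_TIMEOUT = 120  # seconds; illustrative value only

# A node whose agent still reports running/idling but whose last_run timestamp
# has not advanced since the deployment step started, for longer than the fade
# timeout, is treated as hung.
def hung_nodes(summaries, started_at, now = Time.now)
  summaries.select do |s|
    data  = s[:data]
    busy  = data[:running] == 1 || data[:idling] == 1
    stale = data[:time]['last_run'].to_i <= started_at.to_i
    busy && stale && (now - started_at) > PUPPET_FADE_TIMEOUT
  end.map { |s| s[:sender] }
end

# Example with data shaped like the results above (trimmed, not the real values):
summaries = [
  { :sender => '2', :data => { :idling => 1, :running => 0, :time => { 'last_run' => 0 } } },
  { :sender => '3', :data => { :idling => 0, :running => 1, :time => { 'last_run' => Time.now.to_i } } }
]
p hung_nodes(summaries, Time.now - 600)  # => ["2"]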

Dmitry Borodaenko (angdraug) wrote :

Andrey, please do not resurrect old unrelated bugs without proper investigation. It's better to raise a new bug and later find out that it is a duplicate than to pollute old closed bugs with irrelevant comments that confuse the hell out of users looking for a solution for their problem.
