astute tries to apply second role before first role finished applying

Bug #1260077 reported by Dmitry Borodaenko
Affects: Fuel for OpenStack | Status: New | Importance: Critical | Assigned to: Vladimir Sharshov

Bug Description

ISO #126. Cluster configuration: simple centos nova-network, 1x controller+ceph-osd, 1x compute+ceph-osd.

Halfway through applying the ceph-osd role on the compute node, astute decided that ceph-osd was finished (even though it kept running for 2 more minutes) and moved on to the compute role (without actually starting it). When the ceph-osd run finally finished, astute concluded that it was the compute role that had finished, and marked the whole node as 100% completed.

The problem is most likely related to: https://bugs.launchpad.net/fuel/+bug/1258358

Log analysis:

Orchestrator starts applying ceph-osd role on node-2:
2013-12-11T20:26:34 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'runonce', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"stopped", :enabled=>1, :idling=>0, :running=>0, :runtime=>1386793594, :stopped=>1, :output=>"Called /usr/sbin/daemonize -a -e /var/log/puppet/puppet.err -o /var/log/puppet/puppet.log -l /tmp/puppetd.lock -p /tmp/puppetd.lock /usr/bin/puppet apply /etc/puppet/manifests/site.pp --logdest syslog --trace, Currently stopped; last completed run 1386793594 seconds ago", :lastrun=>0}}

Puppet starts logging to daemon.log on node-2:
<29>Dec 11 20:26:43 node-2 puppet-apply[3067]: (/Stage[first]/Rsyslog::Service/Service[rsyslog]) Triggered 'refresh' from 1 events
<29>Dec 11 20:26:43 node-2 puppet-apply[3067]: (/Stage[netconfig]/L23network::Hosts_file/Host[node-2]/ensure) created
<29>Dec 11 20:27:29 node-2 puppet-apply[3067]: (/Stage[netconfig]/L23network::L2/Package[kmod-openvswitch]/ensure) created

Orchestrator decides that ceph-osd is completed (while it's still running the netconfig stage for ceph-osd!):
2013-12-11T20:26:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"1", :statuscode=>0, :statusmsg=>"OK", :data=>{:output=>"Currently stopped; last completed run 1 seconds ago", :status=>"stopped", :enabled=>1, :time=>{"filebucket"=>0.000183, "last_run"=>1386793618, "ceph_conf"=>0.008109, "exec"=>5.419706, "total"=>12.369046, "firewall"=>0.240772, "notify"=>0.025329, "package"=>0.007653, "config_retrieval"=>5.834052, "l3_if_downup"=>0.002296, "anchor"=>0.000381, "install_ssh_keys"=>0.004937, "service"=>0.718725, "file"=>0.104666, "cfg"=>0.001187, "host"=>0.00105}, :running=>0, :err_msg=>"Process not running but not empty lockfile is present. Trying to remove lockfile...ok.", :idling=>0, :runtime=>1, :stopped=>1, :resources=>{"failed_to_restart"=>0, "skipped"=>6, "failed"=>0, "scheduled"=>0, "total"=>173, "out_of_sync"=>21, "changed"=>21, "restarted"=>1}, :changes=>{"total"=>21}, :version=>{"config"=>1386793596, "puppet"=>"2.7.23"}, :lastrun=>1386793618, :events=>{"total"=>21, "failure"=>0, "success"=>21}}}
2013-12-11T20:26:59 debug: [8930] Nodes statuses: {"succeed"=>["1"], "error"=>[], "running"=>[]}
2013-12-11T20:26:59 debug: [8930] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"1", "status"=>"ready", "role"=>"ceph-osd"}]}
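The misclassification above can be illustrated with a minimal sketch (hypothetical names, not Astute's actual code): if a node counts as succeeded whenever puppet reports `stopped` with a recent `lastrun`, then a summary captured during the lockfile race, where puppet momentarily looks stopped while `lastrun` was just refreshed, passes the check even though the role is still mid-run.

```ruby
# Illustrative sketch only; node_status and its inputs are hypothetical,
# not the real Astute API.
def node_status(summary, deploy_started_at)
  return :running if summary[:running] == 1
  if summary[:stopped] == 1 && summary[:lastrun] >= deploy_started_at
    :succeed # assumes lastrun belongs to *this* role's run -- the flawed assumption
  else
    :waiting # runonce requested, puppet not started yet
  end
end

deploy_started_at = 1_386_793_590

# A summary like the one logged at 20:26:59: puppet briefly looks stopped
# (lockfile being cleaned up) and lastrun is fresh:
stale = { running: 0, stopped: 1, lastrun: 1_386_793_618 }
puts node_status(stale, deploy_started_at) # prints "succeed" while the role is still applying
```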

Orchestrator uploads updated astute.yaml for the compute role:
2013-12-11T20:26:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'uploadfile', method 'upload', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:msg=>"File was uploaded!"}}
-rw------- 1 root root 3617 Dec 11 20:26 /etc/astute.yaml

Orchestrator decides not to run 'daemonize puppet apply' because it's already running (of course, it's not done with ceph-osd yet):
2013-12-11T20:26:59 debug: [8930] Waiting for puppet to finish deployment on all nodes (timeout = 3600 sec)...
2013-12-11T20:26:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793619 seconds ago", :runtime=>1386793619, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}

And then, for no good reason, it gives up on the puppet instance that is still running and declares that it is now running the compute role (even though it never told MCollective to start a new puppet apply process):
2013-12-11T20:27:58 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793678 seconds ago", :runtime=>1386793678, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}
2013-12-11T20:27:59 debug: [8930] puppetd_runonce completed within 60 seconds.
2013-12-11T20:27:59 debug: [8930] Following nodes have puppet hung: '2'
2013-12-11T20:27:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793679 seconds ago", :runtime=>1386793679, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}
2013-12-11T20:27:59 debug: [8930] Nodes statuses: {"succeed"=>[], "error"=>[], "running"=>["2"]}
2013-12-11T20:27:59 debug: [8930] Got progress for nodes: [{"uid"=>"2", "progress"=>0}]
2013-12-11T20:27:59 debug: [8930] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "progress"=>0, "status"=>"deploying", "role"=>"compute"}]}
2013-12-11T20:28:01 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793681 seconds ago", :runtime=>1386793681, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}
2013-12-11T20:28:01 debug: [8930] Nodes statuses: {"succeed"=>[], "error"=>[], "running"=>["2"]}
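Part of why the orchestrator cannot tell these runs apart: the `last_run_summary` payload from the still-running ceph-osd process is indistinguishable from what a freshly daemonized compute run would report right after start (`running=1`, `lastrun=0`, no version). A tiny sketch of that observation (illustrative, not Astute's code):

```ruby
# Observable state of the OLD puppet process (still applying ceph-osd)
# versus what a NEW compute-role process would report right after start.
old_run = { status: "running", running: 1, stopped: 0, lastrun: 0, version: nil }
new_run = { status: "running", running: 1, stopped: 0, lastrun: 0, version: nil }

# Nothing in the summary distinguishes them, so polling it cannot confirm
# that the runonce request actually spawned a new process:
puts old_run == new_run # prints "true"
```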

Despite what the orchestrator thought, puppet actually finishes applying the ceph-osd role:
<29>Dec 11 20:28:46 node-2 puppet-apply[3067]: (/Stage[main]/Ceph/Service[ceph]) Triggered 'refresh' from 1 events
<29>Dec 11 20:28:46 node-2 puppet-apply[3067]: Finished catalog run in 124.51 seconds

By that time, the orchestrator believes it was the compute role that was running:
2013-12-11T20:28:47 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"stopped", :err_msg=>"Process not running but not empty lockfile is present. Trying to remove lockfile...ok.", :enabled=>1, :time=>{"filebucket"=>0.000219, "config_retrieval"=>5.978734, "l3_if_downup"=>0.002556, "install_ssh_keys"=>0.003044, "cfg"=>0.018418, "total"=>78.579702, "exec"=>6.548802, "host"=>0.007485, "service"=>0.607738, "firewall"=>5.541325, "package"=>59.656183, "anchor"=>0.000483, "notify"=>0.031705, "file"=>0.18301, "last_run"=>1386793726}, :idling=>0, :running=>0, :changes=>{"total"=>116}, :output=>"Currently stopped; last completed run 1 seconds ago", :runtime=>1, :stopped=>1, :resources=>{"failed"=>0, "total"=>166, "skipped"=>6, "changed"=>114, "failed_to_restart"=>0, "out_of_sync"=>114, "scheduled"=>0, "restarted"=>9}, :events=>{"success"=>116, "total"=>116, "failure"=>0}, :lastrun=>1386793726, :version=>{"config"=>1386793595, "puppet"=>"2.7.23"}}}
2013-12-11T20:28:47 debug: [8930] Nodes statuses: {"succeed"=>["2"], "error"=>[], "running"=>[]}
2013-12-11T20:28:47 debug: [8930] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"ready", "role"=>"compute"}]}

End result: the cluster is deployed "successfully", no Nova is deployed on the compute nodes, no errors are reported in Fuel, and everything looks fine until you try to launch an instance.
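One possible direction for a fix (a sketch of a guard, not whatever patch may actually land): record `lastrun` before issuing runonce, and only count a run as finished when puppet is stopped and `lastrun` has advanced past that baseline. Note that even this is fooled if the next role's runonce is issued while the previous run is still going, so the ordering matters too. Names below are hypothetical.

```ruby
# Hypothetical guard: a run is only "finished" if puppet is stopped AND
# lastrun advanced past the baseline taken before runonce was issued.
def run_finished?(summary, baseline_lastrun)
  summary[:stopped] == 1 && summary[:lastrun] > baseline_lastrun
end

# If the compute runonce is requested while ceph-osd is still applying,
# the ceph-osd completion at 20:28:46 advances lastrun and satisfies the
# check for the "compute" run -- exactly the confusion seen in this bug:
baseline  = 1_386_793_618                     # lastrun before the compute runonce
ceph_done = { stopped: 1, lastrun: 1_386_793_726 }
puts run_finished?(ceph_done, baseline)       # prints "true", but it was ceph-osd finishing

# Hence the orchestrator must first observe the previous run reach the
# stopped state before requesting the next role's runonce.
```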

Tags: astute