Activity log for bug #1260077

2013-12-11 21:27:38  Dmitry Borodaenko  added bug

2013-12-11 21:27:38  Dmitry Borodaenko  attachment added: fuel-snapshot-2013-12-11_20-57-09.tgz
  https://bugs.launchpad.net/bugs/1260077/+attachment/3927772/+files/fuel-snapshot-2013-12-11_20-57-09.tgz
2013-12-11 21:30:38  Dmitry Borodaenko  description changed

Old value:

ISO #126. Cluster configuration: simple CentOS nova-network, 1x controller+ceph-osd, 1x compute+ceph-osd.

Halfway through applying the ceph-osd role on the compute node, astute decided that ceph-osd was finished (even though it kept running for 2 more minutes), updated astute.yaml for the compute role, and tried to run 'daemonize puppet apply'. daemonize probably detected the lock held by the still-running instance of puppet and refused to start. Still, when the ceph-osd role finally finished, astute decided that it was the compute role that had finished, and marked the whole node as 100% completed.

The problem is most likely related to: https://bugs.launchpad.net/fuel/+bug/1258358

Log analysis:

Orchestrator starts applying the ceph-osd role on node-2:

2013-12-11T20:26:34 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'runonce', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"stopped", :enabled=>1, :idling=>0, :running=>0, :runtime=>1386793594, :stopped=>1, :output=>"Called /usr/sbin/daemonize -a -e /var/log/puppet/puppet.err -o /var/log/puppet/puppet.log -l /tmp/puppetd.lock -p /tmp/puppetd.lock /usr/bin/puppet apply /etc/puppet/manifests/site.pp --logdest syslog --trace, Currently stopped; last completed run 1386793594 seconds ago", :lastrun=>0}}

Puppet starts logging to daemon.log on node-2:

<29>Dec 11 20:26:43 node-2 puppet-apply[3067]: (/Stage[first]/Rsyslog::Service/Service[rsyslog]) Triggered 'refresh' from 1 events
<29>Dec 11 20:26:43 node-2 puppet-apply[3067]: (/Stage[netconfig]/L23network::Hosts_file/Host[node-2]/ensure) created
<29>Dec 11 20:27:29 node-2 puppet-apply[3067]: (/Stage[netconfig]/L23network::L2/Package[kmod-openvswitch]/ensure) created

Orchestrator decides that ceph-osd is completed (while it is still running the netconfig stage for ceph-osd!):

2013-12-11T20:26:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"1", :statuscode=>0, :statusmsg=>"OK", :data=>{:output=>"Currently stopped; last completed run 1 seconds ago", :status=>"stopped", :enabled=>1, :time=>{"filebucket"=>0.000183, "last_run"=>1386793618, "ceph_conf"=>0.008109, "exec"=>5.419706, "total"=>12.369046, "firewall"=>0.240772, "notify"=>0.025329, "package"=>0.007653, "config_retrieval"=>5.834052, "l3_if_downup"=>0.002296, "anchor"=>0.000381, "install_ssh_keys"=>0.004937, "service"=>0.718725, "file"=>0.104666, "cfg"=>0.001187, "host"=>0.00105}, :running=>0, :err_msg=>"Process not running but not empty lockfile is present. Trying to remove lockfile...ok.", :idling=>0, :runtime=>1, :stopped=>1, :resources=>{"failed_to_restart"=>0, "skipped"=>6, "failed"=>0, "scheduled"=>0, "total"=>173, "out_of_sync"=>21, "changed"=>21, "restarted"=>1}, :changes=>{"total"=>21}, :version=>{"config"=>1386793596, "puppet"=>"2.7.23"}, :lastrun=>1386793618, :events=>{"total"=>21, "failure"=>0, "success"=>21}}}
2013-12-11T20:26:59 debug: [8930] Nodes statuses: {"succeed"=>["1"], "error"=>[], "running"=>[]}
2013-12-11T20:26:59 debug: [8930] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"1", "status"=>"ready", "role"=>"ceph-osd"}]}

Orchestrator uploads the updated astute.yaml for the compute role:

2013-12-11T20:26:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'uploadfile', method 'upload', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:msg=>"File was uploaded!"}}

-rw------- 1 root root 3617 Dec 11 20:26 /etc/astute.yaml

Orchestrator decides not to run 'daemonize puppet apply' because puppet is already running (of course it is: it is not done with ceph-osd yet):

2013-12-11T20:26:59 debug: [8930] Waiting for puppet to finish deployment on all nodes (timeout = 3600 sec)...
2013-12-11T20:26:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793619 seconds ago", :runtime=>1386793619, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}

Then, for no good reason, it gives up on the instance of puppet that is still running and declares that it is now running the compute role (even though it never told MC to start a new puppet apply process):

2013-12-11T20:27:58 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793678 seconds ago", :runtime=>1386793678, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}
2013-12-11T20:27:59 debug: [8930] puppetd_runonce completed within 60 seconds.
2013-12-11T20:27:59 debug: [8930] Following nodes have puppet hung: '2'
2013-12-11T20:27:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793679 seconds ago", :runtime=>1386793679, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}
2013-12-11T20:27:59 debug: [8930] Nodes statuses: {"succeed"=>[], "error"=>[], "running"=>["2"]}
2013-12-11T20:27:59 debug: [8930] Got progress for nodes: [{"uid"=>"2", "progress"=>0}]
2013-12-11T20:27:59 debug: [8930] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "progress"=>0, "status"=>"deploying", "role"=>"compute"}]}
2013-12-11T20:28:01 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793681 seconds ago", :runtime=>1386793681, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}
2013-12-11T20:28:01 debug: [8930] Nodes statuses: {"succeed"=>[], "error"=>[], "running"=>["2"]}

Despite what the orchestrator thought, puppet actually finishes applying the ceph-osd role:

<29>Dec 11 20:28:46 node-2 puppet-apply[3067]: (/Stage[main]/Ceph/Service[ceph]) Triggered 'refresh' from 1 events
<29>Dec 11 20:28:46 node-2 puppet-apply[3067]: Finished catalog run in 124.51 seconds

By that time, the orchestrator thinks it was the compute role that was running:

2013-12-11T20:28:47 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"stopped", :err_msg=>"Process not running but not empty lockfile is present. Trying to remove lockfile...ok.", :enabled=>1, :time=>{"filebucket"=>0.000219, "config_retrieval"=>5.978734, "l3_if_downup"=>0.002556, "install_ssh_keys"=>0.003044, "cfg"=>0.018418, "total"=>78.579702, "exec"=>6.548802, "host"=>0.007485, "service"=>0.607738, "firewall"=>5.541325, "package"=>59.656183, "anchor"=>0.000483, "notify"=>0.031705, "file"=>0.18301, "last_run"=>1386793726}, :idling=>0, :running=>0, :changes=>{"total"=>116}, :output=>"Currently stopped; last completed run 1 seconds ago", :runtime=>1, :stopped=>1, :resources=>{"failed"=>0, "total"=>166, "skipped"=>6, "changed"=>114, "failed_to_restart"=>0, "out_of_sync"=>114, "scheduled"=>0, "restarted"=>9}, :events=>{"success"=>116, "total"=>116, "failure"=>0}, :lastrun=>1386793726, :version=>{"config"=>1386793595, "puppet"=>"2.7.23"}}}
2013-12-11T20:28:47 debug: [8930] Nodes statuses: {"succeed"=>["2"], "error"=>[], "running"=>[]}
2013-12-11T20:28:47 debug: [8930] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"ready", "role"=>"compute"}]}

End result: the cluster is deployed "successfully", no Nova is deployed on the compute node, no errors are reported in Fuel, and everything looks fine until you try to launch an instance.

New value:

ISO #126. Cluster configuration: simple CentOS nova-network, 1x controller+ceph-osd, 1x compute+ceph-osd.
Halfway through applying the ceph-osd role on the compute node, astute decided that ceph-osd was finished (even though it kept running for 2 more minutes) and decided to start applying the compute role (but never actually started it). Still, when the ceph-osd role finally finished, astute decided that it was the compute role that had finished, and marked the whole node as 100% completed.

The problem is most likely related to: https://bugs.launchpad.net/fuel/+bug/1258358

Log analysis:

Orchestrator starts applying the ceph-osd role on node-2:

2013-12-11T20:26:34 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'runonce', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"stopped", :enabled=>1, :idling=>0, :running=>0, :runtime=>1386793594, :stopped=>1, :output=>"Called /usr/sbin/daemonize -a -e /var/log/puppet/puppet.err -o /var/log/puppet/puppet.log -l /tmp/puppetd.lock -p /tmp/puppetd.lock /usr/bin/puppet apply /etc/puppet/manifests/site.pp --logdest syslog --trace, Currently stopped; last completed run 1386793594 seconds ago", :lastrun=>0}}

Puppet starts logging to daemon.log on node-2:

<29>Dec 11 20:26:43 node-2 puppet-apply[3067]: (/Stage[first]/Rsyslog::Service/Service[rsyslog]) Triggered 'refresh' from 1 events
<29>Dec 11 20:26:43 node-2 puppet-apply[3067]: (/Stage[netconfig]/L23network::Hosts_file/Host[node-2]/ensure) created
<29>Dec 11 20:27:29 node-2 puppet-apply[3067]: (/Stage[netconfig]/L23network::L2/Package[kmod-openvswitch]/ensure) created

Orchestrator decides that ceph-osd is completed (while it is still running the netconfig stage for ceph-osd!):

2013-12-11T20:26:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"1", :statuscode=>0, :statusmsg=>"OK", :data=>{:output=>"Currently stopped; last completed run 1 seconds ago", :status=>"stopped", :enabled=>1, :time=>{"filebucket"=>0.000183, "last_run"=>1386793618, "ceph_conf"=>0.008109, "exec"=>5.419706, "total"=>12.369046, "firewall"=>0.240772, "notify"=>0.025329, "package"=>0.007653, "config_retrieval"=>5.834052, "l3_if_downup"=>0.002296, "anchor"=>0.000381, "install_ssh_keys"=>0.004937, "service"=>0.718725, "file"=>0.104666, "cfg"=>0.001187, "host"=>0.00105}, :running=>0, :err_msg=>"Process not running but not empty lockfile is present. Trying to remove lockfile...ok.", :idling=>0, :runtime=>1, :stopped=>1, :resources=>{"failed_to_restart"=>0, "skipped"=>6, "failed"=>0, "scheduled"=>0, "total"=>173, "out_of_sync"=>21, "changed"=>21, "restarted"=>1}, :changes=>{"total"=>21}, :version=>{"config"=>1386793596, "puppet"=>"2.7.23"}, :lastrun=>1386793618, :events=>{"total"=>21, "failure"=>0, "success"=>21}}}
2013-12-11T20:26:59 debug: [8930] Nodes statuses: {"succeed"=>["1"], "error"=>[], "running"=>[]}
2013-12-11T20:26:59 debug: [8930] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"1", "status"=>"ready", "role"=>"ceph-osd"}]}

Orchestrator uploads the updated astute.yaml for the compute role:

2013-12-11T20:26:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'uploadfile', method 'upload', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:msg=>"File was uploaded!"}}

-rw------- 1 root root 3617 Dec 11 20:26 /etc/astute.yaml

Orchestrator decides not to run 'daemonize puppet apply' because puppet is already running (of course it is: it is not done with ceph-osd yet):

2013-12-11T20:26:59 debug: [8930] Waiting for puppet to finish deployment on all nodes (timeout = 3600 sec)...
2013-12-11T20:26:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793619 seconds ago", :runtime=>1386793619, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}

Then, for no good reason, it gives up on the instance of puppet that is still running and declares that it is now running the compute role (even though it never told MC to start a new puppet apply process):

2013-12-11T20:27:58 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793678 seconds ago", :runtime=>1386793678, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}
2013-12-11T20:27:59 debug: [8930] puppetd_runonce completed within 60 seconds.
2013-12-11T20:27:59 debug: [8930] Following nodes have puppet hung: '2'
2013-12-11T20:27:59 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793679 seconds ago", :runtime=>1386793679, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}
2013-12-11T20:27:59 debug: [8930] Nodes statuses: {"succeed"=>[], "error"=>[], "running"=>["2"]}
2013-12-11T20:27:59 debug: [8930] Got progress for nodes: [{"uid"=>"2", "progress"=>0}]
2013-12-11T20:27:59 debug: [8930] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "progress"=>0, "status"=>"deploying", "role"=>"compute"}]}
2013-12-11T20:28:01 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"running", :enabled=>1, :time=>{"last_run"=>0}, :idling=>0, :running=>1, :changes=>nil, :output=>"Currently running; last completed run 1386793681 seconds ago", :runtime=>1386793681, :stopped=>0, :resources=>{"failed"=>1, "total"=>0, "changed"=>0, "out_of_sync"=>0, "restarted"=>0}, :events=>nil, :lastrun=>0, :version=>nil}}
2013-12-11T20:28:01 debug: [8930] Nodes statuses: {"succeed"=>[], "error"=>[], "running"=>["2"]}

Despite what the orchestrator thought, puppet actually finishes applying the ceph-osd role:

<29>Dec 11 20:28:46 node-2 puppet-apply[3067]: (/Stage[main]/Ceph/Service[ceph]) Triggered 'refresh' from 1 events
<29>Dec 11 20:28:46 node-2 puppet-apply[3067]: Finished catalog run in 124.51 seconds

By that time, the orchestrator thinks it was the compute role that was running:

2013-12-11T20:28:47 debug: [8930] eb42a65b-feb1-4be1-abe6-72ba6577c774: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"2", :statuscode=>0, :statusmsg=>"OK", :data=>{:status=>"stopped", :err_msg=>"Process not running but not empty lockfile is present. Trying to remove lockfile...ok.", :enabled=>1, :time=>{"filebucket"=>0.000219, "config_retrieval"=>5.978734, "l3_if_downup"=>0.002556, "install_ssh_keys"=>0.003044, "cfg"=>0.018418, "total"=>78.579702, "exec"=>6.548802, "host"=>0.007485, "service"=>0.607738, "firewall"=>5.541325, "package"=>59.656183, "anchor"=>0.000483, "notify"=>0.031705, "file"=>0.18301, "last_run"=>1386793726}, :idling=>0, :running=>0, :changes=>{"total"=>116}, :output=>"Currently stopped; last completed run 1 seconds ago", :runtime=>1, :stopped=>1, :resources=>{"failed"=>0, "total"=>166, "skipped"=>6, "changed"=>114, "failed_to_restart"=>0, "out_of_sync"=>114, "scheduled"=>0, "restarted"=>9}, :events=>{"success"=>116, "total"=>116, "failure"=>0}, :lastrun=>1386793726, :version=>{"config"=>1386793595, "puppet"=>"2.7.23"}}}
2013-12-11T20:28:47 debug: [8930] Nodes statuses: {"succeed"=>["2"], "error"=>[], "running"=>[]}
2013-12-11T20:28:47 debug: [8930] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"ready", "role"=>"compute"}]}

End result: the cluster is deployed "successfully", no Nova is deployed on the compute node, no errors are reported in Fuel, and everything looks fine until you try to launch an instance.
2013-12-11 21:41:49  Dmitry Borodaenko  marked as duplicate of bug 1256244
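
The failure pattern in the log above, a completion check that cannot tell which puppet run it is observing, can be made concrete with a small model. The Ruby sketch below is hypothetical and only illustrative, not the actual Astute code: 'agent' stands in for an MCollective RPC client for the 'puppetd' agent, wait_for_role and wait_for_role_safely are made-up names, and the field names mirror the last_run_summary dumps above. The first method shows the buggy pattern; the second anchors completion to a lastrun value captured before the new run is requested, so a leftover run of the previous role cannot be credited to the new one.

    # Hypothetical model, not the actual Astute code. 'agent' stands in for
    # an MCollective RPC client for the 'puppetd' agent; results come back
    # as an array of hashes with a :data key, as in the log dumps above.

    # Buggy pattern: "puppet is stopped and lastrun is fresh" is taken to
    # mean "the role we just scheduled has finished". If a run of the
    # previous role was still in flight, its completion is credited to the
    # new role, which is exactly what happened to node-2.
    def wait_for_role(agent)
      loop do
        data = agent.last_run_summary.first[:data]
        return :ready if data[:stopped] == 1 && data[:lastrun].to_i > 0
        sleep 1
      end
    end

    # Safer variant: wait out any in-flight run first, remember lastrun
    # before requesting the new run, and accept only a completion whose
    # lastrun is strictly newer than the pre-request value.
    def wait_for_role_safely(agent)
      sleep 1 while agent.last_run_summary.first[:data][:running] == 1
      before = agent.last_run_summary.first[:data][:lastrun].to_i
      agent.runonce   # start puppet apply for the new role only now
      loop do
        data = agent.last_run_summary.first[:data]
        return :ready if data[:stopped] == 1 && data[:lastrun].to_i > before
        sleep 1
      end
    end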
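The repeated err_msg "Process not running but not empty lockfile is present. Trying to remove lockfile...ok." points at a second fragile spot: daemonize is invoked with the same file, /tmp/puppetd.lock, as both lockfile (-l) and pidfile (-p), so any check that judges liveness by the pid stored in that file can race with daemonize between creating the file and writing the pid. Below is a minimal sketch of such a stale-lock check; the names and structure are assumptions for illustration, not the actual MCollective agent code.

    LOCKFILE = '/tmp/puppetd.lock'

    # Liveness check based on the pid recorded in the lockfile. If the pid
    # is gone, the lock is treated as stale and removed. The pid.zero? case
    # (file created by -l, pid for -p not yet written) is the window in
    # which a live run can be mistaken for a dead one.
    def puppet_running?
      return false unless File.exist?(LOCKFILE)
      pid = File.read(LOCKFILE).to_i
      return false if pid.zero?     # empty or half-written lockfile
      Process.kill(0, pid)          # signal 0 probes existence only
      true
    rescue Errno::ESRCH
      File.delete(LOCKFILE)         # "Trying to remove lockfile...ok."
      false
    end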