Bug #1384332 “CentOS fails due to deployment timeout” : Bugs : Fuel for OpenStack

Ryan Moe (rmoe) wrote on 2014-10-22:

#1

fail_error_deploy_neutron_vlan_ha-2014_10_19__18_35_40.tar.gz (5.5 MiB, application/x-tar)

Changed in fuel:
assignee:	nobody → Vladimir Sharshov (vsharshov)

Vladimir Sharshov (vsharshov) wrote on 2014-10-23:

#2

Ryan Moe (rmoe) is right.

For some reason daemonize does not remove the lock file.

Sun Oct 19 18:08:16 +0000 2014 Puppet (debug): Finishing transaction 70038095814780
Sun Oct 19 18:08:16 +0000 2014 Puppet (debug): Storing state
Sun Oct 19 18:08:16 +0000 2014 Puppet (info): Creating state file /var/lib/puppet/state/state.yaml
Sun Oct 19 18:08:16 +0000 2014 Puppet (debug): Stored state in 0.09 seconds
Sun Oct 19 18:08:16 +0000 2014 Puppet (notice): Finished catalog run in 3802.51 seconds
Sun Oct 19 18:08:19 +0000 2014 Puppet (info): Loading facts in /etc/puppet/modules/corosync/lib/facter/pacemaker_hostname.rb

2014-10-19T18:08:18 debug: [411] 6b8d7ce1-9af5-45f7-89d5-7f4ef28ee238: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"3", :statuscode=>0, :statusmsg=>"OK", <...> , :runtime=>2, :enabled=>1, :err_msg=>"Process not running but not empty lockfile is present. Trying to remove lockfile...ok.", :version=>{"config"=>1413738274, "puppet"=>"3.4.2"}, :idling=>0}}

And again:

Sun Oct 19 18:33:08 +0000 2014 Puppet (debug): Finishing transaction 70257381676520
Sun Oct 19 18:33:08 +0000 2014 Puppet (debug): Storing state
Sun Oct 19 18:33:08 +0000 2014 Puppet (debug): Stored state in 0.36 seconds
Sun Oct 19 18:33:08 +0000 2014 Puppet (notice): Finished catalog run in 1461.06 seconds
Sun Oct 19 18:33:11 +0000 2014 Puppet (info): Loading facts in /etc/puppet/modules/corosync/lib/facter/pacemaker_hostname.rb

2014-10-19T18:33:11 debug: [411] 6b8d7ce1-9af5-45f7-89d5-7f4ef28ee238: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"3", :statuscode=>0, :statusmsg=>"OK", <...>, :runtime=>3, :enabled=>1, :err_msg=>"Process not running but not empty lockfile is present. Trying to remove lockfile...ok.", :version=>{"config"=>1413742101, "puppet"=>"3.4.2"}, :idling=>0}}

On the other hand, the puppet logs contain many errors. This is unexpected behavior, and we should try to reproduce it.
  release: "6.0"
  build_number: "104"
  build_id: "2014-10-19_18-46-46"
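
The check behind the "Process not running but not empty lockfile is present" message can be sketched as follows. This is a minimal illustration, not the actual code of the MCollective 'puppetd' agent: it assumes a POSIX system and assumes that a non-empty lockfile holds the PID of the process that created it. `lock_state` and `clear_stale_lock` are hypothetical names.

```python
import os


def lock_state(lockfile_path):
    """Classify a lockfile as 'absent', 'empty', 'running' or 'stale'.

    Assumption: a non-empty lockfile contains the PID of its owner.
    """
    try:
        with open(lockfile_path) as handle:
            content = handle.read().strip()
    except FileNotFoundError:
        return "absent"
    if not content:
        return "empty"
    try:
        pid = int(content)
    except ValueError:
        return "stale"  # not a PID, so no live process can own it
    if pid <= 0:
        return "stale"  # 0 and negatives address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 only checks that the PID exists
    except ProcessLookupError:
        return "stale"  # process not running, but a non-empty lockfile is present
    except PermissionError:
        return "running"  # the process exists but belongs to another user
    return "running"


def clear_stale_lock(lockfile_path):
    """Remove the lockfile only when its owner is gone; return True if removed."""
    if lock_state(lockfile_path) != "stale":
        return False
    os.remove(lockfile_path)
    return True
```

In this sketch a stale lock is cleaned up and reported, but it says nothing about whether the previous run succeeded, so the caller still has to decide whether to start a new run.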

Vladimir Sharshov (vsharshov) on 2014-10-23

Changed in fuel:
status:	New → Confirmed
importance:	Undecided → High
milestone:	none → 6.0

Vladimir Sharshov (vsharshov) wrote on 2014-10-24:

#3

The simple way to solve this issue, adding more time, does not solve the problem in this case.

I will try to find out whether any changes related to puppet or daemonize were released in the package base.

Dmitry Borodaenko (angdraug) on 2014-10-24

tags:

added: astute

Vladimir Sharshov (vsharshov) wrote on 2014-10-28:

#4

No packages were changed. We decided to solve the problem by changing the puppet run detection mechanism.

Vladimir Sharshov (vsharshov) wrote on 2014-10-29:

#5

This behavior is not a bug: a finished process is interpreted as 'stopped' with or without its pid file. The default behavior in this case is 3 runs, that is, 1 run + 2 retries.

I suggest decreasing the number of retries to 1. 1 run and 1 retry will save a lot of time when puppet fails.
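
The retry policy discussed here can be sketched as a small loop. This is an illustration only, not Astute's implementation: `run_with_retries` and `always_fails` are hypothetical names, and the `run_once` callable stands in for one full puppet run that returns True on success.

```python
def run_with_retries(run_once, retries):
    """Call run_once until it succeeds or the retry budget is used up.

    Returns (succeeded, attempts_made). The total number of runs is at
    most 1 + retries: the initial run plus each retry.
    """
    attempts = 0
    for _ in range(1 + retries):
        attempts += 1
        if run_once():
            return True, attempts
    return False, attempts


def always_fails():
    """Stand-in for a puppet run whose problem persists across restarts."""
    return False


print(run_with_retries(always_fails, retries=2))  # old default: 1 run + 2 retries
print(run_with_retries(always_fails, retries=1))  # proposed: 1 run + 1 retry
```

With a run that always fails, the first call makes three attempts and the second makes two, so a persistent failure is reported one full run earlier.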

OpenStack Infra (hudson-openstack) wrote on 2014-10-29: Fix proposed to fuel-astute (master)

#6

Fix proposed to branch: master
Review: https://review.openstack.org/131696

Changed in fuel:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2014-10-30: Related fix proposed to fuel-library (master)

#7

Related fix proposed to branch: master
Review: https://review.openstack.org/131974

OpenStack Infra (hudson-openstack) wrote on 2014-11-06: Related fix merged to fuel-library (master)

#8

Reviewed: https://review.openstack.org/131974
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=24f5e3d44feba223e424c49676c1ead07ae1e284
Submitter: Jenkins
Branch: master

commit 24f5e3d44feba223e424c49676c1ead07ae1e284
Author: Sergey Vasilenko <email address hidden>
Date: Thu Oct 30 18:57:00 2014 +0300

    Optimize neutron-api waiting cycle

    Currently there're 60 attempts to reach Neutron API with short delays between
    them. It leads to very slow deployment failure in case of network problems,
    because client's connection to server hangs for 60 seconds itself.
    The idea is to change default timeout for Neutron HTTP connection.

    Change-Id: Icc4caa3c4eb3d22f24ff9ae7066ad1e60e65bb18
    Related-bug: #1384332

OpenStack Infra (hudson-openstack) wrote on 2014-11-07: Related fix merged to fuel-astute (master)

#9

Reviewed: https://review.openstack.org/131696
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=3c374c9f7bfbdbcd7ce2f716cd704e3044e6fb41
Submitter: Jenkins
Branch: master

commit 3c374c9f7bfbdbcd7ce2f716cd704e3044e6fb41
Author: Vladimir Sharshov <email address hidden>
Date: Wed Oct 29 13:36:06 2014 +0300

    Decrease number of retries for puppet deployments

    Default behavior before: 3 puppet runs for the
    current role on the current node: 1 run + 2 retries

    Now we will run only 2 times: 1 run + 1 retries

    In 99 out of 100 cases, if the problem persists
    when we restart puppet, it will not disappear after
    next restart. That's why it makes no sense to run
    more than 2 times. Thus we save time, follow the
    ideology of 'fail fast', speed up deployment
    process, show user more detailed message about
    problem instead of 'deployment timeout'.

    Change-Id: Idd4b2a2e7ecdeff9ec298a2c5ceaea91f74cee4c
    Related-Bug: #1384332
    DocImpact

Vladimir Sharshov (vsharshov) on 2014-11-10

Changed in fuel:
status:	In Progress → Fix Committed

Vladimir Sharshov (vsharshov) wrote on 2014-11-10:

#10

Final result:

The message "Process not running but not empty lockfile is present. Trying to remove lockfile...ok." only informs us about a potential problem with the puppet run, and that is all. Puppet ran again and failed or succeeded as expected. The real problem was the time spent on failed puppet runs, which used up the whole time allotted to the node. To prevent such behavior we did 2 things:

- decreased the number of retries for puppet runs from 2 to 1;
- decreased the http-timeout from 60 to 4 seconds.

Both changes should help mark a node as failed by a puppet error instead of by a timeout, and deliver the result much earlier.
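
A rough worst-case estimate shows why the HTTP timeout matters. The sketch below takes the 60 attempts mentioned in the fuel-library commit message in comment #8 and the 60 and 4 second timeouts from the list above. The 2 second delay between attempts is an invented placeholder for the "short delays" in that commit message, and whether the attempt count stayed at 60 after the fix is not stated in this thread. `worst_case_wait` is a hypothetical helper.

```python
def worst_case_wait(attempts, http_timeout, delay_between):
    """Seconds spent if every attempt hangs until the HTTP timeout expires."""
    return attempts * http_timeout + (attempts - 1) * delay_between


ASSUMED_DELAY = 2  # placeholder; the thread only says the delays are short

before = worst_case_wait(attempts=60, http_timeout=60, delay_between=ASSUMED_DELAY)
after = worst_case_wait(attempts=60, http_timeout=4, delay_between=ASSUMED_DELAY)
print(before, after)
```

Under these assumptions the waiting cycle alone can take about an hour with the old timeout and a few minutes with the new one, so a network problem surfaces as a puppet error well before the deployment timeout.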

Sergey Vasilenko (xenolog) wrote on 2014-11-10:

#11

> I suggest decreasing the number of retries to 1. 1 run and 1 retry will save a lot of time when puppet fails.

That is the best behavior.

Sometimes a performance-critical service (e.g. Galera) may fail to start within its own timeout, and then none of the resources that depend on it will be ensured. But the service may still come up during the deployment process. In this case the second pass (the first retry) of the puppet run will fix the deployment.
