CentOS fails due to deployment timeout

Bug #1384332 reported by Ryan Moe
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Committed
High
Vladimir Sharshov

Bug Description

This happens on CentOS with Neutron VLAN.

The deployment times out because Astute keeps re-running puppet on the primary controller. There are no apparent errors in the puppet logs but Astute re-runs puppet because the lockfile created by daemonize doesn't get removed.

From the astute logs:

"Process not running but not empty lockfile is present. Trying to remove lockfile...ok."

After that Astute re-runs puppet and the same error occurs until finally the deployment times out.

Tags: astute
Revision history for this message
Ryan Moe (rmoe) wrote :
Changed in fuel:
assignee: nobody → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Ryan Moe (rmoe) is right.

For some reason daemonize do not remove lock file.

Sun Oct 19 18:08:16 +0000 2014 Puppet (debug): Finishing transaction 70038095814780
Sun Oct 19 18:08:16 +0000 2014 Puppet (debug): Storing state
Sun Oct 19 18:08:16 +0000 2014 Puppet (info): Creating state file /var/lib/puppet/state/state.yaml
Sun Oct 19 18:08:16 +0000 2014 Puppet (debug): Stored state in 0.09 seconds
Sun Oct 19 18:08:16 +0000 2014 Puppet (notice): Finished catalog run in 3802.51 seconds
Sun Oct 19 18:08:19 +0000 2014 Puppet (info): Loading facts in /etc/puppet/modules/corosync/lib/facter/pacemaker_hostname.rb

2014-10-19T18:08:18 debug: [411] 6b8d7ce1-9af5-45f7-89d5-7f4ef28ee238: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"3", :statuscode=>0, :statusmsg=>"OK", <...> , :runtime=>2, :enabled=>1, :err_msg=>"Process not running but not empty lockfile is present. Trying to remove lockfile...ok.", :version=>{"config"=>1413738274, "puppet"=>"3.4.2"}, :idling=>0}}

And again:

Sun Oct 19 18:33:08 +0000 2014 Puppet (debug): Finishing transaction 70257381676520
Sun Oct 19 18:33:08 +0000 2014 Puppet (debug): Storing state
Sun Oct 19 18:33:08 +0000 2014 Puppet (debug): Stored state in 0.36 seconds
Sun Oct 19 18:33:08 +0000 2014 Puppet (notice): Finished catalog run in 1461.06 seconds
Sun Oct 19 18:33:11 +0000 2014 Puppet (info): Loading facts in /etc/puppet/modules/corosync/lib/facter/pacemaker_hostname.rb

2014-10-19T18:33:11 debug: [411] 6b8d7ce1-9af5-45f7-89d5-7f4ef28ee238: MC agent 'puppetd', method 'last_run_summary', results: {:sender=>"3", :statuscode=>0, :statusmsg=>"OK", <...>, :runtime=>3, :enabled=>1, :err_msg=>"Process not running but not empty lockfile is present. Trying to remove lockfile...ok.", :version=>{"config"=>1413742101, "puppet"=>"3.4.2"}, :idling=>0}}

In other hand puppet logs contain many errors. This is unexpected behavior and we should try to reproduce it.
  release: "6.0"
  build_number: "104"
  build_id: "2014-10-19_18-46-46"

Changed in fuel:
status: New → Confirmed
importance: Undecided → High
milestone: none → 6.0
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Simple way to solve this issue. Add more time in this case do not solve problem.

Try to find is any changes was released in packages base related to puppet or daemonize.

tags: added: astute
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

No packages was changes. Decide to solve problem via change of puppet run detection mechanism.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

This behavior not a bug: with/without pid file of finished process will interprete as 'stopped'. Default behavior in this case: 3 runs or 1 run + 2 retries.

I suggest to decrease number of retries to 1. 1 run and 1 retries will save a lot of time in case of puppet fail.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/131696

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/131974

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/131974
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=24f5e3d44feba223e424c49676c1ead07ae1e284
Submitter: Jenkins
Branch: master

commit 24f5e3d44feba223e424c49676c1ead07ae1e284
Author: Sergey Vasilenko <email address hidden>
Date: Thu Oct 30 18:57:00 2014 +0300

    Optimize neutron-api waiting cycle

    Currently there're 60 attempts to reach Neutron API with short delays between
    them. It leads to very slow deployment failure in case of network problems,
    because client's connection to server hangs for 60 seconds itself.
    The idea is to change default timeout for Neutron HTTP connection.

    Change-Id: Icc4caa3c4eb3d22f24ff9ae7066ad1e60e65bb18
    Related-bug: #1384332

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/131696
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=3c374c9f7bfbdbcd7ce2f716cd704e3044e6fb41
Submitter: Jenkins
Branch: master

commit 3c374c9f7bfbdbcd7ce2f716cd704e3044e6fb41
Author: Vladimir Sharshov <email address hidden>
Date: Wed Oct 29 13:36:06 2014 +0300

    Decrease number of retries for puppet deployments

    Default behavior before: 3 puppet runs for the
    current role on the current node: 1 run + 2 retries

    Now we will run only 2 times: 1 run + 1 retries

    In 99 out of 100 cases, if the problem persists
    when we restart puppet, it will not disappear after
    next restart. That's why it makes no sense to run
    more than 2 times. Thus we save time, follow the
    ideology of 'fail fast', speed up deployment
    process, show user more detailed message about
    problem instead of 'deployment timeout'.

    Change-Id: Idd4b2a2e7ecdeff9ec298a2c5ceaea91f74cee4c
    Related-Bug: #1384332
    DocImpact

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Final result:

Message "Process not running but not empty lockfile is present. Trying to remove lockfile...ok." inform us about potential problem with puppet run and that is all. Puppet run again and failed/succeeded as expected. Real problem was time spent for puppet fail runs which take whole time for node. To prevent such behavior we do 2 things:

- decrease number of retries from 2 to 1 for puppet runs;
- decrease http-timeout from 60 to 4 seconds.

Both changes should help to mark node as error by puppet instead of error by timeout and got result much earlier.

Revision history for this message
Sergey Vasilenko (xenolog) wrote :

> I suggest to decrease number of retries to 1. 1 run and 1 retries will save a lot of time in case of puppet fail.

It's a goodest behavior.

Some times some performance-critical services (i.e. Galera) may didn't start by its own timeout. And all resources, that depends of it will don't be ensured. But this service stand up while deployment process. In this case 2-d pass (1-st repeat) of puppet run will fix deployment.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.