Astute reports that puppet hung on the 2nd controller and fails the deployment even though the puppet run was successful

Bug #1392779 reported by Artem Panchenko
This bug affects 2 people
Affects              Status         Importance   Assigned to         Milestone
Fuel for OpenStack   Fix Released   High         Vladimir Sharshov
  5.1.x              Won't Fix      High         Vladimir Sharshov
  6.0.x              Won't Fix      High         Vladimir Sharshov

Bug Description

[root@nailgun ~]# fuel --f
api: '1.0'
astute_sha: 0085021fe327f6f910901b3ca55051b1df33a96e
auth_required: true
build_id: 2014-11-14_11-06-51
build_number: '107'
feature_groups:
- mirantis
fuellib_sha: 37abac00bc2e5d3d2cac7d512c2cabde05428797
fuelmain_sha: caf72f3e776814371adcec70ab46e37a867bedeb
nailgun_sha: 37a8122f323aa21a1d5858f2e4f0318c64fdda3f
ostf_sha: 82465a94eed4eff1fc8d8e1f2fb7e9993c22f068
production: docker
release: '6.0'

Reproduced on the latest 6.0 ISO:

http://jenkins-product.srt.mirantis.net:8080/job/6.0.ubuntu.bvt_2/88/testReport/junit/(root)/deploy_ha_vlan/deploy_ha_vlan/

Astute reported that puppet on the 2nd controller hung:

2014-11-14T13:20:01 warning: [399] Following nodes have puppet hung: '2'
...
2014-11-14T13:20:03 debug: [399] Nodes statuses: {"succeed"=>[], "error"=>["2"], "running"=>["5"]}
2014-11-14T13:20:03 debug: [399] Node "2" has failed to deploy. There is no more retries for puppet run.
...
2014-11-14T13:20:03 debug: [399] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}, {"uid"=>"5", "progress"=>0, "status"=>"deploying", "role"=>"controller"}]}
2014-11-14T13:20:03 debug: [399] Data send by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}

But there were no 'puppet' processes on that slave (node-2); here is its puppet report:

http://paste.openstack.org/show/133185/

Puppet logs contained no errors and 'crm status' showed that the node was online. The node also successfully reported its status via mcollective:

http://paste.openstack.org/show/133171/

Re-deployment of the environment was successful. This issue is probably related to https://bugs.launchpad.net/mos/+bug/1385079

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Changed in fuel:
status: New → Confirmed
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The bug no longer reproduces, as far as I can see. Marking it as a duplicate of https://bugs.launchpad.net/mos/+bug/1385079

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

It was reproduced again. Reopening the bug for further investigation.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Release notes: occasionally the problem described in this bug does occur. In most cases a simple redeploy of the affected cluster should solve the problem.

Reason to move from 6.0 to 6.1: to fix it we need to change the mechanism used to detect the state of puppet on the nodes. That is too big a change for a release that is about to enter hard freeze.

tags: added: docs
Changed in fuel:
milestone: 6.0 → 6.1
no longer affects: fuel/6.1.x
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/138790

Revision history for this message
Mykhaylo Slobodyan (mykhaylo-slobodyan) wrote :

Adding a sighting on 5.1.1-48 RC3. Reproduced using the deploy_ha_flat autotest. After the deploy, one of the two compute nodes was in the error state. The puppet run didn't produce any errors and finished successfully, but in the logs I see:

2014-12-08 16:39:48 WARNING [436] Following nodes have puppet hung: '4'

A simple redeploy solved the problem with that compute node. Logs attached.

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on 6.0 exactly as described in https://bugs.launchpad.net/mos/+bug/1385079/comments/11

What happened:
# time frame in the diagnostic snapshot:
# from 2014-12-30 12:16:00 to 2014-12-30 12:18:30
# nodes 7, 12 and 13 were deploying; the deployment of nodes 7 and 12 failed.

1) From 'deploy_nodes', the method 'PuppetTask.run()' was prepared and called for each of nodes 7, 12 and 13:
   https://github.com/stackforge/fuel-astute/blob/master/lib/astute/deployment_engine/granular_deployment.rb#L88

2) First check of puppet status just to store the status (2014-12-30T12:16:01, method 'last_run_summary' in astute.log):
    https://github.com/stackforge/fuel-astute/blob/master/lib/astute/puppet_task.rb#L36

3) Second check of puppet status to make sure that puppet is stopped (2014-12-30T12:16:01, method 'last_run_summary' in astute.log):
    https://github.com/stackforge/fuel-astute/blob/master/lib/astute/puppet_task.rb#L121

4) Run the puppet agent on the nodes (2014-12-30T12:16:01, method 'runonce' in astute.log):
    https://github.com/stackforge/fuel-astute/blob/master/lib/astute/puppet_task.rb#L128

5) The next puppet status check should be done only after "sleep Astute.config.PUPPET_FADE_INTERVAL":
    https://github.com/stackforge/fuel-astute/blob/master/lib/astute/puppet_task.rb#L131

    But the puppet status is checked immediately after the task run, here (2014-12-30T12:16:02, method 'last_run_summary' in astute.log):
    https://github.com/stackforge/fuel-astute/blob/master/lib/astute/deployment_engine/granular_deployment.rb#L94

    This caused the lock file on nodes 7 and 12 to be removed before the puppet process had started (see the sketch after this list):
    ... "Process not running but not empty lockfile is present. Trying to remove lockfile...ok."
    http://paste.openstack.org/show/155223/
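
To make the race easier to follow, here is a minimal Ruby sketch of the behaviour described in steps 2)-5). It is an illustrative reconstruction based on the quoted log message, not the verbatim code from 'mcagents/puppetd.rb'; the lockfile handling and the 'puppet_running?' helper are simplified assumptions.

LOCKFILE = '/tmp/puppetd.lock'

# Assumed helper: true if a puppet process is currently visible.
def puppet_running?
  !`ps -C puppet -o pid= 2>/dev/null`.split.first.nil?
end

# Simplified status check; the real agent distinguishes more states
# and reads the PID recorded inside the lockfile.
def puppet_daemon_status
  if File.exist?(LOCKFILE) && !File.zero?(LOCKFILE) && !puppet_running?
    # The window hit on nodes 7 and 12: 'runonce' has already created a
    # non-empty lockfile, but the puppet process has not appeared yet,
    # so the check mistakes the lock for a stale one and removes it:
    #   "Process not running but not empty lockfile is present.
    #    Trying to remove lockfile...ok."
    File.delete(LOCKFILE)
  end
  return 'running' if File.exist?(LOCKFILE) && puppet_running?
  'idling' # with the marker gone, later checks keep reporting 'idling'
end

# The extra status check at 12:16:02 (issued right after 'runonce'
# instead of after PUPPET_FADE_INTERVAL) landed inside that window and
# removed the lock, so Astute saw 'idling' until it declared puppet hung.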

As a result, the puppet status returned 'idling' for nodes 7 and 12 despite the fact that puppet ran successfully. Two minutes later astute failed the deploy:
2014-12-30T12:18:27 warning: [393] Following nodes have puppet hung: '7,12'
https://github.com/stackforge/fuel-astute/blob/master/lib/astute/puppet_task.rb#L134

Suggestions:

- move 'sleep Astute.config.PUPPET_DEPLOY_INTERVAL' from L119 to L94 (put it before 'case status = check_status(node_id)') in 'granular_deployment.rb'; see the sketch after this list

- consider not deleting '/tmp/puppetd.lock' every time the status is checked:
   https://github.com/stackforge/fuel-astute/blob/master/mcagents/puppetd.rb#L128
   Removing the lock file makes no sense at that moment; the removal could instead be done in the "runonce" method before checking 'set_status'.
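
For illustration, here is a minimal Ruby sketch of the first suggestion. The loop structure and all names except 'check_status' and 'Astute.config.PUPPET_DEPLOY_INTERVAL' are placeholders, not the exact code from 'granular_deployment.rb'.

# Sketch: sleep *before* polling the puppet status instead of after,
# so the puppet daemon has time to start and write its lockfile.
def poll_puppet(node_id)
  loop do
    # Moved up from L119: without this delay the first check_status
    # call races with puppet startup and the mcagent removes the
    # freshly created lockfile (see the sketch above).
    sleep Astute.config.PUPPET_DEPLOY_INTERVAL

    case status = check_status(node_id) # was called immediately at L94
    when 'ready', 'error'
      return status
    when 'deploying'
      next # still running; poll again after the next interval
    else
      raise "Internal error. Unknown status '#{status}'"
    end
  end
end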

P.S.: please note that there were two simultaneously running deploys in the diagnostic snapshot.

{"build_id": "2014-12-26_14-25-46", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "58", "auth_required": true, "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-26_14-25-46", "ostf_sha": "a9afb68710d809570460c29d6c32...


Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/146445

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-astute (master)

Change abandoned by Vladimir Sharshov (<email address hidden>) on branch: master
Review: https://review.openstack.org/138790
Reason: These changes are included in the following changes:
- https://review.openstack.org/#/c/146464/
- https://review.openstack.org/#/c/141320/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/146445
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=072dc2b253b2930a4fe3612014c948d212c5fc08
Submitter: Jenkins
Branch: master

commit 072dc2b253b2930a4fe3612014c948d212c5fc08
Author: Dennis Dmitriev <email address hidden>
Date: Mon Jan 12 14:17:38 2015 +0200

    Avoid calling 'puppet_daemon_status' twice at the same moment

    If the MCollective agent calls 'puppet_daemon_status' from puppetd.rb
    twice without a delay, this method can remove 'puppetd.lock' before
    the puppet process has appeared. This causes wrong puppet status
    detection ('idle'), despite the fact that puppet has been started.

    After run_task(), where 'puppet_daemon_status' is called inside the
    MCollective script, let's wait *before* calling check_status() from
    Astute.

    Change-Id: I2a6e2cf86b58b4c758bfca77addfbb8e4943f3d0
    Related-Bug: #1392779

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Fix Released