Astute reports that puppet hung on the 2nd controller and fails the deployment even though the puppet run was successful

Bug #1392779 reported by Artem Panchenko
This bug affects 2 people
Affects              Status         Importance   Assigned to         Milestone
Fuel for OpenStack   Fix Released   High         Vladimir Sharshov
  5.1.x              Won't Fix      High         Vladimir Sharshov
  6.0.x              Won't Fix      High         Vladimir Sharshov

Bug Description

[root@nailgun ~]# fuel --f
api: '1.0'
astute_sha: 0085021fe327f6f910901b3ca55051b1df33a96e
auth_required: true
build_id: 2014-11-14_11-06-51
build_number: '107'
feature_groups:
- mirantis
fuellib_sha: 37abac00bc2e5d3d2cac7d512c2cabde05428797
fuelmain_sha: caf72f3e776814371adcec70ab46e37a867bedeb
nailgun_sha: 37a8122f323aa21a1d5858f2e4f0318c64fdda3f
ostf_sha: 82465a94eed4eff1fc8d8e1f2fb7e9993c22f068
production: docker
release: '6.0'

Reproduced on the latest 6.0 ISO:

http://jenkins-product.srt.mirantis.net:8080/job/6.0.ubuntu.bvt_2/88/testReport/junit/(root)/deploy_ha_vlan/deploy_ha_vlan/

Astute reported that puppet on the 2nd controller hung:

2014-11-14T13:20:01 warning: [399] Following nodes have puppet hung: '2'
...
2014-11-14T13:20:03 debug: [399] Nodes statuses: {"succeed"=>[], "error"=>["2"], "running"=>["5"]}
2014-11-14T13:20:03 debug: [399] Node "2" has failed to deploy. There is no more retries for puppet run.
...
2014-11-14T13:20:03 debug: [399] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}, {"uid"=>"5", "progress"=>0, "status"=>"deploying", "role"=>"controller"}]}
2014-11-14T13:20:03 debug: [399] Data send by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"controller"}]}

But there were no 'puppet' processes on that slave (node-2); here is its puppet report:

http://paste.openstack.org/show/133185/

Puppet logs contained no errors and 'crm status' showed that the node was online. The node also successfully reported its status via mcollective:

http://paste.openstack.org/show/133171/

Re-deployment of the environment was successful. This issue is probably related to https://bugs.launchpad.net/mos/+bug/1385079

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Changed in fuel:
status: New → Confirmed
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The bug no longer reproduces, as far as I can see. Marking it as a duplicate of https://bugs.launchpad.net/mos/+bug/1385079

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

It was reproduced again. Reopening the bug for further investigation.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Release notes: occasionally the problem described in this bug does occur. In most cases a simple redeploy of the affected cluster should solve the problem.

Reason to move from 6.0 to 6.1: to fix it we need to change the mechanism used to detect the state of puppet on the nodes. That is too big a change for a release that is about to enter hard freeze.

tags: added: docs
Changed in fuel:
milestone: 6.0 → 6.1
no longer affects: fuel/6.1.x
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/138790

Revision history for this message
Mykhaylo Slobodyan (mykhaylo-slobodyan) wrote :

Adding a sighting on 5.1.1-48 RC3. Reproduced using the deploy_ha_flat autotest. After the deploy, one of the two compute nodes was in the error state. The puppet run didn't produce any errors and finished successfully, but in the logs I see:

2014-12-08 16:39:48 WARNING [436] Following nodes have puppet hung: '4'

A simple redeploy solved the problem with that compute node. Logs attached.

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on 6.0 exactly as described in https://bugs.launchpad.net/mos/+bug/1385079/comments/11

What happened:
# time frame in the diagnostic snapshot:
# from 2014-12-30 12:16:00 to 2014-12-30 12:18:30
# nodes 7, 12 and 13 were deploying; the deployment of nodes 7 and 12 failed.

1) From 'deploy_nodes', the method 'PuppetTask.run()' was prepared and called for each of nodes 7, 12 and 13:
   https://github.com/stackforge/fuel-astute/blob/master/lib/astute/deployment_engine/granular_deployment.rb#L88

2) First check of puppet status just to store the status (2014-12-30T12:16:01, method 'last_run_summary' in astute.log):
    https://github.com/stackforge/fuel-astute/blob/master/lib/astute/puppet_task.rb#L36

3) Second check of puppet status to make sure that puppet is stopped (2014-12-30T12:16:01, method 'last_run_summary' in astute.log):
    https://github.com/stackforge/fuel-astute/blob/master/lib/astute/puppet_task.rb#L121

4) Run the puppet agent on the nodes (2014-12-30T12:16:01, method 'runonce' in astute.log):
    https://github.com/stackforge/fuel-astute/blob/master/lib/astute/puppet_task.rb#L128

5) The next puppet status check should be done only after "sleep Astute.config.PUPPET_FADE_INTERVAL":
    https://github.com/stackforge/fuel-astute/blob/master/lib/astute/puppet_task.rb#L131

    But the puppet status is checked immediately after the task run, here (2014-12-30T12:16:02, method 'last_run_summary' in astute.log):
    https://github.com/stackforge/fuel-astute/blob/master/lib/astute/deployment_engine/granular_deployment.rb#L94

    This caused the lock file on nodes 7 and 12 to be removed before the puppet process had started (see the sketch after this list):
    ... "Process not running but not empty lockfile is present. Trying to remove lockfile...ok."
    http://paste.openstack.org/show/155223/
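
To make the race easier to follow, here is a minimal Ruby sketch of the behaviour described in steps 2)-5). It is an illustrative reconstruction based on the quoted log message, not the verbatim code from 'mcagents/puppetd.rb'; the lockfile handling and the 'puppet_running?' helper are simplified assumptions.

LOCKFILE = '/tmp/puppetd.lock'

# Assumed helper: true if a puppet process is currently visible.
def puppet_running?
  !`ps -C puppet -o pid= 2>/dev/null`.split.first.nil?
end

# Simplified status check; the real agent distinguishes more states
# and reads the PID recorded inside the lockfile.
def puppet_daemon_status
  if File.exist?(LOCKFILE) && !File.zero?(LOCKFILE) && !puppet_running?
    # The window hit on nodes 7 and 12: 'runonce' has already created a
    # non-empty lockfile, but the puppet process has not appeared yet,
    # so the check mistakes the lock for a stale one and removes it:
    #   "Process not running but not empty lockfile is present.
    #    Trying to remove lockfile...ok."
    File.delete(LOCKFILE)
  end
  return 'running' if File.exist?(LOCKFILE) && puppet_running?
  'idling' # with the marker gone, later checks keep reporting 'idling'
end

# The extra status check at 12:16:02 (issued right after 'runonce'
# instead of after PUPPET_FADE_INTERVAL) landed inside that window and
# removed the lock, so Astute saw 'idling' until it declared puppet hung.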

As a result, the puppet status returned 'idling' for nodes 7 and 12 despite the fact that puppet ran successfully. Two minutes later astute failed the deploy:
2014-12-30T12:18:27 warning: [393] Following nodes have puppet hung: '7,12'
https://github.com/stackforge/fuel-astute/blob/master/lib/astute/puppet_task.rb#L134

Suggestions:

- move 'sleep Astute.config.PUPPET_DEPLOY_INTERVAL' from L119 to L94 (put it before 'case status = check_status(node_id)') in 'granular_deployment.rb'; see the sketch after this list

- consider not deleting '/tmp/puppetd.lock' every time the status is checked:
   https://github.com/stackforge/fuel-astute/blob/master/mcagents/puppetd.rb#L128
   Removing the lock file makes no sense at that moment; the removal could instead be done in the "runonce" method before checking 'set_status'.
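
For illustration, here is a minimal Ruby sketch of the first suggestion. The loop structure and all names except 'check_status' and 'Astute.config.PUPPET_DEPLOY_INTERVAL' are placeholders, not the exact code from 'granular_deployment.rb'.

# Sketch: sleep *before* polling the puppet status instead of after,
# so the puppet daemon has time to start and write its lockfile.
def poll_puppet(node_id)
  loop do
    # Moved up from L119: without this delay the first check_status
    # call races with puppet startup and the mcagent removes the
    # freshly created lockfile (see the sketch above).
    sleep Astute.config.PUPPET_DEPLOY_INTERVAL

    case status = check_status(node_id) # was called immediately at L94
    when 'ready', 'error'
      return status
    when 'deploying'
      next # still running; poll again after the next interval
    else
      raise "Internal error. Unknown status '#{status}'"
    end
  end
end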

P.S.: please note that there were two simultaneously running deploys in the diagnostic snapshot.

{"build_id": "2014-12-26_14-25-46", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "58", "auth_required": true, "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-26_14-25-46", "ostf_sha": "a9afb68710d809570460c29d6c32...


Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/146445

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-astute (master)

Change abandoned by Vladimir Sharshov (<email address hidden>) on branch: master
Review: https://review.openstack.org/138790
Reason: These changes are included in the following changes:
- https://review.openstack.org/#/c/146464/
- https://review.openstack.org/#/c/141320/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/146445
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=072dc2b253b2930a4fe3612014c948d212c5fc08
Submitter: Jenkins
Branch: master

commit 072dc2b253b2930a4fe3612014c948d212c5fc08
Author: Dennis Dmitriev <email address hidden>
Date: Mon Jan 12 14:17:38 2015 +0200

    Avoid calling 'puppet_daemon_status' twice at the same moment

    If the MCollective agent calls 'puppet_daemon_status' from puppetd.rb
    twice without a delay, this method can remove 'puppetd.lock' before
    the puppet process has appeared. This causes wrong puppet status
    detection ('idle'), despite the fact that puppet has been started.

    After run_task(), where 'puppet_daemon_status' is called inside the
    MCollective script, let's wait *before* calling check_status() from
    Astute.

    Change-Id: I2a6e2cf86b58b4c758bfca77addfbb8e4943f3d0
    Related-Bug: #1392779

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Fix Released