Deployment failed with "Lock file and PID file exist; puppet is running" if a new deployment is started right after offline node removal

Bug #1496411 reported by Alexander Kurenyshev
This bug affects 1 person
Affects              Status     Importance  Assigned to       Milestone
Fuel for OpenStack   Invalid    High        Ivan Ponomarev
7.0.x                Invalid    High        Ivan Ponomarev
8.0.x                Won't Fix  Medium      Ivan Ponomarev
Mitaka               Invalid    High        Ivan Ponomarev

Bug Description

Steps to reproduce:

1) Deploy 3 controllers and 1 compute
2) Shut down the primary controller node
3) Wait until Nailgun marks the primary controller as offline. Remove it from the cluster by clicking the 'Remove' button next to the offline node
4) Wait until the removal process has finished (the node is removed from the UI, no new deployment has been started)
5) Add another node to the cluster as the third controller
6) Deploy changes

Expected result:
Deployment passes. There are no errors in the logs.

Actual result:
Deployment fails with a timeout error, but Astute is still working and the deployment is actually in progress.

Fuel RC2 was used.

Logs are available at https://drive.google.com/file/d/0BzdDsIW-ymG2OUFXc3pzOG9ValE/view?usp=sharing
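
To confirm that Astute is still processing the deployment despite the timeout reported in the UI, one can check the task list and the Astute log on the master node. A rough sketch (the log path assumes the default Docker-based layout of the 7.0 master node):

[root@nailgun ~]# fuel task                                        # the 'deployment' task should still be in 'running' status
[root@nailgun ~]# tail -f /var/log/docker-logs/astute/astute.log   # Astute keeps reporting progress for the deployment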

Revision history for this message
Andrey Maximov (maximov) wrote :

What about a workaround? If you repeat the deployment, will it work?

Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

Actually I don't know; the environment doesn't exist anymore. Let me try to reproduce it.

Revision history for this message
Andrey Maximov (maximov) wrote :

Setting to Incomplete, as we need to know the reproduction rate to set the priority correctly.
If it is reproduced often, this is critical.

Changed in fuel:
status: New → Incomplete
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Marking it as High, because as far as I know it has been reproduced only once.

We investigated this env with Vitaly Kramskikh: it looks like the env contains several deployment tasks. The UI used the first of them, which failed. Astute and Nailgun report a ready status, but it does not affect the UI or the cluster status.

We need to try to reproduce it.

Changed in fuel:
importance: Critical → High
Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

When I was trying to reproduce this bug I got an error again, but not the same one.
This time the error is:
Deployment has failed. Method granular_deploy. 43fe926e-e21d-4646-837e-02be274be575: MCollective call failed in agent 'puppetd', method 'runonce', failed nodes:
ID: 2 - Reason: Lock file and PID file exist; puppet is running.

V. Sharshov and I decided to attach this error to this bug.
Logs are here https://drive.google.com/file/d/0BzdDsIW-ymG2R2xNRW5pNERPRUk/view?usp=sharing
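
For reference, on the failed node one can check whether a Puppet run is actually still in progress or whether a stale lock was left behind. A sketch (the lock and PID file locations assume the stock Puppet 3 paths):

# on the node reported in the error (ID 2 above)
ls -l /var/lib/puppet/state/agent_catalog_run.lock    # lock file the error message refers to
cat /var/run/puppet/agent.pid                          # PID recorded by the agent
ps -fp "$(cat /var/run/puppet/agent.pid)"              # check whether that PID is really a running puppet process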

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)
tags: added: module-astute
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on the following configuration:

Network: Neutron/Tun, tagged networks;
Volumes: Ceph RBD for ephemeral volumes and disabled Cinder LVM
Images: Swift
Nodes:
3 Controllers,
1 Compute,
3 Ceph

Scenario for deployed cluster:
1. Shut down the Nailgun primary controller node (find 'primary-controller' in `$ hiera roles` output on one of the controllers)
2. Wait until Nailgun marks the primary controller as offline and remove it from the cluster
3. Add another node to the cluster as the third controller
4. Deploy changes

Result:
-------------
Error
Deployment has failed. Method granular_deploy. 816deea0-4a4f-4f7f-a2be-50a18016fd1c: MCollective call failed in agent 'puppetd', method 'runonce', failed nodes:
ID: 20 - Reason: Lock file and PID file exist; puppet is running.
.

This is caused by the fact that when an offline node is removed from the cluster, a 'hidden' deployment is started that performs some actions on the remaining controllers.

If step #3 is performed without waiting for this 'hidden' action to finish, then the error "Lock file and PID file exist; puppet is running" appears on the dashboard, but the 'hidden' action continues; see the attached screenshot.

If we wait for the 'hidden' action to finish after step #2, then the deployment to add the third controller fails with a timeout, the same as in the bug description.
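
Until this is fixed, a possible way to avoid the lock-file failure is to wait for the 'hidden' deployment to finish before adding the new controller, e.g. by polling the task list on the master node. A minimal sketch based on the `fuel task` output format shown later in this report (column positions may differ between releases):

# wait until no task in the cluster is still running before clicking 'Deploy changes'
while fuel task | awk -F'|' '{print $2}' | grep -q running; do
    echo "a background deployment is still running, waiting..."
    sleep 30
done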

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Moving to 8.0 because of HCF in the 7.0 release.

Changed in fuel:
milestone: 7.0 → 8.0
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

The root cause is still unclear. We will not deliver this in 7.0-mu1.

Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

QA, please verify whether this bug is still reproducible.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

Vova, yes, it's still reproducible, with the same steps as in the bug description.

Fuel used:
  2015.1.0-8.0:
    VERSION:
      api: '1.0'
      astute_sha: 959b06c5ef8143125efd1727d350c050a922eb12
      build_id: '116'
      build_number: '116'
      feature_groups:
      - mirantis
      fuel-agent_sha: 9da73b497be5f91cb79f91e74d73eb0525be1c71
      fuel-createmirror_sha: 84d5e9721848e84d65001718037370e52d2a0987
      fuel-library_sha: bc18428f04dd64dd81a3070b3733111e5c278e04
      fuel-nailgun-agent_sha: 00b4b11553c250f22c0079fb74c8b782dcb7b740
      fuel-nailgun_sha: 4ab6b1f994846fae5b14bcd7f892a621d21132bb
      fuel-ostf_sha: 2ddb42865ca466a58d23e04713e2d79cc54070c6
      fuel-upgrade_sha: 1e894e26d4e1423a9b0d66abd6a79505f4175ff6
      fuelmain_sha: cfed10fd84dc95a645e8760a49646e2303ab5d16
      fuelmenu_sha: ed146b5c5974eb19d723b3a2784abdd574df5a5e
      network-checker_sha: 722a2a46503cffa82d78243d516bf762cd9715fc
      openstack_version: 2015.1.0-8.0
      production: docker
      python-fuelclient_sha: a3b4d6b395c8d23c04a94925006e742caf9ff7cd
      release: '8.0'
      shotgun_sha: 25dd78a3118267e3616df0727ce746e7dead2d67

Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :
Changed in fuel:
status: Incomplete → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: added: team-enhancements
Dmitry Pyzhov (dpyzhov)
tags: added: team-bugfix
removed: team-enhancements
Changed in fuel:
milestone: 8.0 → 9.0
status: Confirmed → New
Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

Looks like the status was changed by mistake. Reverting it back.

Changed in fuel:
status: New → Incomplete
status: Incomplete → Confirmed
status: Confirmed → Incomplete
no longer affects: fuel/future
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Ivan Ponomarev (ivanzipfer) wrote :

Can't reproduce on 8.0.

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Ivan Ponomarev (ivanzipfer)
Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Issue reproduced on the 8.0-478 ISO after the steps given by Dennis Dmitriev:
Scenario for deployed cluster:
1. Shut down the Nailgun primary controller node (find 'primary-controller' in `$ hiera roles` output on one of the controllers)
2. Wait until Nailgun marks the primary controller as offline and remove it from the cluster
3. Add another node to the cluster as the third controller
4. Deploy changes

Result:
-------------
Error
Deployment has failed. Method granular_deploy. 816deea0-4a4f-4f7f-a2be-50a18016fd1c: MCollective call failed in agent 'puppetd', method 'runonce', failed nodes:
ID: 20 - Reason: Lock file and PID file exist; puppet is running.

Please note that in our tests, after the second step, we wait for the node removal task to succeed.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
summary: - Waiting timeout was reached but it value was from previous deploy
+ Deployment failed with Lock file and PID file exist; puppet is running
+ if start new deployment right after offline node removal
description: updated
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

In the tests we only wait for the 'node_deletion' task to finish, but a re-deployment also starts, which we don't wait for.
When I tested it through the UI, the new deployment sometimes started with a big delay after the offline controller removal, so it's easy to make the mistake of starting a new deployment too early.

I think it should be absolutely clear from the UI that there are still running tasks.
So I'm lowering the priority to Medium, because it's a bad-UX bug that occurs only after tricky actions, i.e. adding/deleting a node immediately.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

As it's a Medium bug, it might be moved to 9.0.

description: updated
Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

Medium bugs are no longer fixed in 8.0.

Revision history for this message
Ivan Ponomarev (ivanzipfer) wrote :

The steps in the description were reproduced by the tests.
The tests do not wait for the end of the deployment after node deletion.
I tried the same steps via the API, and here you can see that Nailgun runs deploy tasks in parallel:

[root@nailgun ~]# fuel task
id | status | name | cluster | progress | uuid
---|---------|-------------------|---------|----------|-------------------------------------
11 | ready | create_stats_user | 1 | 100 | 88e8dac9-c6b7-4656-a429-0174233ca761
15 | ready | check_networks | 1 | 100 | d7f0e4ee-d288-4e22-aa10-bd50211d107b
13 | running | deployment | 1 | 51 | 8c9b2727-f106-4dd1-a2ff-0a35748d0582
12 | ready | node_deletion | 1 | 51 | 5713df93-c2ee-4f2e-842e-ddadea25ea44
19 | running | deployment | 1 | 0 | c1288d6d-c2a1-4e3f-be55-12b908d185c8
18 | running | provision | 1 | 40 | b5554c5f-bc72-49ee-9133-e70e4ef24772
14 | running | deploy | 1 | 48 | ff2e08af-0ebb-42b0-86f6-e5f34ef98b55
[root@nailgun ~]# fuel task
id | status | name | cluster | progress | uuid
---|---------|-------------------|---------|----------|-------------------------------------
11 | ready | create_stats_user | 1 | 100 | 88e8dac9-c6b7-4656-a429-0174233ca761
15 | ready | check_networks | 1 | 100 | d7f0e4ee-d288-4e22-aa10-bd50211d107b
18 | ready | provision | 1 | 100 | b5554c5f-bc72-49ee-9133-e70e4ef24772
13 | running | deployment | 1 | 11 | 8c9b2727-f106-4dd1-a2ff-0a35748d0582
12 | ready | node_deletion | 1 | 11 | 5713df93-c2ee-4f2e-842e-ddadea25ea44
19 | running | deployment | 1 | 11 | c1288d6d-c2a1-4e3f-be55-12b908d185c8
14 | running | deploy | 1 | 63 | ff2e08af-0ebb-42b0-86f6-e5f34ef98b55

I think the second deployment task starts counting its timeout immediately and does not wait for the previous task to end.

Revision history for this message
Ivan Ponomarev (ivanzipfer) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/274568

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (stable/8.0)

Related fix proposed to branch: stable/8.0
Review: https://review.openstack.org/274635

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/274568
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=ca98341935b45bdd9344e05011674180499a3919
Submitter: Jenkins
Branch: master

commit ca98341935b45bdd9344e05011674180499a3919
Author: asledzinskiy <email address hidden>
Date: Mon Feb 1 11:35:54 2016 +0200

    Add wait for deployment task after delete node

    - Add wait for deployment task to be finished
    after deleting contoller node from cluster

    Change-Id: Icba1241023c600a61e1b6688c2ad1c9c00e9d21b
    Related-Bug: #1496411

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (stable/8.0)

Reviewed: https://review.openstack.org/274635
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=b7000824e66e828a5511018df890364ec11f6c8d
Submitter: Jenkins
Branch: stable/8.0

commit b7000824e66e828a5511018df890364ec11f6c8d
Author: asledzinskiy <email address hidden>
Date: Mon Feb 1 11:35:54 2016 +0200

    Add wait for deployment task after delete node

    - Add wait for deployment task to be finished
    after deleting contoller node from cluster

    Change-Id: Icba1241023c600a61e1b6688c2ad1c9c00e9d21b
    Related-Bug: #1496411
