Deployment failed with "Lock file and PID file exist; puppet is running" if a new deployment is started right after offline node removal

Bug #1496411 reported by Alexander Kurenyshev
This bug affects 1 person
Affects              Status     Importance  Assigned to       Milestone
Fuel for OpenStack   Invalid    High        Ivan Ponomarev
7.0.x                Invalid    High        Ivan Ponomarev
8.0.x                Won't Fix  Medium      Ivan Ponomarev
Mitaka               Invalid    High        Ivan Ponomarev

Bug Description

Steps to reproduce:

1) Deploy 3 controllers and 1 compute
2) Shut down the primary controller node
3) Wait until Nailgun marks the primary controller as offline. Remove it from the cluster by clicking the 'Remove' button next to the offline node
4) Wait until the removal process has finished (the node is removed from the UI, no new deployment has been started)
5) Add another node to the cluster as the third controller
6) Deploy changes

Expected result:
Deployment passes. There are no errors in the logs.

Actual result:
Deployment fails with a timeout error, but Astute is still working and the deployment is actually in progress.

Fuel RC2 was used.

Logs are available at https://drive.google.com/file/d/0BzdDsIW-ymG2OUFXc3pzOG9ValE/view?usp=sharing
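
To confirm that Astute is still processing the deployment despite the timeout reported in the UI, one can check the task list and the Astute log on the master node. A rough sketch (the log path assumes the default Docker-based layout of the 7.0 master node):

[root@nailgun ~]# fuel task                                        # the 'deployment' task should still be in 'running' status
[root@nailgun ~]# tail -f /var/log/docker-logs/astute/astute.log   # Astute keeps reporting progress for the deployment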

Revision history for this message
Andrey Maximov (maximov) wrote :

What about a workaround? If you repeat the deployment, will it work?

Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

Actually I don't know; the environment doesn't exist anymore. Let me try to reproduce it.

Revision history for this message
Andrey Maximov (maximov) wrote :

Setting to Incomplete, as we need to know the reproduction rate to set the priority correctly.
If it is reproduced often, this is critical.

Changed in fuel:
status: New → Incomplete
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Marking it as High, because as far as I know it has been reproduced only once.

We investigated this env with Vitaly Kramskikh: it looks like the env contains several deployment tasks. The UI used the first of them, which failed. Astute and Nailgun report a ready status, but it does not affect the UI or the cluster status.

We need to try to reproduce it.

Changed in fuel:
importance: Critical → High
Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

When I was trying to reproduce this bug I got an error again, but not the same one.
This time the error is:
Deployment has failed. Method granular_deploy. 43fe926e-e21d-4646-837e-02be274be575: MCollective call failed in agent 'puppetd', method 'runonce', failed nodes:
ID: 2 - Reason: Lock file and PID file exist; puppet is running.

V. Sharshov and I decided to attach this error to this bug.
Logs are here https://drive.google.com/file/d/0BzdDsIW-ymG2R2xNRW5pNERPRUk/view?usp=sharing
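
For reference, on the failed node one can check whether a Puppet run is actually still in progress or whether a stale lock was left behind. A sketch (the lock and PID file locations assume the stock Puppet 3 paths):

# on the node reported in the error (ID 2 above)
ls -l /var/lib/puppet/state/agent_catalog_run.lock    # lock file the error message refers to
cat /var/run/puppet/agent.pid                          # PID recorded by the agent
ps -fp "$(cat /var/run/puppet/agent.pid)"              # check whether that PID is really a running puppet process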

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)
tags: added: module-astute
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on the following configuration:

Network: Neutron/Tun, tagged networks;
Volumes: Ceph RBD for ephemeral volumes and disabled Cinder LVM
Images: Swift
Nodes:
3 Controllers,
1 Compute,
3 Ceph

Scenario for deployed cluster:
1. Shut down the Nailgun primary controller node (find 'primary-controller' in `$ hiera roles` output on one of the controllers)
2. Wait until Nailgun marks the primary controller as offline and remove it from the cluster
3. Add another node to the cluster as the third controller
4. Deploy changes

Result:
-------------
Error
Deployment has failed. Method granular_deploy. 816deea0-4a4f-4f7f-a2be-50a18016fd1c: MCollective call failed in agent 'puppetd', method 'runonce', failed nodes:
ID: 20 - Reason: Lock file and PID file exist; puppet is running.
.

This is caused by the fact that when an offline node is removed from the cluster, a 'hidden' deployment is started that performs some actions on the remaining controllers.

If step #3 is performed without waiting for this 'hidden' action to finish, then the error "Lock file and PID file exist; puppet is running" appears on the dashboard, but the 'hidden' action continues; see the attached screenshot.

If we wait for the 'hidden' action to finish after step #2, then the deployment to add the third controller fails with a timeout, the same as in the bug description.
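
Until this is fixed, a possible way to avoid the lock-file failure is to wait for the 'hidden' deployment to finish before adding the new controller, e.g. by polling the task list on the master node. A minimal sketch based on the `fuel task` output format shown later in this report (column positions may differ between releases):

# wait until no task in the cluster is still running before clicking 'Deploy changes'
while fuel task | awk -F'|' '{print $2}' | grep -q running; do
    echo "a background deployment is still running, waiting..."
    sleep 30
done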

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Moving to 8.0 because of HCF in the 7.0 release.

Changed in fuel:
milestone: 7.0 → 8.0
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

The root cause is still unclear. We will not deliver this in 7.0-mu1.

Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

QA, please verify whether this bug is still reproducible.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

Vova, yes, it's still reproducible, with the same steps as in the bug description.

Fuel used:
  2015.1.0-8.0:
    VERSION:
      api: '1.0'
      astute_sha: 959b06c5ef8143125efd1727d350c050a922eb12
      build_id: '116'
      build_number: '116'
      feature_groups:
      - mirantis
      fuel-agent_sha: 9da73b497be5f91cb79f91e74d73eb0525be1c71
      fuel-createmirror_sha: 84d5e9721848e84d65001718037370e52d2a0987
      fuel-library_sha: bc18428f04dd64dd81a3070b3733111e5c278e04
      fuel-nailgun-agent_sha: 00b4b11553c250f22c0079fb74c8b782dcb7b740
      fuel-nailgun_sha: 4ab6b1f994846fae5b14bcd7f892a621d21132bb
      fuel-ostf_sha: 2ddb42865ca466a58d23e04713e2d79cc54070c6
      fuel-upgrade_sha: 1e894e26d4e1423a9b0d66abd6a79505f4175ff6
      fuelmain_sha: cfed10fd84dc95a645e8760a49646e2303ab5d16
      fuelmenu_sha: ed146b5c5974eb19d723b3a2784abdd574df5a5e
      network-checker_sha: 722a2a46503cffa82d78243d516bf762cd9715fc
      openstack_version: 2015.1.0-8.0
      production: docker
      python-fuelclient_sha: a3b4d6b395c8d23c04a94925006e742caf9ff7cd
      release: '8.0'
      shotgun_sha: 25dd78a3118267e3616df0727ce746e7dead2d67

Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :
Changed in fuel:
status: Incomplete → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: added: team-enhancements
Dmitry Pyzhov (dpyzhov)
tags: added: team-bugfix
removed: team-enhancements
Changed in fuel:
milestone: 8.0 → 9.0
status: Confirmed → New
Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

Looks like the status was changed by mistake. Reverting it back.

Changed in fuel:
status: New → Incomplete
status: Incomplete → Confirmed
status: Confirmed → Incomplete
no longer affects: fuel/future
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Ivan Ponomarev (ivanzipfer) wrote :

Can't reproduce on 8.0.

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Ivan Ponomarev (ivanzipfer)
Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Issue reproduced on the 8.0-478 ISO after the steps given by Dennis Dmitriev:
Scenario for deployed cluster:
1. Shut down the Nailgun primary controller node (find 'primary-controller' in `$ hiera roles` output on one of the controllers)
2. Wait until Nailgun marks the primary controller as offline and remove it from the cluster
3. Add another node to the cluster as the third controller
4. Deploy changes

Result:
-------------
Error
Deployment has failed. Method granular_deploy. 816deea0-4a4f-4f7f-a2be-50a18016fd1c: MCollective call failed in agent 'puppetd', method 'runonce', failed nodes:
ID: 20 - Reason: Lock file and PID file exist; puppet is running.

Please note that in our tests, after the second step, we wait for the node removal task to succeed.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
summary: - Waiting timeout was reached but it value was from previous deploy
+ Deployment failed with Lock file and PID file exist; puppet is running
+ if start new deployment right after offline node removal
description: updated
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

In the tests we only wait for the 'node_deletion' task to finish, but a re-deployment also starts, which we don't wait for.
When I tested it through the UI, the new deployment sometimes started with a big delay after the offline controller removal, so it's easy to make the mistake of starting a new deployment too early.

I think it should be absolutely clear from the UI that there are still running tasks.
So I'm lowering the priority to Medium, because it's a bad-UX bug that occurs only after tricky actions, i.e. adding/deleting a node immediately.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

As it's a Medium bug, it might be moved to 9.0.

description: updated
Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

Medium bugs are no longer fixed in 8.0.

Revision history for this message
Ivan Ponomarev (ivanzipfer) wrote :

The steps in the description were reproduced by the tests.
The tests do not wait for the end of the deployment after node deletion.
I tried the same steps via the API, and here you can see that Nailgun runs deploy tasks in parallel:

[root@nailgun ~]# fuel task
id | status | name | cluster | progress | uuid
---|---------|-------------------|---------|----------|-------------------------------------
11 | ready | create_stats_user | 1 | 100 | 88e8dac9-c6b7-4656-a429-0174233ca761
15 | ready | check_networks | 1 | 100 | d7f0e4ee-d288-4e22-aa10-bd50211d107b
13 | running | deployment | 1 | 51 | 8c9b2727-f106-4dd1-a2ff-0a35748d0582
12 | ready | node_deletion | 1 | 51 | 5713df93-c2ee-4f2e-842e-ddadea25ea44
19 | running | deployment | 1 | 0 | c1288d6d-c2a1-4e3f-be55-12b908d185c8
18 | running | provision | 1 | 40 | b5554c5f-bc72-49ee-9133-e70e4ef24772
14 | running | deploy | 1 | 48 | ff2e08af-0ebb-42b0-86f6-e5f34ef98b55
[root@nailgun ~]# fuel task
id | status | name | cluster | progress | uuid
---|---------|-------------------|---------|----------|-------------------------------------
11 | ready | create_stats_user | 1 | 100 | 88e8dac9-c6b7-4656-a429-0174233ca761
15 | ready | check_networks | 1 | 100 | d7f0e4ee-d288-4e22-aa10-bd50211d107b
18 | ready | provision | 1 | 100 | b5554c5f-bc72-49ee-9133-e70e4ef24772
13 | running | deployment | 1 | 11 | 8c9b2727-f106-4dd1-a2ff-0a35748d0582
12 | ready | node_deletion | 1 | 11 | 5713df93-c2ee-4f2e-842e-ddadea25ea44
19 | running | deployment | 1 | 11 | c1288d6d-c2a1-4e3f-be55-12b908d185c8
14 | running | deploy | 1 | 63 | ff2e08af-0ebb-42b0-86f6-e5f34ef98b55

I think the second deployment task starts counting its timeout immediately and does not wait for the previous task to end.

Revision history for this message
Ivan Ponomarev (ivanzipfer) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/274568

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (stable/8.0)

Related fix proposed to branch: stable/8.0
Review: https://review.openstack.org/274635

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/274568
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=ca98341935b45bdd9344e05011674180499a3919
Submitter: Jenkins
Branch: master

commit ca98341935b45bdd9344e05011674180499a3919
Author: asledzinskiy <email address hidden>
Date: Mon Feb 1 11:35:54 2016 +0200

    Add wait for deployment task after delete node

    - Add wait for deployment task to be finished
    after deleting contoller node from cluster

    Change-Id: Icba1241023c600a61e1b6688c2ad1c9c00e9d21b
    Related-Bug: #1496411

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (stable/8.0)

Reviewed: https://review.openstack.org/274635
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=b7000824e66e828a5511018df890364ec11f6c8d
Submitter: Jenkins
Branch: stable/8.0

commit b7000824e66e828a5511018df890364ec11f6c8d
Author: asledzinskiy <email address hidden>
Date: Mon Feb 1 11:35:54 2016 +0200

    Add wait for deployment task after delete node

    - Add wait for deployment task to be finished
    after deleting contoller node from cluster

    Change-Id: Icba1241023c600a61e1b6688c2ad1c9c00e9d21b
    Related-Bug: #1496411
