Adding a multirole "Controller+Cinder" node instead of single-role "Controller" and "Cinder" nodes leads to a deployment error

Bug #1425945 reported by okosse
Affects             Status       Importance  Assigned to               Milestone
Fuel for OpenStack  In Progress  High        Kamil Sambor
6.0.x               Won't Fix    Medium      Fuel Python (Deprecated)
6.1.x               In Progress  High        Kamil Sambor

Bug Description

Steps to reproduce:
1. Create HA cluster
2. Add 3 nodes with controller role
3. Deploy the cluster
4. Add a node with cinder role
5. Deploy the cluster
6. Remove a Cinder node
7. Remove a Controller node
8. Add a node with "Controller+Cinder" role
9. Deploy the cluster

The deployment fails with the error "Deployment has failed. Timeout of deployment is exceeded."

I reproduced this bug with vCenter and QEMU.
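
For reference, the same scenario can be driven through the Nailgun REST API instead of the UI (a minimal sketch against the 6.x API; the master IP, cluster id, node id, and the auth-less setup are assumptions):

import requests

NAILGUN = "http://10.20.0.2:8000"   # assumed master/Nailgun endpoint
CLUSTER_ID = 1                      # assumed environment id

def assign_roles(node_id, roles):
    # Queue the node for addition with the given pending roles;
    # nothing happens until the next "deploy changes" call.
    requests.put(NAILGUN + "/api/nodes", json=[{
        "id": node_id,
        "cluster_id": CLUSTER_ID,
        "pending_roles": roles,
        "pending_addition": True,
    }])

def deploy_changes():
    # Equivalent to pressing "Deploy changes" in the Fuel UI (step 9).
    requests.put("%s/api/clusters/%d/changes" % (NAILGUN, CLUSTER_ID))

# Step 8: one multirole node instead of separate controller and cinder nodes.
assign_roles(4, ["controller", "cinder"])
deploy_changes()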

--------iso version--------------
api: '1.0'
astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
auth_required: true
build_id: 2015-02-22_20-49-44
build_number: '102'
feature_groups:
- mirantis
fuellib_sha: 3a441bce1b525fe03ace45adfdda495cb64869d2
fuelmain_sha: a715aba8caf200390498a6bd29e6bffd9783f242
nailgun_sha: adbc77b3435d1484e39a46075bd5d558d7ec68b7
ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
production: docker
release: 6.0.1
release_versions:
  2014.2-6.0.1:
    VERSION:
      api: '1.0'
      astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
      build_id: 2015-02-22_20-49-44
      build_number: '102'
      feature_groups:
      - mirantis
      fuellib_sha: 3a441bce1b525fe03ace45adfdda495cb64869d2
      fuelmain_sha: a715aba8caf200390498a6bd29e6bffd9783f242
      nailgun_sha: adbc77b3435d1484e39a46075bd5d558d7ec68b7
      ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
      production: docker
      release: 6.0.1

Revision history for this message
okosse (okosse) wrote :
Changed in fuel:
milestone: none → 6.1
milestone: 6.1 → 6.0.1
okosse (okosse)
description: updated
Revision history for this message
Iryna Vovk (ivovk) wrote :

The same issue occurs on the 6.1 release. I hit this bug with vCenter.

--------iso version--------------
api: '1.0'
astute_sha: d81ff53c2f467151ecde120d3a4d284e3b5b3dfc
auth_required: true
build_id: 2015-02-22_22-54-44
build_number: '138'
feature_groups:
- mirantis
fuellib_sha: f5d713a3121fa971d63386f0d751a37dc58d061c
fuelmain_sha: b975019fabdb429c1869047df18dd792d2163ecc
nailgun_sha: 8a1e03b5863f4e91981278f154b088069415efae
ostf_sha: 1a0b2c6618fac098473c2ed5a9af11d3a886a3bb
production: docker
python-fuelclient_sha: 5657dbf06fddb74adb61e9668eb579a1c57d8af8
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: d81ff53c2f467151ecde120d3a4d284e3b5b3dfc
      build_id: 2015-02-22_22-54-44
      build_number: '138'
      feature_groups:
      - mirantis
      fuellib_sha: f5d713a3121fa971d63386f0d751a37dc58d061c
      fuelmain_sha: b975019fabdb429c1869047df18dd792d2163ecc
      nailgun_sha: 8a1e03b5863f4e91981278f154b088069415efae
      ostf_sha: 1a0b2c6618fac098473c2ed5a9af11d3a886a3bb
      production: docker
      python-fuelclient_sha: 5657dbf06fddb74adb61e9668eb579a1c57d8af8
      release: '6.1'

Iryna Vovk (ivovk)
Changed in fuel:
status: New → Confirmed
Revision history for this message
Igor Zinovik (izinovik) wrote :

According to the diagnostic snapshot, the mcollective agent on node-4 does not respond.

After Astute casts the 'deployment' task, we can see the following messages in /var/log/docker-logs/astute/astute.log on the master node:
info: [416] Processing RPC call 'deploy'
info: [416] 'deploy' method called with data: {"args"=>{"task_uuid"=>"8772292c-e33c-49a6-bcd9-f313a02636b7", "deployment_info"=>[{"management_interface"=
info: [416] Using Astute::DeploymentEngine::NailyFact for deployment.
info: [416] Deployment mode ha_compact
debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"4", :statuscode=>0, :statusmsg
debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"3", :statuscode=>0, :statusmsg
debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"5", :statuscode=>0, :statusmsg
debug: [416] Retry #1 to run mcollective agent on nodes: '1'
err: [416] MCollective agents '1' didn't respond within the allotted time.

err: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: cmd: ntpdate -u 0.ubuntu.pool.ntp.org
1.ubuntu.pool.ntp.org
2.ubuntu.pool.ntp.org
3.ubuntu.pool.ntp.org
ntp.ubuntu.com
                                               mcollective error: 8772292c-e33c-49a6-bcd9-f313a02636b7: MCollective agents '1' didn't respond within the allotted time.

debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"4", :statuscode=>0, :statusmsg
debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"3", :statuscode=>0, :statusmsg
debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"5", :statuscode=>0, :statusmsg
debug: [416] Retry #1 to run mcollective agent on nodes: '1'
err: [416] MCollective agents '1' didn't respond within the allotted time.
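
A quick way to confirm which node is unreachable over MCollective is to ping the agents directly from the master node (a sketch; it assumes the standard MCollective CLI is available in the master's environment):

import subprocess

# '1' is the node uid that failed to respond in the Astute log above;
# a missing reply here confirms the agent (or the node) is down.
subprocess.call(["mco", "ping", "-I", "1"])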

Anaconda's log on node-4 contains errors:
install/anaconda.log:2015-02-26T11:52:08.861766+00:00 warning: Error downloading http://10.108.0.2:8080/centos/x86_64//images/updates.img: HTTP response code said error
install/anaconda.log:2015-02-26T11:52:08.862137+00:00 warning: Error downloading http://10.108.0.2:8080/centos/x86_64//images/product.img: HTTP response code said error
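
The repository problem can be confirmed from the master node by probing the same URLs Anaconda fetches (a sketch; the IP and paths are taken from the log lines above):

import requests

BASE = "http://10.108.0.2:8080/centos/x86_64/images/"
for img in ("updates.img", "product.img"):
    r = requests.head(BASE + img)
    # Anaconda reported "HTTP response code said error" for these.
    print(BASE + img, r.status_code)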

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

QA team, please verify whether this issue affects the 6.1 release; it might already be resolved by the granular deployment tasks.

Changed in fuel:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/160415

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced with the same scenario.

I've added a system test to reproduce this bug: --group=ha_flat_addremove
The test can be fetched from the review: https://review.openstack.org/160415
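
For context, fuel-qa system tests are grouped with proboscis decorators, which is what makes them selectable via --group=ha_flat_addremove. A stripped-down skeleton of how such a group is declared (the real test lives in the review above; this stub is only illustrative):

from proboscis import test, TestProgram
from proboscis.asserts import assert_true

@test(groups=["ha_flat_addremove"])
def add_remove_multirole_node():
    # Placeholder for the scenario: deploy 3 controllers, add a cinder
    # node, remove cinder plus one controller, then add a node with
    # controller+cinder and redeploy.
    assert_true(True)

if __name__ == "__main__":
    TestProgram().run_and_exit()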

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/160415
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=ad95f2caabacfb95c11c47899c8eb18e41da2c7c
Submitter: Jenkins
Branch: master

commit ad95f2caabacfb95c11c47899c8eb18e41da2c7c
Author: Dennis Dmitriev <email address hidden>
Date: Tue Mar 10 14:10:07 2015 +0200

    Add ha_flat_addremove test group

    This test is based on the test case from https://review.openstack.org/#/c/158677
    After adding cinder, then removing cinder plus one controller and
    adding a new node with controller+cinder, this new node is only
    provisioned but is not deployed.

    Change-Id: I8ddaaa53d4acf8a7fe09b7722172f1922ea87873
    Related-Bug:#1425945

Dmitry Pyzhov (dpyzhov)
tags: added: module-tasks
Dima Shulyak (dshulyak)
tags: added: module-nailgun
removed: module-tasks
Revision history for this message
Dima Shulyak (dshulyak) wrote :

The remove_nodes message was generated:

2015-04-02 09:14:05.090 DEBUG [7feda6b54740] (__init__) RPC cast to orchestrator:
{
    "args": {
        "engine": {
            "url": "http://10.109.15.2:80/cobbler_api",
            "username": "cobbler",
            "password": "bCfNrsqz",
            "master_ip": "10.109.15.2"
        },
        "nodes": [
            {
                "mclient_remove": true,
                "slave_name": "node-5",
                "id": 5,
                "roles": [
                    "controller"
                ],
                "uid": 5
            },
            {
                "mclient_remove": true,
                "slave_name": "node-6",
                "id": 6,
                "roles": [
                    "cinder"
                ],
                "uid": 6
            }
        ],
        "task_uuid": "e8d7cd7a-34f3-4075-a5aa-50da796949e0"
    },
    "respond_to": "remove_nodes_resp",
    "method": "remove_nodes",
    "api_version": "1.0"
}

But the deployment of the new node failed with a deadlock error:

2015-04-02 09:14:09.436 ERROR [7feda6b54740] (manager) Traceback (most recent call last):
-----------------------------
  DBAPIError: (TransactionRollbackError) deadlock detected
DETAIL: Process 1318 waits for ShareLock on transaction 2867; blocked by process 1382.
Process 1382 waits for ShareLock on transaction 2865; blocked by process 1318.
HINT: See server log for query details.
 'UPDATE tasks SET cache=%(cache)s WHERE tasks.id = %(tasks_id)s' {'cache': '{"args": {"task_uuid": "949f9086-fc22-400f-88cf-acd036a94458", "deployment_info":

I see several problems:
1. The deadlock itself (see the sketch below for one way to avoid this class of deadlock).
2. Removal should be synchronous and should not be executed before deployment; otherwise there is a chance of a race condition where the old controller is removed while the new one is being added.
3. After we introduced async message generation, there is a possibility that the error message will be swallowed.
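
For problem 1, a common remedy is to acquire row locks in a deterministic order before updating, so that two transactions can never wait on each other. A hedged SQLAlchemy sketch of the idea (not the actual Nailgun fix; the model and function names are assumptions):

from sqlalchemy import Column, Integer, Text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Task(Base):
    __tablename__ = 'tasks'       # matches the table in the error above
    id = Column(Integer, primary_key=True)
    cache = Column(Text)

def update_task_caches(session, updates):
    """updates: {task_id: new_cache_json} for every task to touch."""
    # Lock the rows in ascending id order before writing; if every
    # writer acquires locks in the same order, the circular wait shown
    # in the traceback above cannot occur.
    tasks = (session.query(Task)
                    .filter(Task.id.in_(sorted(updates)))
                    .order_by(Task.id)
                    .with_for_update()
                    .all())
    for task in tasks:
        task.cache = updates[task.id]
    session.commit()

The usual complement to this ordering is retrying the whole transaction when the driver raises TransactionRollbackError.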

tags: added: tricky