Adding a multirole "Controller+Cinder" node instead of single-role "Controller" and "Cinder" nodes leads to a deployment error

Bug #1425945 reported by okosse
Affects             Status       Importance  Assigned to               Milestone
Fuel for OpenStack  In Progress  High        Kamil Sambor
6.0.x               Won't Fix    Medium      Fuel Python (Deprecated)
6.1.x               In Progress  High        Kamil Sambor

Bug Description

Steps to reproduce:
1. Create HA cluster
2. Add 3 nodes with controller role
3. Deploy the cluster
4. Add a node with cinder role
5. Deploy the cluster
6. Remove a Cinder node
7. Remove a Controller node
8. Add a node with "Controller+Cinder" role
9. Deploy the cluster

The deployment fails with the error "Deployment has failed. Timeout of deployment is exceeded."

I reproduced this bug with vCenter and QEMU.
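
For reference, the same scenario can be driven through the Nailgun REST API instead of the UI (a minimal sketch against the 6.x API; the master IP, cluster id, node id, and the auth-less setup are assumptions):

import requests

NAILGUN = "http://10.20.0.2:8000"   # assumed master/Nailgun endpoint
CLUSTER_ID = 1                      # assumed environment id

def assign_roles(node_id, roles):
    # Queue the node for addition with the given pending roles;
    # nothing happens until the next "deploy changes" call.
    requests.put(NAILGUN + "/api/nodes", json=[{
        "id": node_id,
        "cluster_id": CLUSTER_ID,
        "pending_roles": roles,
        "pending_addition": True,
    }])

def deploy_changes():
    # Equivalent to pressing "Deploy changes" in the Fuel UI (step 9).
    requests.put("%s/api/clusters/%d/changes" % (NAILGUN, CLUSTER_ID))

# Step 8: one multirole node instead of separate controller and cinder nodes.
assign_roles(4, ["controller", "cinder"])
deploy_changes()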

--------iso version--------------
api: '1.0'
astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
auth_required: true
build_id: 2015-02-22_20-49-44
build_number: '102'
feature_groups:
- mirantis
fuellib_sha: 3a441bce1b525fe03ace45adfdda495cb64869d2
fuelmain_sha: a715aba8caf200390498a6bd29e6bffd9783f242
nailgun_sha: adbc77b3435d1484e39a46075bd5d558d7ec68b7
ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
production: docker
release: 6.0.1
release_versions:
  2014.2-6.0.1:
    VERSION:
      api: '1.0'
      astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
      build_id: 2015-02-22_20-49-44
      build_number: '102'
      feature_groups:
      - mirantis
      fuellib_sha: 3a441bce1b525fe03ace45adfdda495cb64869d2
      fuelmain_sha: a715aba8caf200390498a6bd29e6bffd9783f242
      nailgun_sha: adbc77b3435d1484e39a46075bd5d558d7ec68b7
      ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
      production: docker
      release: 6.0.1

Revision history for this message
okosse (okosse) wrote :
Changed in fuel:
milestone: none → 6.1
milestone: 6.1 → 6.0.1
okosse (okosse)
description: updated
Revision history for this message
Iryna Vovk (ivovk) wrote :

The same issue occurs on the 6.1 release. I hit this bug with vCenter.

--------iso version--------------
api: '1.0'
astute_sha: d81ff53c2f467151ecde120d3a4d284e3b5b3dfc
auth_required: true
build_id: 2015-02-22_22-54-44
build_number: '138'
feature_groups:
- mirantis
fuellib_sha: f5d713a3121fa971d63386f0d751a37dc58d061c
fuelmain_sha: b975019fabdb429c1869047df18dd792d2163ecc
nailgun_sha: 8a1e03b5863f4e91981278f154b088069415efae
ostf_sha: 1a0b2c6618fac098473c2ed5a9af11d3a886a3bb
production: docker
python-fuelclient_sha: 5657dbf06fddb74adb61e9668eb579a1c57d8af8
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: d81ff53c2f467151ecde120d3a4d284e3b5b3dfc
      build_id: 2015-02-22_22-54-44
      build_number: '138'
      feature_groups:
      - mirantis
      fuellib_sha: f5d713a3121fa971d63386f0d751a37dc58d061c
      fuelmain_sha: b975019fabdb429c1869047df18dd792d2163ecc
      nailgun_sha: 8a1e03b5863f4e91981278f154b088069415efae
      ostf_sha: 1a0b2c6618fac098473c2ed5a9af11d3a886a3bb
      production: docker
      python-fuelclient_sha: 5657dbf06fddb74adb61e9668eb579a1c57d8af8
      release: '6.1'

Iryna Vovk (ivovk)
Changed in fuel:
status: New → Confirmed
Revision history for this message
Igor Zinovik (izinovik) wrote :

According to the diagnostic snapshot, the mcollective agent on node-4 does not respond.

After Astute casts the 'deployment' task, we can see the following messages in /var/log/docker-logs/astute/astute.log on the master node:
info: [416] Processing RPC call 'deploy'
info: [416] 'deploy' method called with data: {"args"=>{"task_uuid"=>"8772292c-e33c-49a6-bcd9-f313a02636b7", "deployment_info"=>[{"management_interface"=
info: [416] Using Astute::DeploymentEngine::NailyFact for deployment.
info: [416] Deployment mode ha_compact
debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"4", :statuscode=>0, :statusmsg
debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"3", :statuscode=>0, :statusmsg
debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"5", :statuscode=>0, :statusmsg
debug: [416] Retry #1 to run mcollective agent on nodes: '1'
err: [416] MCollective agents '1' didn't respond within the allotted time.

err: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: cmd: ntpdate -u 0.ubuntu.pool.ntp.org
1.ubuntu.pool.ntp.org
2.ubuntu.pool.ntp.org
3.ubuntu.pool.ntp.org
ntp.ubuntu.com
                                               mcollective error: 8772292c-e33c-49a6-bcd9-f313a02636b7: MCollective agents '1' didn't respond within the allotted time.

debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"4", :statuscode=>0, :statusmsg
debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"3", :statuscode=>0, :statusmsg
debug: [416] 8772292c-e33c-49a6-bcd9-f313a02636b7: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"5", :statuscode=>0, :statusmsg
debug: [416] Retry #1 to run mcollective agent on nodes: '1'
err: [416] MCollective agents '1' didn't respond within the allotted time.
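
A quick way to confirm which node is unreachable over MCollective is to ping the agents directly from the master node (a sketch; it assumes the standard MCollective CLI is available in the master's environment):

import subprocess

# '1' is the node uid that failed to respond in the Astute log above;
# a missing reply here confirms the agent (or the node) is down.
subprocess.call(["mco", "ping", "-I", "1"])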

Anaconda's log on node-4 contains errors:
install/anaconda.log:2015-02-26T11:52:08.861766+00:00 warning: Error downloading http://10.108.0.2:8080/centos/x86_64//images/updates.img: HTTP response code said error
install/anaconda.log:2015-02-26T11:52:08.862137+00:00 warning: Error downloading http://10.108.0.2:8080/centos/x86_64//images/product.img: HTTP response code said error
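
The repository problem can be confirmed from the master node by probing the same URLs Anaconda fetches (a sketch; the IP and paths are taken from the log lines above):

import requests

BASE = "http://10.108.0.2:8080/centos/x86_64/images/"
for img in ("updates.img", "product.img"):
    r = requests.head(BASE + img)
    # Anaconda reported "HTTP response code said error" for these.
    print(BASE + img, r.status_code)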

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

QA team, please verify whether this issue affects the 6.1 release; it might already be resolved by the granular deployment tasks.

Changed in fuel:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/160415

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced with the same scenario.

I've added a system test to reproduce this bug: --group=ha_flat_addremove
The test can be fetched from the review: https://review.openstack.org/160415
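
For context, fuel-qa system tests are grouped with proboscis decorators, which is what makes them selectable via --group=ha_flat_addremove. A stripped-down skeleton of how such a group is declared (the real test lives in the review above; this stub is only illustrative):

from proboscis import test, TestProgram
from proboscis.asserts import assert_true

@test(groups=["ha_flat_addremove"])
def add_remove_multirole_node():
    # Placeholder for the scenario: deploy 3 controllers, add a cinder
    # node, remove cinder plus one controller, then add a node with
    # controller+cinder and redeploy.
    assert_true(True)

if __name__ == "__main__":
    TestProgram().run_and_exit()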

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/160415
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=ad95f2caabacfb95c11c47899c8eb18e41da2c7c
Submitter: Jenkins
Branch: master

commit ad95f2caabacfb95c11c47899c8eb18e41da2c7c
Author: Dennis Dmitriev <email address hidden>
Date: Tue Mar 10 14:10:07 2015 +0200

    Add ha_flat_addremove test group

    This test is based on the test case from https://review.openstack.org/#/c/158677
    After adding cinder, then removing cinder plus one controller and
    adding a new node with controller+cinder, this new node is only
    provisioned but is not deployed.

    Change-Id: I8ddaaa53d4acf8a7fe09b7722172f1922ea87873
    Related-Bug:#1425945

Dmitry Pyzhov (dpyzhov)
tags: added: module-tasks
Dima Shulyak (dshulyak)
tags: added: module-nailgun
removed: module-tasks
Revision history for this message
Dima Shulyak (dshulyak) wrote :

The remove_nodes message was generated:

2015-04-02 09:14:05.090 DEBUG [7feda6b54740] (__init__) RPC cast to orchestrator:
{
    "args": {
        "engine": {
            "url": "http://10.109.15.2:80/cobbler_api",
            "username": "cobbler",
            "password": "bCfNrsqz",
            "master_ip": "10.109.15.2"
        },
        "nodes": [
            {
                "mclient_remove": true,
                "slave_name": "node-5",
                "id": 5,
                "roles": [
                    "controller"
                ],
                "uid": 5
            },
            {
                "mclient_remove": true,
                "slave_name": "node-6",
                "id": 6,
                "roles": [
                    "cinder"
                ],
                "uid": 6
            }
        ],
        "task_uuid": "e8d7cd7a-34f3-4075-a5aa-50da796949e0"
    },
    "respond_to": "remove_nodes_resp",
    "method": "remove_nodes",
    "api_version": "1.0"
}

But the deployment of the new node failed with a deadlock error:

2015-04-02 09:14:09.436 ERROR [7feda6b54740] (manager) Traceback (most recent call last):
-----------------------------
  DBAPIError: (TransactionRollbackError) deadlock detected
DETAIL: Process 1318 waits for ShareLock on transaction 2867; blocked by process 1382.
Process 1382 waits for ShareLock on transaction 2865; blocked by process 1318.
HINT: See server log for query details.
 'UPDATE tasks SET cache=%(cache)s WHERE tasks.id = %(tasks_id)s' {'cache': '{"args": {"task_uuid": "949f9086-fc22-400f-88cf-acd036a94458", "deployment_info":

I see several problems:
1. The deadlock itself (see the sketch below for one way to avoid this class of deadlock).
2. Removal should be synchronous and should not be executed before deployment; otherwise there is a chance of a race condition where the old controller is removed while the new one is being added.
3. After we introduced async message generation, there is a possibility that the error message will be swallowed.
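
For problem 1, a common remedy is to acquire row locks in a deterministic order before updating, so that two transactions can never wait on each other. A hedged SQLAlchemy sketch of the idea (not the actual Nailgun fix; the model and function names are assumptions):

from sqlalchemy import Column, Integer, Text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Task(Base):
    __tablename__ = 'tasks'       # matches the table in the error above
    id = Column(Integer, primary_key=True)
    cache = Column(Text)

def update_task_caches(session, updates):
    """updates: {task_id: new_cache_json} for every task to touch."""
    # Lock the rows in ascending id order before writing; if every
    # writer acquires locks in the same order, the circular wait shown
    # in the traceback above cannot occur.
    tasks = (session.query(Task)
                    .filter(Task.id.in_(sorted(updates)))
                    .order_by(Task.id)
                    .with_for_update()
                    .all())
    for task in tasks:
        task.cache = updates[task.id]
    session.commit()

The usual complement to this ordering is retrying the whole transaction when the driver raises TransactionRollbackError.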

tags: added: tricky