Node deletion from an HA Ubuntu cluster failed with "MCollective is not running on nodes: 1. MCollective must be running to properly delete a node."

Bug #1458623 reported by Andrey Sledzinskiy
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Vladimir Sharshov

Bug Description

{
    "build_id": "2015-05-24_15-51-50",
    "build_number": "462",
    "release_versions": {
        "2014.2.2-6.1": {
            "VERSION": {
                "build_id": "2015-05-24_15-51-50",
                "build_number": "462",
                "api": "1.0",
                "fuel-library_sha": "889c2534ceadf8afd5d1540c1cadbd913c0c8c14",
                "nailgun_sha": "76441596e4fe6420cc7819427662fa244e150177",
                "feature_groups": ["mirantis"],
                "openstack_version": "2014.2.2-6.1",
                "production": "docker",
                "python-fuelclient_sha": "e19f1b65792f84c4a18b5a9473f85ef3ba172fce",
                "astute_sha": "0bd72c72369e743376864e8e8dabfe873d40450a",
                "fuel-ostf_sha": "9a5f55602c260d6c840c8333d8f32ec8cfa65c1f",
                "release": "6.1",
                "fuelmain_sha": "5c8ebddf64ea93000af2de3ccdb4aa8bb766ce93"
            }
        }
    },
    "auth_required": true,
    "api": "1.0",
    "fuel-library_sha": "889c2534ceadf8afd5d1540c1cadbd913c0c8c14",
    "nailgun_sha": "76441596e4fe6420cc7819427662fa244e150177",
    "feature_groups": ["mirantis"],
    "openstack_version": "2014.2.2-6.1",
    "production": "docker",
    "python-fuelclient_sha": "e19f1b65792f84c4a18b5a9473f85ef3ba172fce",
    "astute_sha": "0bd72c72369e743376864e8e8dabfe873d40450a",
    "fuel-ostf_sha": "9a5f55602c260d6c840c8333d8f32ec8cfa65c1f",
    "release": "6.1",
    "fuelmain_sha": "5c8ebddf64ea93000af2de3ccdb4aa8bb766ce93"
}

Steps:
1. Create a new cluster: HA, Ubuntu, flat nova-network, 1 controller, 1 compute
2. Deploy it
3. After deployment, delete the compute node
4. Start re-deployment

Actual result: deletion failed with "MCollective is not running on nodes: 1. MCollective must be running to properly delete a node."

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
assignee: nobody → Fuel Astute Team (fuel-astute)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I investigated the environment and observed strange behavior.

mco ping

[root@nailgun ~]# mco ping
master time=49.07 ms
3 time=54.27 ms

Both deployed nodes (1 and 2) are offline; node 3 is at the bootstrap stage.

After restarting the mcollective service on node 1, the node is back online.

[root@nailgun ~]# mco ping
1 time=57.20 ms
master time=58.32 ms
3 time=59.49 ms
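
A minimal sketch of that manual workaround, assuming the agent is registered under the standard "mcollective" service name on the node (illustrative, not part of the eventual fix):

# on the affected node (node 1), restart the stuck agent
service mcollective restart
# back on the Fuel master, confirm the node answers again
mco ping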

Changed in fuel:
status: New → Confirmed
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
tags: added: mcollective module-astute
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I think it can be connected with these changes applied earlier: https://bugs.launchpad.net/fuel/+bug/1454741

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Vladimir,

> I think it can be connected with these changes applied earlier: https://bugs.launchpad.net/fuel/+bug/1454741

Could you please clarify what exactly is wrong with the mcollective init script (and/or config file)?

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

After the nodes are recovered from a snapshot, mcollective does not reconnect to the server. If we restart it, it works as before.

I suggest reverting the env [0] and checking it there.

http://jenkins-product.srt.mirantis.net:8080/job/6.1.system_test.ubuntu.thread_2/139/testReport/junit/(root)/ha_one_controller_flat_node_deletion/ha_one_controller_flat_node_deletion/

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I've checked the logs: in the env after the snapshot revert there are no new records in the mcollective logs. It looks like the process simply froze. I think we need to investigate this behavior.

I am not sure that the mcollective init script is the source of the problem; it is just the last major change I know of for the mcollective service.

Revision history for this message
Mike Scherbakov (mihgen) wrote :

Folks, if we can't repro this manually without snapshot reverts, I think we should tag it as non-release.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Mike, I am not sure that the snapshot/revert itself causes this error (we previously had a passing status for these tests with snapshot/revert: http://jenkins-product.srt.mirantis.net:8080/job/6.1.system_test.ubuntu.thread_2/118/). It seems the QA team should manually check two cases: (1) deploy env / delete node / delete env, and (2) deploy env / reboot slaves / delete node (to exclude a problem with the init scripts). We also see the same issue for node addition: after redeployment, only the newly added node answers mco ping.

Another interesting point is that on CentOS all of these tests pass (http://jenkins-product.srt.mirantis.net:8080/view/6.1_swarm/job/6.1.system_test.centos.thread_2/136/testReport/), so the issue reproduces only on Ubuntu.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Guys, it looks like we have some unexpected data in server.cfg related to plugin.activemq. We do not expect this data in the config at all. This data also affects mcollective behavior: it tries to use activemq, and after that there are no log records about the heartbeat.

I think we have a good chance of solving this problem if we remove this section from server.cfg on provisioned nodes (a cleanup sketch follows the connection log below).

On the bootstrap node we use this section:

connector = rabbitmq
plugin.rabbitmq.vhost = mcollective
plugin.rabbitmq.pool.size = 1
plugin.rabbitmq.pool.1.host = 10.109.0.2
plugin.rabbitmq.pool.1.port = 61613
plugin.rabbitmq.pool.1.user = mcollective
plugin.rabbitmq.pool.1.password = HPkTOIK3
plugin.rabbitmq.heartbeat_interval = 30

But in the provisioned env:

connector = rabbitmq
plugin.activemq.pool.size = 1
plugin.activemq.pool.1.host = stomp1
plugin.activemq.pool.1.port = 6163
plugin.activemq.pool.1.user = mcollective
plugin.activemq.pool.1.password = marionette

# Facts
factsource = yaml
plugin.yaml = /etc/mcollective/facts.yaml
plugin.rabbitmq.pool.1.password = HPkTOIK3
plugin.rabbitmq.heartbeat_interval = 30
plugin.rabbitmq.vhost = mcollective
direct_addressing = 1
plugin.rabbitmq.pool.1.host = 10.109.0.2
plugin.rabbitmq.pool.1.port = 61613
ttl = 4294957
plugin.rabbitmq.pool.size = 1
plugin.rabbitmq.pool.1.user = mcollective

Connection log:

2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.586827 #7605] INFO -- : mcollectived:31:in `<main>' The Marionette Collective @DEVELOPMENT_VERSION@ started logging at info level
2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.616589 #7616] INFO -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 0 to stomp://mcollective@stomp1:6163
2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.641641 #7616] INFO -- : activemq.rb:111:in `on_connectfail' TCP Connection to stomp://mcollective@stomp1:6163 failed on attempt 0
2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.651914 #7616] INFO -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 1 to stomp://mcollective@stomp1:6163
2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.658442 #7616] INFO -- : activemq.rb:111:in `on_connectfail' TCP Connection to stomp://mcollective@stomp1:6163 failed on attempt 1
2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.678674 #7616] INFO -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 2 to stomp://mcollective@stomp1:6163
2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.680422 #7616] INFO -- : activemq.rb:111:in `on_connectfail' TCP Connection to stomp://mcollective@stomp1:6163 failed on attempt 2
2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.720662 #7616] INFO -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 3 to stomp://mcollective@stomp1:6163
2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.723436 #7616] INFO -- : activemq.rb:111:in `on_connectfail' TCP Connection to stomp://mcollective@stomp1:6163 failed on attempt 3
2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.803709 #7616] INFO -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 4 to stomp://mcollective@stomp1:61...
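
A minimal cleanup sketch for the suggestion above, assuming the stale keys live in /etc/mcollective/server.cfg on the provisioned node and that removing them plus a restart is enough (illustrative only; this is not the fix that was eventually merged):

# drop the leftover activemq plugin settings and restart the agent
sed -i '/^plugin\.activemq\./d' /etc/mcollective/server.cfg
service mcollective restart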


Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Fuel OSCI Team (fuel-osci)
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

OSCI team has nothing to do with the bug

Changed in fuel:
assignee: Fuel OSCI Team (fuel-osci) → nobody
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

The default server.cfg shipped with the mcollective package is:

main_collective = mcollective
collectives = mcollective
libdir = /usr/share/mcollective/plugins
logfile = /var/log/mcollective.log
loglevel = info
daemonize = 1

# Plugins
securityprovider = psk
plugin.psk = unset

connector = activemq
plugin.activemq.pool.size = 1
plugin.activemq.pool.1.host = stomp1
plugin.activemq.pool.1.port = 6163
plugin.activemq.pool.1.user = mcollective
plugin.activemq.pool.1.password = marionette

# Facts
factsource = yaml
plugin.yaml = /etc/mcollective/facts.yaml

cloud-init *and* nailgun-agent are supposed to replace the default configuration with a correct one.
I guess there's a race between them, so server.cfg gets corrupted.
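
One generic way to make such concurrent rewrites safe (purely illustrative, not what either tool actually does) is to write the new config to a temporary file and atomically rename it into place, so a reader never sees a half-written server.cfg:

# write the full config to a temp file on the same filesystem...
TMP=$(mktemp /etc/mcollective/server.cfg.XXXXXX)
cat > "$TMP" <<'EOF'
connector = rabbitmq
plugin.rabbitmq.vhost = mcollective
# ... remaining settings ...
EOF
# ...then rename it into place; the swap is atomic, so a concurrent reader
# sees either the old file or the new one, never a mix of the two
mv "$TMP" /etc/mcollective/server.cfg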

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Although the server.cfg looks funny, it's correct:

connector = rabbitmq
factsource = yaml
plugin.yaml = /etc/mcollective/facts.yaml
plugin.rabbitmq.pool.1.password = HPkTOIK3
plugin.rabbitmq.heartbeat_interval = 30
plugin.rabbitmq.vhost = mcollective
plugin.rabbitmq.pool.1.host = 10.109.0.2
plugin.rabbitmq.pool.1.port = 61613
plugin.rabbitmq.pool.size = 1
plugin.rabbitmq.pool.1.user = mcollective

#7616] INFO -- : activemq.rb:111:in `on_connectfail' TCP Connection to stomp://mcollective@stomp1:6163 failed on attempt 1
2015-05-25T00:57:33.579989+00:00 debug: I, [2015-05-25T00:55:59.678674 #7616] INFO -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 2 to stomp://mcollective@stomp1:6163

Could you please post (or attach) the complete mcollectived log? A few such messages are OK
(upstart launched mcollectived before cloud-init had a chance to create a correct configuration).
We need to find out whether mcollectived ever tried to connect to rabbitmq.
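
A rough way to answer that from the attached log (illustrative; the log path is taken from the default config quoted above):

# did mcollectived ever try the rabbitmq endpoint (10.109.0.2:61613)?
grep -ciE 'rabbitmq|61613' /var/log/mcollective.log
# or is it still looping on the stale activemq target from the packaged default?
grep -c 'stomp1:6163' /var/log/mcollective.log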

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

In order to avoid the above confusion, one can disable mcollectived by default (when building the IBP image) and have cloud-init enable/start it after writing a proper server.cfg.
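
One way this could be expressed on Ubuntu 14.04 (upstart), sketched here as an assumption rather than the actual fuel-web change, is to mark the job as manual inside the image so it does not autostart:

# inside the image build chroot: keep the upstart job from starting at boot
echo manual > /etc/init/mcollective.override
# whatever writes the real server.cfg (cloud-init here) would later remove
# the override and start the service explicitly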

Changed in fuel:
assignee: nobody → Fuel provisioning team (fuel-provisioning)
Revision history for this message
Egor Kotko (ykotko) wrote :

{"build_id": "2015-05-24_15-51-50", "build_number": "462", "release_versions": {"2014.2.2-6.1": {"VERSION": {"build_id": "2015-05-24_15-51-50", "build_number": "462", "api": "1.0", "fuel-library_sha": "889c2534ceadf8afd5d1540c1cadbd913c0c8c14", "nailgun_sha": "76441596e4fe6420cc7819427662fa244e150177", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "e19f1b65792f84c4a18b5a9473f85ef3ba172fce", "astute_sha": "0bd72c72369e743376864e8e8dabfe873d40450a", "fuel-ostf_sha": "9a5f55602c260d6c840c8333d8f32ec8cfa65c1f", "release": "6.1", "fuelmain_sha": "5c8ebddf64ea93000af2de3ccdb4aa8bb766ce93"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "889c2534ceadf8afd5d1540c1cadbd913c0c8c14", "nailgun_sha": "76441596e4fe6420cc7819427662fa244e150177", "feature_groups": ["mirantis", "experimental"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "e19f1b65792f84c4a18b5a9473f85ef3ba172fce", "astute_sha": "0bd72c72369e743376864e8e8dabfe873d40450a", "fuel-ostf_sha": "9a5f55602c260d6c840c8333d8f32ec8cfa65c1f", "release": "6.1", "fuelmain_sha": "5c8ebddf64ea93000af2de3ccdb4aa8bb766ce93"}

I could reproduce this with the following steps:

Steps:
1. Create a new cluster: HA, Ubuntu, Neutron VLAN, 1 controller, 1 compute
2. Deploy it
3. After deployment delete compute node
4. Start re-deployment
5. Add 2 controllers but do not deploy the cluster
6. Start Network Verification

Actual result:
Verification failed.
Method verify_networks. Network verification not avaliable because nodes ["1"] not avaliable via mcollective. Inspect Astute logs for the details

Revision history for this message
Egor Kotko (ykotko) wrote :
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Egor,

I think you've experienced a different problem. Please file a new bug.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Egor,

I've checked the logs and the env. It looks like this is a problem with the bootstrap node, which was restored from a snapshot. I restarted the node and that helped.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/186000

Changed in fuel:
assignee: Fuel provisioning team (fuel-provisioning) → Aleksandr Gordeev (a-gordeev)
status: Confirmed → In Progress
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Guys, we have a regression. For Ubuntu, instead of mcollective 2.3.3, which we have used for both Ubuntu and CentOS since December 2013, we now have 2.3.1, which has a problem with the heartbeat.

I think that if we return to the previous version (http://fuel-repository.mirantis.com/fwm/6.0/ubuntu/pool/main/mcollective_2.3.3-ubuntu_all.deb) instead of http://fuel-repository.mirantis.com/fwm/6.1/ubuntu/pool/main/m/mcollective/mcollective_2.3.1-0u~u14.04%2bmos2_all.deb, we will solve this problem.
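
To check which build a node actually carries, and what a manual switch to the 6.0 package could look like (hedged: the exact version string is inferred from the .deb file name above and may not match the repo metadata):

# show the installed mcollective version on the node
dpkg -l mcollective
# hypothetical manual switch to the 2.3.3 build published in the 6.0 repo
apt-get install mcollective=2.3.3-ubuntu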

Changed in fuel:
assignee: Aleksandr Gordeev (a-gordeev) → MOS Linux (mos-linux)
Revision history for this message
Aleksander Mogylchenko (amogylchenko) wrote :

It is not a regression; 2.3.1 was taken from the 6.0 repos:
https://review.fuel-infra.org/gitweb?p=packages%2Fprecise%2Fmcollective.git;a=shortlog;h=refs%2Fheads%2F6.0

@osci, could you please sync the repos so that they are consistent with our releases (e.g. so that I could go to the 6.0 repos and find the software present on the 6.0 ISO)?

Changed in fuel:
assignee: MOS Linux (mos-linux) → Fuel OSCI Team (fuel-osci)
Changed in fuel:
assignee: Fuel OSCI Team (fuel-osci) → Aleksandr Gordeev (a-gordeev)
Revision history for this message
Roman Vyalov (r0mikiam) wrote :

@Alexander - Done.

Changed in fuel:
assignee: Aleksandr Gordeev (a-gordeev) → MOS Linux (mos-linux)
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/mcollective (6.1)

Fix proposed to branch: 6.1
Change author: Aleksandr Mogylchenko <email address hidden>
Review: https://review.fuel-infra.org/7111

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Guys, we have regression. For Ubuntu instead of using Mcollective version 2.3.3 which we use for Ubuntu and CentOS since December 2013 now we have 2.3.1 which have problem with heartbeat.

Back in December 2014 I re-packaged this version for Ubuntu 14.04:
https://review.fuel-infra.org/gitweb?p=packages/precise/mcollective.git;a=commit;h=781f13c0a1129de89dfc51711d06b36e51490e28
Nobody bothered to propagate the 2.3.3 update to packages/trusty/mcollective.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/mcollective (6.1)

Reviewed: https://review.fuel-infra.org/7111
Submitter: Aleksandr Mogylchenko <email address hidden>
Branch: 6.1

Commit: 65890025e579a35c3d4c48f1570f59d4dd36f8ff
Author: Aleksandr Mogylchenko <email address hidden>
Date: Thu May 28 02:16:33 2015

Update mcollective to 2.3.3 for MOS 6.1

Some background around this patch:
- mos6.0 was released with mcl 2.3.3, but the repository was not properly
  updated (it has 2.3.1);
- that causes the regression, resulting in mcl 2.3.1 being initially imported
  to 6.1;
- some work was done on mcl 2.3.1, mainly to fix init scripts (see
  https://review.fuel-infra.org/gitweb?p=packages/trusty/mcollective.git;a=commit;h=176926bf9e3ff5de2ad044776820c064b933b80c
  and
  https://review.fuel-infra.org/gitweb?p=packages/trusty/mcollective.git;a=commit;h=d3cac30b312dd7c6f5425be4a4bdb0d30d52ceed)

So this package contains mcl 2.3.3 with an updated SysV init script.

Partial-Bug: #1458623
Change-Id: Ifbff0195e36aa1cdfa8610bb578c9b87be9717b0

Revision history for this message
Aleksander Mogylchenko (amogylchenko) wrote :

We have merged the 2.3.3 package. Moving the bug back, as I'm not sure whether this is everything required to fix the problem.

Changed in fuel:
assignee: MOS Linux (mos-linux) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Thanks! This is all that we need to fix this problem.

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/186000
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=d37f8e09bd4ac2bda6ffc8b7c0fd28dafea34f4b
Submitter: Jenkins
Branch: master

commit d37f8e09bd4ac2bda6ffc8b7c0fd28dafea34f4b
Author: Alexander Gordeev <email address hidden>
Date: Wed May 27 16:37:47 2015 +0300

    IBP: disable mcollective automatically starting for ubuntu

    By default mcollective will be started automatically.
    If it will be started prior cloud-init and nailgun-agent,
    multiple instances of mcollective service could be started with
    different server.cfg settings applied.

    This patch disables automatically starting of mcollective during
    the process of image building. Later, cloud-init will remove this
    lock to allow mcollective to be started automatically.

    Change-Id: Ibe5c41b1bb5bbf395fff14da281c4b08701c01b8
    Related-Bug: #1458623
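
The cloud-init side of that hand-off would effectively amount to something like the following once server.cfg is in place (a sketch under the assumption that the "lock" mentioned in the commit message is an upstart override file; the actual template in fuel-web may differ):

# remove the autostart lock created at image-build time and start the agent
rm -f /etc/init/mcollective.override
service mcollective start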
