Nodes stuck in "provisioning" state despite successful reboot into newly installed OS

Bug #1455489 reported by Vladimir Kozhukalov on 2015-05-15
Affects              Importance  Assigned to
Fuel for OpenStack   High        Georgy Kibardin
  6.1.x              High        Sergii Rizvan
  7.0.x              High        Sergii Rizvan
  8.0.x              High        Sergii Rizvan
  Mitaka             High        Georgy Kibardin

Bug Description

According to the output of `fuel nodes', almost half of the nodes are stuck in the `provisioning' state.
In fact, however, those nodes successfully rebooted into the newly installed OS a long time ago (~20--30 minutes earlier).

The problem is that the mcollective service uses the wrong config file, which specifies
activemq as the messaging backend. As a result, the node can't notify Fuel that provisioning has
successfully completed. /var/log/mcollective.log on such a node looks like:

# Logfile created on 2016-05-24 11:51:19 +0000 by logger.rb/31641
I, [2016-05-24T11:51:19.708422 #7999] INFO -- : config.rb:146:in `loadconfig' PLMC1: The Marionette Collective version 2.3.3 started by /usr/sbin/mcollectived using config file /etc/mcollective/server.cfg
I, [2016-05-24T11:51:19.708508 #7999] INFO -- : mcollectived:35:in `<main>' The Marionette Collective 2.3.3 started logging at info level
W, [2016-05-24T11:51:19.709309 #7999] WARN -- : activemq.rb:274:in `connection_headers' Connecting without STOMP 1.1 heartbeats, if you are using ActiveMQ 5.8 or newer consider setting plugin.activemq.heartbeat_interval
I, [2016-05-24T11:51:19.709467 #7999] INFO -- : activemq.rb:113:in `on_connecting' TCP Connection attempt 0 to stomp://mcollective@stomp1:6163
I, [2016-05-24T11:51:19.720273 #7999] INFO -- : activemq.rb:128:in `on_connectfail' TCP Connection to stomp://mcollective@stomp1:6163 failed on attempt 0
I, [2016-05-24T11:51:19.730528 #7999] INFO -- : activemq.rb:113:in `on_connecting' TCP Connection attempt 1 to stomp://mcollective@stomp1:6163
I, [2016-05-24T11:51:19.733415 #7999] INFO -- : activemq.rb:128:in `on_connectfail' TCP Connection to stomp://mcollective@stomp1:6163 failed on attempt 1
I, [2016-05-24T11:51:19.753646 #7999] INFO -- : activemq.rb:113:in `on_connecting' TCP Connection attempt 2 to stomp://mcollective@stomp1:6163
[repeated many many times]

There's also an error message regarding a failed mcollective configuration attempt in
/var/log/cloud-init-output.log on the affected nodes:

Reading package lists...
Building dependency tree...
Reading state information...
mcollective is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
start: Job is already running: mcollective
2016-05-24 11:51:19,768 - util.py[WARNING]: Running mcollective (<module 'cloudinit.config.cc_mcollective' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_mcollective.pyc'>) failed

Currently, cloud-init configures mcollective and then runs the following command:

service mcollective start

However, nailgun-agent also adjusts the mcollective config (/etc/mcollective/server.cfg) and restarts the service. If nailgun-agent (or init) wins
the race and starts mcollective before cloud-init has configured it,
the above start command does nothing. As a result, the node can't report that provisioning has completed successfully, so deployment never starts.
To make sure mcollective sees the config file changes made by cloud-init, it must be restarted.
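
Schematically, the approach is to let cloud-init both configure and (re)start the service in one place. A minimal cloud-config sketch, assuming the stock cc_mcollective module; the connector and identity values below are illustrative, not taken from a real deployment:

```yaml
#cloud-config
# Configure mcollective via cloud-init's cc_mcollective module, which
# rewrites /etc/mcollective/server.cfg from the keys below.
mcollective:
  conf:
    connector: rabbitmq   # illustrative; should match the master node's broker
    identity: "42"        # illustrative node ID
runcmd:
  # Force a restart at the very end of the cloud-init run so the daemon
  # re-reads server.cfg even if something started it earlier.
  - service mcollective restart
```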

* Steps to reproduce:

  - Deploy a minimal cluster: 3 controllers, 2 computes, 3 ceph-osd nodes

* Expected results:

  - Provisioning and deployment complete normally within a sane time (~30--60 minutes)

* Actual results:

  - Quite a number of nodes (almost half) are stuck in the `provisioning' state

* Workaround:

  - manually restart the mcollective service on the affected nodes

Dmitry Pyzhov (dpyzhov) on 2015-05-15
Changed in fuel:
milestone: none → 7.0
Mike Scherbakov (mihgen) wrote :

Moved back to 6.1. If we want to move it to 7.0, we need to explain why we do not want to fix it in 6.1 and explain the user impact.

Changed in fuel:
milestone: 7.0 → 6.1
Kamil Sambor (ksambor) on 2015-05-19
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Kamil Sambor (ksambor)

The problem has many faces.
1) The default mcollective config has the daemonize=1 option, while our current init script assumes mcollective does not go into the background. So when SysV starts mcollective for the first time with the default config, the pidfile ends up with the wrong process number. Then, when cloud-init runs 'service mcollective start', the init script thinks the process is not running and starts it again, but with daemonize=0. We need to fix the init script: https://bugs.launchpad.net/fuel/+bug/1454741. But once we fix it, mcollective will not be restarted by cloud-init. Fortunately, nailgun-agent will restart it later.
2) When nailgun-agent starts, it checks whether the identity setting is defined in the mcollective config. If it is not, nailgun-agent appends it and restarts mcollective. But sometimes this happens at the same time cloud-init does its magic, so we occasionally end up with a corrupted mcollective config. https://bugs.launchpad.net/fuel/+bug/1449186
3) If the init script is correct and nailgun-agent does not restart mcollective, we end up with mcollective not being restarted after its config has been changed by cloud-init, and it won't be able to connect to Rabbit. So we need either to prevent mcollective from starting during provisioning (update-rc.d mcollective disable) or to restart mcollective in cloud-init's runcmd. https://bugs.launchpad.net/fuel/+bug/1455489 (this bug)
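
The pidfile mismatch from item 1 can be sketched in a few lines of shell. This is a simulation, not the real init script: a backgrounded subshell stands in for the daemonizing service, and a sleep stands in for the daemon.

```shell
# A daemonizing service forks: the process the init script launched exits,
# and the real daemon lives on as a child. A naive init script that records
# the launcher's PID therefore writes a stale PID, and later "start" calls
# think the service is dead and start a second copy.
pidfile=$(mktemp)

( sleep 30 & echo $! > "$pidfile" ) &   # "daemonize": launcher forks, child lives on
recorded=$!                             # what the naive init script records
wait "$recorded"                        # the launcher has already exited
actual=$(cat "$pidfile")                # the PID of the real daemon

echo "init script recorded PID $recorded, real daemon is PID $actual"
kill "$actual"
rm -f "$pidfile"
```

The fix referenced in item 1 amounts to making the init script track the PID of the forked daemon rather than the launcher.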

Kamil Sambor (ksambor) on 2015-05-19
Changed in fuel:
status: Triaged → In Progress
assignee: Kamil Sambor (ksambor) → Vladimir Kozhukalov (kozhukalov)
Alexander Gordeev (a-gordeev) wrote :

If it's Ubuntu-related only, I can propose a possible way of fixing it: https://review.openstack.org/#/c/179128/3/fuel_agent/fuel_agent/utils/build_utils.py

it looks like mcollective should be disabled like it was done for 'puppet'

Alexei Sheplyakov (asheplyakov) wrote :

> it looks like mcollective should be disabled like it was done for puppet

No, puppet agent is disabled for a different reason: we don't use a central server (instead the manifests are copied to each node with rsync).

Disabling mcollectived will surely break the deployment (unless cloud-init or nailgun agent re-enable mcollective)

Reviewed: https://review.openstack.org/184230
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=2c8daa42f8a6afd65f23b0ff046389163482140a
Submitter: Jenkins
Branch: master

commit 2c8daa42f8a6afd65f23b0ff046389163482140a
Author: Vladimir Kozhukalov <email address hidden>
Date: Tue May 19 14:53:09 2015 +0300

    Restart mcollective at the end of cloud-init magic

    We use mcollective cloud-init module to configure
    mcollective service. This module runs the command
    'service mcollective start' which does nothing
    if the service is already running (which is the actual case).
    So, we can end up with mcollective unable to see
    config file changed made by cloud-init and hence
    prevent mcollective from being able to connect to
    rabbit.

    This patch restarts mcollective at the end of cloud-init run.

    Closes-Bug: #1455489
    Change-Id: I9ea122acc168120596bf48f378e92874936bcbe1

Changed in fuel:
status: In Progress → Fix Committed
tags: added: customer-found
Dmitry Pyzhov (dpyzhov) on 2016-04-13
no longer affects: fuel/newton
tags: added: fuel-py
tags: added: fuel-python
removed: fuel-py
tags: added: area-python
removed: fuel-python

Fix proposed to branch: master
Review: https://review.openstack.org/305637

Changed in fuel:
assignee: Vladimir Kozhukalov (kozhukalov) → Georgy Kibardin (gkibardin)
status: Confirmed → In Progress

The original review https://review.openstack.org/184230 is for the fuel-web project, while the fix belongs in fuel-agent. This is probably why it was effectively merged to nowhere.

Georgy Kibardin (gkibardin) wrote :

Finally, I've figured out that the fix was later removed during the fix for bug #1536924.

Change abandoned by Georgy Kibardin (<email address hidden>) on branch: master
Review: https://review.openstack.org/305637

The right solution would be to fix cloud-init so that it ensures the mcollective service is stopped before fixing its config and attempting to start it.
However, stopping mcollective before running cloud-init would be quite a good workaround.

Change abandoned by Georgy Kibardin (<email address hidden>) on branch: master
Review: https://review.openstack.org/305711
Reason: It turned out that bootcmd is executed too early, most likely before mcollective is even started

Change abandoned by Georgy Kibardin (<email address hidden>) on branch: master
Review: https://review.openstack.org/306275
Reason: Sorry, forgot to get rid of previous commit

Change abandoned by Georgy Kibardin (<email address hidden>) on branch: master
Review: https://review.openstack.org/306279

It looks like we cannot stop mcollective reliably: it is just impossible to be in time and stop it after it has started but before cloud-init tries to start it.
Removing the mcollective package completely and letting cloud-init install, configure, and start it also doesn't work: fuel-agent has a dependency on that package.

Alexander Gordeev (a-gordeev) wrote :

> it is just impossible to be in time and stop it after it is started, but before cloudinit tries to start it.

I think it's possible.

It seems we can reorder the execution of cloud-init modules via cloud.cfg, so that 'bootcmd' or 'runcmd' is executed immediately before 'mcollective':

https://github.com/number5/cloud-init/blob/master/config/cloud.cfg

Could that work? Could you check that?

Georgy Kibardin (gkibardin) wrote :

Yes, we can move bootcmd right before mcollective. However, this may break future logic that decides to rely upon bootcmd; in that case the better solution would be to introduce a dedicated module for this. But even then it is still possible that mcollective will be started after this module runs but before cloud-init reconfigures it.

Georgy Kibardin (gkibardin) wrote :

BTW, as an option:
we may spoil mcollective config so that it would fail to start, then cloud-init would fix it :)

Georgy Kibardin (gkibardin) wrote :

Setting the logfile parameter to a non-existent path prevents it from starting.
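
The "spoiled config" idea would amount to something like this in /etc/mcollective/server.cfg (a sketch of the idea only; any unwritable logfile path works, and the one below is made up):

```
# Deliberately unstartable: mcollectived fails to open its logfile and
# exits, so only cloud-init's rewritten config ever brings the service up.
logfile = /nonexistent/dir/mcollective.log
loglevel = info
```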

Alexander Gordeev (a-gordeev) wrote :

In fact, the mcollective service is not supposed to start automatically anyway, due to:
https://github.com/openstack/fuel-agent/blob/master/fuel_agent/utils/build.py#L208-L209

Any reason why that didn't work in some cases?

Georgy Kibardin (gkibardin) wrote :

The reason mcollective is started despite /etc/init/mcollective.override being present is that in some circumstances cron may start nailgun-agent BEFORE cloud-init reconfigures mcollective. In that case nailgun-agent doesn't find the "identity" field in the mcollective config, writes it there, and then restarts mcollective.
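
Rendered as shell for illustration (the real agent is Ruby; a mktemp file stands in for /etc/mcollective/server.cfg, and the node ID is made up), the racy pre-fix check looks like:

```shell
# Hypothetical sketch of the pre-fix nailgun-agent logic: if the mcollective
# config lacks an "identity" line, append one and restart the service.
# Started from cron, this can run BEFORE cloud-init has written its version
# of the file, leaving mcollective running with a stale config.
cfg=$(mktemp)                       # stands in for /etc/mcollective/server.cfg
printf 'connector = rabbitmq\n' > "$cfg"

restarted=no
if ! grep -q '^identity' "$cfg"; then
    echo 'identity = 42' >> "$cfg"  # node ID is illustrative
    restarted=yes                   # the real agent restarts mcollective here
fi
echo "config now:"
cat "$cfg"
```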

Reviewed: https://review.openstack.org/308946
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=50bcd8448ecab85194b4559569ebc8568191e481
Submitter: Jenkins
Branch: master

commit 50bcd8448ecab85194b4559569ebc8568191e481
Author: Georgy Kibardin <email address hidden>
Date: Thu Apr 21 14:52:43 2016 +0300

    Do restart mcollective after provisioning

    nailgun agent must not change mcollective config after provisioning
    cloudinit is responsible for that.

    It may happen that nailgun agent may be started by cron earlier than
    it is reconfigured and started by cloudinit. This leads to the
    situation when cloudinit reconfigures mcollective and issues start
    which doesn't do anything since it is already started by nailgun agent.
    Finally, mcollective is left started with incorrect default
    configuration.

    Change-Id: I0c6f3720943ad21e22899368832e451bc906b098
    Closes-Bug: #1455489

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/314711
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=8b0ef4e707e6afaae1863978d8b5dc94eb6315be
Submitter: Jenkins
Branch: stable/mitaka

commit 8b0ef4e707e6afaae1863978d8b5dc94eb6315be
Author: Georgy Kibardin <email address hidden>
Date: Thu Apr 21 14:52:43 2016 +0300

    Do restart mcollective after provisioning

    nailgun agent must not change mcollective config after provisioning
    cloudinit is responsible for that.

    It may happen that nailgun agent may be started by cron earlier than
    it is reconfigured and started by cloudinit. This leads to the
    situation when cloudinit reconfigures mcollective and issues start
    which doesn't do anything since it is already started by nailgun agent.
    Finally, mcollective is left started with incorrect default
    configuration.

    Change-Id: I0c6f3720943ad21e22899368832e451bc906b098
    Closes-Bug: #1455489
    (cherry picked from commit 50bcd8448ecab85194b4559569ebc8568191e481)

The fix relies on a "magic hostname", which is ugly as hell. It also leaves open quite a number of races, such as:

- upstart starts cloud-init
- upstart starts cron
- cloud-init sets hostname and gets pre-emptied
- cron starts nailgun-agent
- nailgun-agent checks the hostname and does NOT update the node ID in server.cfg

Result: the node can't report itself to the master node

Please note: just because the set_hostname module is started at the 'init' stage (as soon as / and /run
get mounted) doesn't mean it will complete before cron has been started (cron does not depend on
any cloud-init jobs). Also note that `start on foo' means it's OK to run the job after the `foo'
event has occurred; it doesn't mean every `start on foo' job will start immediately after `foo'.
Upstart will merely do its best to start them within the (foo, +infinity) time interval.
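
For reference, the upstart semantics described above apply to any job file of this shape (an illustrative job, not one shipped by Fuel):

```
# /etc/init/example.conf
description "example job"
start on filesystem    # eligible once the event fires; upstart promises no
                       # upper bound on when the job actually starts
exec /usr/bin/example-daemon
```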

So the correct fix is to have exactly one entity which manages /etc/mcollective/server.cfg.
The most natural choice is cloud-init. The script installed by fuel-agent correctly fills in
the node ID (see https://github.com/openstack/fuel-agent/blob/master/cloud-init-templates/cloud_config_centos.jinja2#L69).
So nailgun-agent should just stop fiddling with the mcollectived service and its config in order to fix this bug for real.

summary: - Cloud-init must restart mcollective after configuring
+ Nodes stuck in "provisioning" state despite successful reboot into newly
+ installed OS
description: updated
description: updated
Georgy Kibardin (gkibardin) wrote :

Alexei,

The magic hostname is definitely not the best solution. To distinguish between provisioned/non-provisioned modes it is better to rely upon a more explicit configuration option. The decision about a single point of configuration also looks like the best way to proceed, so let's create a tech-debt ticket.

As for races.

> Also https://review.openstack.org/#/c/308946/4 introduces a different bug:
>
>- upstart starts cloud-init
>- upstart starts cron
>- cloud-init sets hostname and gets pre-emptied
>- cron starts nailgun-agent
>- nailgun-agent checks the hostname and does NOT update the node ID in server.cfg
>
>Result: the node can't report itself to the master node

When we boot into the bootstrap, cloud-init is not used: the hostname is preset and nailgun-agent writes the node ID into server.cfg.
When we boot into the installed OS, cloud-config takes care of server.cfg.

At what point does the race you are talking about occur?

Georgy Kibardin (gkibardin) wrote :
tags: added: on-verification

Wasn't able to reproduce on 9.0-mos #450

[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 450
cat /etc/fuel_build_number:
 450
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6347.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8723.noarch
 python-packetary-9.0.0-1.mos140.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-migrate-9.0.0-1.mos8435.noarch
 rubygem-astute-9.0.0-1.mos748.noarch
 fuel-mirror-9.0.0-1.mos140.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-openstack-metadata-9.0.0-1.mos8723.noarch
 fuel-notify-9.0.0-1.mos8435.noarch
 nailgun-mcagents-9.0.0-1.mos748.noarch
 python-fuelclient-9.0.0-1.mos323.noarch
 fuel-9.0.0-1.mos6347.noarch
 fuel-utils-9.0.0-1.mos8435.noarch
 fuel-setup-9.0.0-1.mos6347.noarch
 fuel-misc-9.0.0-1.mos8435.noarch
 fuel-library9.0-9.0.0-1.mos8435.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-ui-9.0.0-1.mos2715.noarch
 fuel-ostf-9.0.0-1.mos935.noarch
 fuelmenu-9.0.0-1.mos272.noarch
 fuel-nailgun-9.0.0-1.mos8723.noarch

tags: removed: on-verification

Reviewed: https://review.openstack.org/323333
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=8fe27fe12f5fb758ea7cb3fecf6e69b36db65200
Submitter: Jenkins
Branch: stable/8.0

commit 8fe27fe12f5fb758ea7cb3fecf6e69b36db65200
Author: Georgy Kibardin <email address hidden>
Date: Thu Apr 21 14:52:43 2016 +0300

    Do restart mcollective after provisioning

    nailgun agent must not change mcollective config after provisioning
    cloudinit is responsible for that.

    It may happen that nailgun agent may be started by cron earlier than
    it is reconfigured and started by cloudinit. This leads to the
    situation when cloudinit reconfigures mcollective and issues start
    which doesn't do anything since it is already started by nailgun agent.
    Finally, mcollective is left started with incorrect default
    configuration.

    Change-Id: I0c6f3720943ad21e22899368832e451bc906b098
    Closes-Bug: #1455489
    (cherry picked from commit 50bcd8448ecab85194b4559569ebc8568191e481)

tags: added: on-verification
Ekaterina Shutova (eshutova) wrote :

Verified on MOS 8.0 + mu2 updates.

The issue is not 100% reproducible.
With the updates, the cluster was deployed successfully; no nodes stuck in the "provisioning" state.

tags: removed: on-verification

Change abandoned by Sergii Rizvan (<email address hidden>) on branch: stable/7.0
Review: https://review.openstack.org/336501

Change abandoned by Sergii Rizvan (<email address hidden>) on branch: stable/6.1
Review: https://review.openstack.org/337157

Sergii Rizvan (srizvan) wrote :

The changes were abandoned for 6.1 and 7.0 because this patch would block mcollective identity updates there: that functionality relies on two patches which were merged only into 8.0 (https://bugs.launchpad.net/fuel/+bug/1518306).
Also, due to rare reproducibility and the great difficulty of making the patch, we're going to close the bug for 6.1 and 7.0 as Won't Fix.

We have a regression in the MOS 8.0 CentOS bootstrap: https://bugs.launchpad.net/fuel/+bug/1604365

It is possible to end up with a configured hostname in the "reset and redeploy" scenario, because cobbler provides names over DHCP during PXE load. The Ubuntu bootstrap is not affected because it never changes the hostname, due to some lucky conditions (it always rewrites the hostname to a static one).

While this can be fixed in the CentOS image in the same fashion as in Ubuntu (a static hostname in the image), I think it is better to discuss the hostname-based solution one more time.

As I see it now, we can do one of the following:
1. Hostname solution. Use a static hostname in CentOS (8.0 only).
2. Hostname solution. Fix cobbler in the reset scenario (all versions, I think).
3. Drop the hostname solution in favor of something else.

By the way, do we really need nailgun-agent after deployment?

Alexander Gordeev (a-gordeev) wrote :

>By the way. Do we really need nailgun-agent after deployment?

At the very least, nailgun-agent indicates to nailgun that the node is online. Apart from that, I'm not aware of any other need for nailgun-agent after deployment.

tags: added: on-verification
Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: on-verification
