Fuel for OpenStack

Mcollective can be restarted after the deployment was started

Bug #1518306 reported by Kyrylo Galanov on 2015-11-20

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Fix Released	High	Artem Roma	Fuel for OpenStack 8.0

Bug Description

Hi,

I have analyzed the log files of job #148. Since the environment is down, some of the log files are lost.
https://172.18.160.103/view/8.0_swarm/job/8.0.system_test.ubuntu.thread_1/48/

Astute starts deployment tasks _before_ mcollective is configured on the slave nodes.

1. Deployment is started
2. The node is set up and rebooted
3. Astute starts ntpdate update task (https://bugs.launchpad.net/fuel/+bug/1504493)
4. Fuel agent is started by cron, FA gets node id from the master and restarts mcollective (https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L119)
5. Deployment fails.

Mcollective is reconfigured after 10 minutes on average: http://paste.openstack.org/show/479578/

Best regards,
Kyrylo

Kyrylo Galanov (kgalanov) on 2015-11-20

Changed in fuel:
status:	New → Confirmed
importance:	Undecided → Critical
assignee:	nobody → Fuel Core Team (fuel-core)
milestone:	none → 8.0
tags:	added: area-python

Dmitry Pyzhov (dpyzhov) on 2015-11-20

Changed in fuel:
assignee:	Fuel Core Team (fuel-core) → Fuel Python Team (fuel-python)
importance:	Critical → High

Dmitry Pyzhov (dpyzhov) on 2015-11-20

tags:	added: team-bugfix

Mike Scherbakov (mihgen) wrote on 2015-11-23:

I believe that we are missing a run of fuel nailgun agent on every server boot (we should start it right away and not wait for cron to start our script).
We can either use the @reboot clause in the cron script (if it's supported and does what's needed), or have the script run by /etc/rc.d. We should prevent two copies of the script from running at the same time, though.
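
For illustration, guarding against two copies of a script running at once is commonly done with an exclusive, non-blocking file lock. The sketch below is not taken from nailgun-agent; the function name and the lock path are hypothetical, and it assumes a POSIX system where `fcntl.flock` is available.

```python
import fcntl
import os
import tempfile


def try_acquire_lock(path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    # Hypothetical lock path, chosen for this example only.
    lock_path = os.path.join(tempfile.gettempdir(), "agent-example.lock")
    lock = try_acquire_lock(lock_path)
    if lock is None:
        print("another copy is already running, exiting")
    else:
        print("lock acquired, doing the work")
        lock.close()  # closing the file releases the lock
```

The lock is released automatically when the process exits, so a crashed run does not block the next one.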

Mike Scherbakov (mihgen) wrote on 2015-11-23:

Reassigned to fuel-library to investigate / fix this.

Changed in fuel:
assignee:	Fuel Python Team (fuel-python) → Fuel Library Team (fuel-library)

Ihor Kalnytskyi (ikalnytskyi) wrote on 2015-11-24:

Or probably use fuel-agent's cloud-init to create /etc/nailgun_uid that's used, afaik, by mcollective.

Maciej Kwiek (maciej-iai) on 2015-11-24

tags:	added: area-library
	removed: area-python

Kyrylo Galanov (kgalanov) wrote on 2015-11-24:

Hello guys,

According to the logs, nailgun was started a couple of times after the reboot. It is run every minute. However, it received the new id later.
This bug fails the deployment occasionally and cannot be reproduced each time.

--
Kyrylo

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)

Dmitry Pyzhov (dpyzhov) on 2015-11-24

tags:	added: area-python
	removed: area-library

Mike Scherbakov (mihgen) wrote on 2015-11-24:

Kyrylo,
what do you mean by Nailgun started a couple of times after reboot? This may affect many other scenarios, not just deployment. Why was it assigned back to fuel-python team?

Vladimir Sharshov (vsharshov) wrote on 2015-11-27:

What if we add a check: if a puppet process exists, do not restart mcollective. It is a 100% solution, but it is a workaround which will protect us.

Vladimir Sharshov (vsharshov) wrote on 2015-11-27:

Sorry, it is not a 100% solution, but it will work until we find a proper way.
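
A minimal sketch of the proposed workaround, for illustration only. The agent itself is not written in Python, and both function names are hypothetical. It assumes a Linux-style /proc and that a running puppet shows up with "puppet" in its command name, which may not hold on every node.

```python
import os


def running_process_names(proc_root="/proc"):
    """Collect command names of running processes from a Linux-style /proc."""
    names = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as comm:
                names.append(comm.read().strip())
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return names


def may_restart_mcollective(process_names):
    """Allow a restart only when no puppet process is running."""
    return not any("puppet" in name for name in process_names)


if __name__ == "__main__":
    if may_restart_mcollective(running_process_names()):
        print("no puppet run in progress, restart is allowed")
    else:
        print("puppet is running, skipping the restart")
```

As noted above, this only narrows the window: a task could still start between the check and the restart.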

Artem Roma (aroma-x) on 2015-12-02

Changed in fuel:
assignee:	Fuel Python Team (fuel-python) → Artem Roma (aroma-x)

Artem Roma (aroma-x) wrote on 2015-12-14:

OK, nailgun-agent implements this trick [1] to randomise the time of its HTTP requests to nailgun, so that the latter is not flooded with HTTP traffic from all present nodes. Apparently this also causes the issue here. One of the agent's responsibilities is to set the 'identity' key (the id of the node the agent runs on) in the mcollective config once provisioning has finished, and then to restart the service. Sometimes this happens just after the deployment has started.

I believe we should pre-populate the config with the key before nailgun-agent runs after provisioning. AFAIK, this can be done via fuel-agent's scripts; we just need to figure out how to obtain the node id at this stage.

[1]: https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L753-l757
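
To illustrate the interaction described in this comment, here is a rough sketch of a randomised request delay and of the condition under which the resulting restart lands inside a deployment. It is not the agent's actual code; the function names, the 30-second bound, and the timestamps are assumptions made for the example.

```python
import random


def jittered_delay(max_delay_seconds, rng=random):
    """Pick a random delay in [0, max_delay_seconds] to spread out requests."""
    return rng.uniform(0, max_delay_seconds)


def restart_during_deployment(agent_update_at, deployment_started_at):
    """True when the agent rewrites the config after deployment has begun."""
    return agent_update_at > deployment_started_at


if __name__ == "__main__":
    # Hypothetical timeline, in seconds after the node finished booting.
    agent_update_at = jittered_delay(30)
    deployment_started_at = 10.0
    if restart_during_deployment(agent_update_at, deployment_started_at):
        print("mcollective would be restarted while tasks are running")
    else:
        print("config was updated before the deployment started")
```

The jitter protects nailgun from a burst of requests, but it also makes the moment of the config update unpredictable relative to the start of deployment.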

OpenStack Infra (hudson-openstack) wrote on 2015-12-14: Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/257332

OpenStack Infra (hudson-openstack) wrote on 2015-12-14: Related fix proposed to fuel-agent (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/257340

Artem Roma (aroma-x) on 2015-12-16

Changed in fuel:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2015-12-18: Related fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/257332
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=98f2d58542eeb1a5690a27d1fe12811614374961
Submitter: Jenkins
Branch: master

commit 98f2d58542eeb1a5690a27d1fe12811614374961
Author: Artem Roma <email address hidden>
Date: Mon Dec 14 14:39:53 2015 +0200

Add 'identity' key to mcollective configuration data

    'identity' parameter represents id of the node and is placed in
    mcollective config by nailgun-agent. But sometimes such behavior
    (especially if to take into consideration that restart of mcollective
    follows it) may lead to failed deployment (See related bug). Now the parameter
    is supplied by nailgun and is used by fuel-agent to create the config
    with the data already present in it when node boots after provision is
    done.

    Change-Id: I753eb76ed9c3b80f249c0c4b86ef48ef49274990
    Related-Bug: #1518306

OpenStack Infra (hudson-openstack) wrote on 2015-12-18: Related fix merged to fuel-agent (master)

Reviewed: https://review.openstack.org/257340
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=066334291ac44d6454516c6a8e99119152c2e6b5
Submitter: Jenkins
Branch: master

commit 066334291ac44d6454516c6a8e99119152c2e6b5
Author: Artem Roma <email address hidden>
Date: Mon Dec 14 15:04:04 2015 +0200

Add processing of 'identity' parameter for mcollective config

    Nailgun-agent provided the parameter for the config and restarts
    mcollective after update. But in some cases (see description of the
    related bug) such behavior may cause deployment failure. So now the data
    is supplied by astute in provision info and is placed into config on its
    creation as other parameters.

    Change-Id: I3670e571c13808da2b54bd6238d228e7cdb0ef96
    Related-Bug: #1518306
    Depends-On: I753eb76ed9c3b80f249c0c4b86ef48ef49274990
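
The idea behind the two merged changes can be sketched as follows: the node id arrives with the rest of the provisioning data and is written into the mcollective config when the file is first created, so no later rewrite and restart is needed. This is a simplified illustration and not the fuel-agent code; the function name and the sample values are hypothetical, and only the 'identity' key comes from the commit messages.

```python
def render_mcollective_config(settings):
    """Render settings as 'key = value' lines, sorted by key."""
    return "".join(
        f"{key} = {value}\n" for key, value in sorted(settings.items())
    )


if __name__ == "__main__":
    # Hypothetical provisioning data; the values are made up for the example.
    mcollective_settings = {
        "connector": "rabbitmq",
        "identity": 7,  # node id, now supplied before the first boot
    }
    print(render_mcollective_config(mcollective_settings), end="")
```

With the key present from the start, the agent has nothing to change in the file after the node boots.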

Artem Roma (aroma-x) on 2015-12-21

Changed in fuel:
status:	In Progress → Fix Committed

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-02-02:

verified on 8.0-506

Changed in fuel:
status:	Fix Committed → Fix Released
