Mcollective can be restarted after the deployment was started

Bug #1518306 reported by Kyrylo Galanov
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Fix Released | High | Artem Roma |

Bug Description

Hi,

I have analyzed the log files in the job #148. Since the environment is down, some of the log files are lost.
https://172.18.160.103/view/8.0_swarm/job/8.0.system_test.ubuntu.thread_1/48/

Astute starts deployment tasks _before_ mcollective is configured on the slave nodes.

1. Deployment is started
2. The node is set up and rebooted
3. Astute starts ntpdate update task (https://bugs.launchpad.net/fuel/+bug/1504493)
4. The agent (nailgun-agent) is started by cron, gets the node id from the master, and restarts mcollective (https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L119)
5. Deployment fails.

Mcollective is reconfigured after 10 minutes on average: http://paste.openstack.org/show/479578/

Best regards,
Kyrylo

Changed in fuel:
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Fuel Core Team (fuel-core)
milestone: none → 8.0
tags: added: area-python
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Core Team (fuel-core) → Fuel Python Team (fuel-python)
importance: Critical → High
Dmitry Pyzhov (dpyzhov)
tags: added: team-bugfix
Mike Scherbakov (mihgen) wrote :

I believe that we are missing a fuel nailgun agent run on every server boot (we should start it right away and not wait for cron to start our script).
We can either use the @reboot clause in the cron script (if it's supported and does what's needed), or have the script run from /etc/rc.d. We should prevent two copies of the script from running at the same time, though.
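
A minimal sketch of the "no two copies at once" guard, for illustration only. It assumes an advisory lock file is acceptable; the function name and lock path are hypothetical, and the real agent is not written this way as far as this report shows.

```python
import fcntl
import os


def try_acquire_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken.

    lock_path is a hypothetical location chosen by the caller.
    """
    lock_file = open(lock_path, "w")
    try:
        # LOCK_NB makes flock() fail immediately instead of blocking
        # when another open file already holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock_file.close()
        return None
    # Record the owner for debugging; the lock itself is what matters.
    lock_file.write(str(os.getpid()))
    lock_file.flush()
    return lock_file
```

A script started both at boot and from cron would call this first and exit quietly when it gets None. The lock is released when the returned file is closed or the process exits.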

Mike Scherbakov (mihgen) wrote :

Reassigned to fuel-library to investigate / fix this.

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Fuel Library Team (fuel-library)
Ihor Kalnytskyi (ikalnytskyi) wrote :

Or probably use fuel-agent's cloud-init to create /etc/nailgun_uid that's used, afaik, by mcollective.

tags: added: area-library
removed: area-python
Kyrylo Galanov (kgalanov) wrote :

Hello guys,

According to the logs, nailgun was started a couple of times after the reboot. It is run every minute. However, it received the new id only later.
This bug fails the deployment only occasionally and cannot be reproduced every time.

--
Kyrylo

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
removed: area-library
Mike Scherbakov (mihgen) wrote :

Kyrylo,
what do you mean by Nailgun started a couple of times after reboot? This may affect many other scenarios, not just deployment. Why was it assigned back to fuel-python team?

Vladimir Sharshov (vsharshov) wrote :

What if we add a check: if a puppet process exists, do not restart mcollective. It is 100% solution, but it is a workaround which will protect us.

Vladimir Sharshov (vsharshov) wrote :

Sorry, it is not a 100% solution, but it will work until we find a proper fix.
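
A rough sketch of that workaround, for illustration only. It assumes a Linux /proc layout and that a running puppet shows up with "puppet" in its command name; both are assumptions, and the function names are hypothetical.

```python
import os


def process_names(proc_root="/proc"):
    """Yield command names read from <proc_root>/<pid>/comm."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as comm:
                yield comm.read().strip()
        except OSError:
            # The process exited (or has no comm file); skip it.
            continue


def should_restart_mcollective(names, busy_marker="puppet"):
    """Return False while any process name contains busy_marker."""
    return not any(busy_marker in name for name in names)
```

The agent would call should_restart_mcollective(process_names()) before restarting the service. There is still a window between the check and the restart, which is why this is only a workaround.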

Artem Roma (aroma-x)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Artem Roma (aroma-x)
Artem Roma (aroma-x) wrote :

Ok, nailgun-agent implements this trick [1] to randomise the time of its HTTP request to nailgun, so that the latter is not flooded with HTTP traffic from all present nodes at once. Apparently this also causes the issue here. One of the responsibilities of the agent is to set the 'identity' key (the id of the node where all this happens) in the mcollective config once the provision process has ended, and then to restart the service. Sometimes this happens just after the deployment has started.
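
To illustrate the idea only: a minimal sketch of such a randomised delay, assuming the trick amounts to sleeping for a random interval before the request. The 30-second bound and all names are hypothetical, not taken from the agent.

```python
import random
import time


def jittered_delay(max_delay_seconds):
    """Pick a uniformly random delay between 0 and max_delay_seconds."""
    return random.uniform(0, max_delay_seconds)


def report_with_jitter(send_report, max_delay_seconds=30, sleep=time.sleep):
    """Sleep for a random interval, then call send_report().

    Returns the delay used, so callers can log it.
    """
    delay = jittered_delay(max_delay_seconds)
    sleep(delay)
    send_report()
    return delay
```

The side effect described above follows directly: whatever the agent does after the request, including the mcollective restart, is shifted by the same random amount.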

I believe we should somehow pre-populate the config with the key before nailgun-agent is executed after provisioning. AFAIK, this may be done via the fuel-agent scripts; we just need to figure out how to obtain the node id at this stage.

[1]: https://github.com/openstack/fuel-nailgun-agent/blob/master/agent#L753-l757
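
A small sketch of what pre-populating the config could look like, assuming the config is a flat list of "key = value" lines and the node id is already known at provision time. The helper names and the example settings are hypothetical.

```python
def build_settings(base_settings, node_id):
    """Return a copy of base_settings with the 'identity' key set."""
    settings = dict(base_settings)
    settings["identity"] = str(node_id)
    return settings


def render_mcollective_config(settings):
    """Render settings as 'key = value' lines, sorted by key."""
    return "".join(
        "{} = {}\n".format(key, settings[key]) for key in sorted(settings)
    )
```

For example, render_mcollective_config(build_settings({"main_collective": "mcollective"}, 7)) gives an "identity = 7" line followed by the main_collective line. With the key written when the config is created, the agent has no reason to rewrite the file and restart the service after the first boot.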

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/257332

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-agent (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/257340

Artem Roma (aroma-x)
Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/257332
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=98f2d58542eeb1a5690a27d1fe12811614374961
Submitter: Jenkins
Branch: master

commit 98f2d58542eeb1a5690a27d1fe12811614374961
Author: Artem Roma <email address hidden>
Date: Mon Dec 14 14:39:53 2015 +0200

    Add 'identity' key to mcollective configuration data

    'identity' parameter represents id of the node and is placed in
    mcollective config by nailgun-agent. But sometimes such behavior
    (especially if to take into consideration that restart of mcollective
    follows it) may lead to failed deployment (See related bug). Now the parameter
    is supplied by nailgun and is used by fuel-agent to create the config
    with the data already present in it when node boots after provision is
    done.

    Change-Id: I753eb76ed9c3b80f249c0c4b86ef48ef49274990
    Related-Bug: #1518306

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-agent (master)

Reviewed: https://review.openstack.org/257340
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=066334291ac44d6454516c6a8e99119152c2e6b5
Submitter: Jenkins
Branch: master

commit 066334291ac44d6454516c6a8e99119152c2e6b5
Author: Artem Roma <email address hidden>
Date: Mon Dec 14 15:04:04 2015 +0200

    Add processing of 'identity' parameter for mcollective config

    Nailgun-agent provided the parameter for the config and restarts
    mcollective after update. But in some cases (see description of the
    related bug) such behavior may cause deployment failure. So now the data
    is supplied by astute in provision info and is placed into config on its
    creation as other parameters.

    Change-Id: I3670e571c13808da2b54bd6238d228e7cdb0ef96
    Related-Bug: #1518306
    Depends-On: I753eb76ed9c3b80f249c0c4b86ef48ef49274990

Artem Roma (aroma-x)
Changed in fuel:
status: In Progress → Fix Committed
Andrey Sledzinskiy (asledzinskiy) wrote :

verified on 8.0-506

Changed in fuel:
status: Fix Committed → Fix Released