Deployment with Ceph multiroles has failed

Bug #1411660 reported by Anastasia Palkina
Affects              Status        Importance  Assigned to    Milestone
Fuel for OpenStack   Fix Released  High        Dima Shulyak
6.0.x                Invalid       Undecided   Unassigned

Bug Description

"build_id": "2015-01-15_22-54-45",
"ostf_sha": "92ad9f8e4c509c82e07ceb093b5d579205c76014",
"build_number": "63",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "051d23b22c21eab39c968372a5d40727c2b66281",
"production": "docker",
"fuelmain_sha": "",
"astute_sha": "82125b0eef4e5a758fd4afa8917812e09a1f7dac",
"feature_groups": ["mirantis"],
"release": "6.1",
"release_versions": {"2014.2-6.1": {"VERSION": {"build_id": "2015-01-15_22-54-45", "ostf_sha": "92ad9f8e4c509c82e07ceb093b5d579205c76014", "build_number": "63", "api": "1.0", "nailgun_sha": "051d23b22c21eab39c968372a5d40727c2b66281", "production": "docker", "fuelmain_sha": "", "astute_sha": "82125b0eef4e5a758fd4afa8917812e09a1f7dac", "feature_groups": ["mirantis"], "release": "6.1", "fuellib_sha": "59af43598682f4f0c5aebf584a959ac730a4d86d"}}},
"fuellib_sha": "59af43598682f4f0c5aebf584a959ac730a4d86d"

First deployment:
1. Create new environment (CentOS)
2. Choose nova-network, vlan manager
3. Choose Ceph for images
4. Choose Sahara and Ceilometer
5. Add 1 controller+ceph, 1 compute+ceph, 1 cinder+ceph, 2 mongo
6. Start deployment. It failed with an error on the controller (node-1):

2015-01-16 12:32:36 ERR

 (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) change from notrun to 0 failed: ceph-deploy osd prepare node-1:/dev/sdb4 node-1:/dev/sdc4 returned 1 instead of one of [0]

Second deployment:
1. Create new environment (CentOS)
2. Choose neutron GRE
3. Choose Ceph for images
4. Choose Murano and Ceilometer
5. Add 1 controller+mongo, 1 compute+ceph+cinder, 1 cinder+mongo, 1 ceph
6. Start deployment. It failed with an error on the compute node (node-7):

2015-01-16 12:42:22 ERR

 (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) change from notrun to 0 failed: ceph-deploy --overwrite-conf config pull node-6 returned 1 instead of one of [0]

Revision history for this message
Anastasia Palkina (apalkina) wrote :
Changed in fuel:
status: New → Confirmed
Revision history for this message
Ryan Moe (rmoe) wrote :

In the first deployment it appears that the ceph-mon role wasn't run on node-1 (I don't see it running ceph-deploy mon create or calling gatherkeys). Also, Puppet was not run on node-4 (role=primary-mongo) but it was on node-5 (role=mongo).

In the second deployment Puppet wasn't run on node-6 (primary-mongo + primary-controller) at all, which meant no ceph-mon again.

Ryan Moe (rmoe)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Ryan Moe (rmoe)
Revision history for this message
Ryan Moe (rmoe) wrote :

The issue seems to be caused by having mongo nodes in the environment. When a controller+ceph role is present and there are mongo nodes in the environment, the tasks on the controller run out of order: it attempts to run ceph-osd before the controller role (which obviously fails). Removing the mongo nodes from the environment allows the deployment to succeed.

The deployment of mongo fails (see: https://bugs.launchpad.net/fuel/+bug/1419108 ) and I'm not sure if this is influencing the order of tasks on the controller.

Revision history for this message
Ryan Moe (rmoe) wrote :

It seems that this is only a problem when you have a single HA controller. Adding additional controllers works ok in this scenario.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/154677

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Ryan Moe (rmoe) wrote :

This issue should only occur with an environment that contains only one controller AND nodes with roles that depend only on 'controller'.

With the environment configuration described here we end up with a deployment order of:
['mongo', 'compute', 'cinder', 'ceph-osd']
['primary-mongo']
['primary-controller']

instead of the correct order:
['mongo']
['primary-mongo']
['primary-controller']
['compute', 'cinder', 'ceph-osd']

I believe the root cause is incorrect dependencies on the ceph-osd, compute, and cinder groups that cause them to be run out of order.

To order the groups for deployment we start with a list of root nodes in the graph (zabbix-server and base-os) and a list of all groups that have already been processed [0]. This initial list of already-processed nodes contains the difference between all available groups and the groups in our environment. For this particular deployment scenario we start with this: ['base-os', 'controller', 'zabbix-server'] (these are the roles that are not assigned to any node in the environment).

After the priorities are processed for the current groups (zabbix-server and base-os) we get the next set of groups to process [1]. Getting the next groups involves iterating over all roles in the graph and checking [2]:
1. That the current role's predecessors are in our list of processed nodes.
2. That the current role has not already been processed.

The failure in this case is that 'controller' is in the list of already processed groups. Because ceph-osd, compute, and cinder depend only on controller they pass check number one. They also do not exist in the list of processed nodes (as shown above) so the second check is satisfied. This means that cinder, compute, and ceph-osd all get added in parallel with the mongo group (which also passes both checks at this point). The primary-controller is not added at this time because its predecessors have not been processed (mongo and primary-mongo are in the environment and therefore not in the list of already-processed nodes yet).

This is not a problem with one controller and no mongo because in this case 'mongo' and 'primary-mongo' are in the list of already processed nodes (because they're not in the environment). This means that the first time primary-controller is checked its predecessors have already been processed and primary-controller is added to the groups to process.

[0] https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/deployment_graph.py#L343-L344
[1] https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/deployment_graph.py#L95
[2] https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/deployment_graph.py#L103-L104
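
A minimal sketch of the check described above, assuming a networkx-style graph where an edge points from a dependency to the group that requires it. This is only an illustration, not the actual deployment_graph.py code; the group names and edges are modeled on the roles in this environment:

import networkx as nx

g = nx.DiGraph()
# Edge (a, b) means group b depends on group a.
g.add_edges_from([
    ('mongo', 'primary-mongo'),
    ('primary-mongo', 'primary-controller'),
    ('primary-controller', 'controller'),
    ('controller', 'cinder'),
    ('controller', 'compute'),
    ('controller', 'ceph-osd'),
])

# Roles actually assigned in this environment (single controller, so no plain 'controller').
in_env = {'mongo', 'primary-mongo', 'primary-controller', 'cinder', 'compute', 'ceph-osd'}

# Groups absent from the environment start out as "already processed".
processed = set(g.nodes()) - in_env  # {'controller'}

def next_groups_buggy(graph, processed):
    # Direct-predecessor check: schedule a group once all of its
    # immediate predecessors have already been processed.
    return [n for n in graph
            if n not in processed
            and all(p in processed for p in graph.predecessors(n))]

print(next_groups_buggy(g, processed))
# ['mongo', 'cinder', 'compute', 'ceph-osd'] -- cinder, compute and ceph-osd
# are scheduled alongside mongo, before primary-controller has run.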

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Thanks a lot, this is very helpful.
We need to change the group traversal to be based not simply on direct predecessors, but on all requirements found during the traversal.
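
A sketch of that idea, reusing the example graph above: instead of checking only direct predecessors, check every requirement reachable on the path to the group. Again, this is an illustration of the approach, not the merged nailgun code:

def next_groups_fixed(graph, processed):
    # Full-requirement check: a group is scheduled only when every group
    # it transitively depends on has been processed (or is absent from
    # the environment and therefore pre-marked as processed).
    return [n for n in graph
            if n not in processed
            and all(a in processed for a in nx.ancestors(graph, n))]

print(next_groups_fixed(g, processed))
# ['mongo'] -- cinder, compute and ceph-osd now wait until
# primary-controller (and through it 'controller') has been processed.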

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/154828

Changed in fuel:
assignee: Ryan Moe (rmoe) → Dima Shulyak (dshulyak)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Ryan Moe (<email address hidden>) on branch: master
Review: https://review.openstack.org/154677

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/154828
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=4a90555aa4a707e2c55d82a639e47b35ff2d97f6
Submitter: Jenkins
Branch: master

commit 4a90555aa4a707e2c55d82a639e47b35ff2d97f6
Author: Dmitry Shulyak <email address hidden>
Date: Wed Feb 11 12:41:32 2015 +0200

    Use all dependencies during groups traversal

    Using direct predecessors during groups traversal can lead
    to wrong ordering in cases where predecessor is skipped.

    As an example, assigning mongo, one controller, and a role dependent on
    controller, such as cinder, will result in the following order:

    mongo, cinder
    primary-controller

    However, cinder has a dependency on controller, therefore the full
    path should be taken into account.

    Closes-bug: #1411660
    related to blueprint granular-deployment-based-on-tasks

    Change-Id: I7530642ec600edaec9d23dba359a7c19e112f200

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Alex Ermolov, why do you think that this bug is Invalid for 6.0.1? Please give an explanation for every bug you are moving to Invalid state.

Revision history for this message
Dima Shulyak (dshulyak) wrote :

It should not even affect 6.0.1; this was a bug caused by granular deployment.

Revision history for this message
Anastasia Palkina (apalkina) wrote :

Verified on ISO #141

"build_id": "2015-02-24_09-36-08", "ostf_sha": "1a0b2c6618fac098473c2ed5a9af11d3a886a3bb", "build_number": "141", "release_versions": {"2014.2-6.1": {"VERSION": {"build_id": "2015-02-24_09-36-08", "ostf_sha": "1a0b2c6618fac098473c2ed5a9af11d3a886a3bb", "build_number": "141", "api": "1.0", "nailgun_sha": "3df73a3cfdea921260bb440b3f572820c76eb01b", "production": "docker", "python-fuelclient_sha": "5657dbf06fddb74adb61e9668eb579a1c57d8af8", "astute_sha": "d81ff53c2f467151ecde120d3a4d284e3b5b3dfc", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "b975019fabdb429c1869047df18dd792d2163ecc", "fuellib_sha": "f94e5a2e5c08428a80e227d8c6e545debe578dfc"}}}, "auth_required": true, "api": "1.0", "nailgun_sha": "3df73a3cfdea921260bb440b3f572820c76eb01b", "production": "docker", "python-fuelclient_sha": "5657dbf06fddb74adb61e9668eb579a1c57d8af8", "astute_sha": "d81ff53c2f467151ecde120d3a4d284e3b5b3dfc", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "b975019fabdb429c1869047df18dd792d2163ecc", "fuellib_sha": "f94e5a2e5c08428a80e227d8c6e545debe578dfc"

Changed in fuel:
status: Fix Committed → Fix Released