Deployment with Ceph multiroles has failed

Bug #1411660 reported by Anastasia Palkina
Affects              Status        Importance  Assigned to    Milestone
Fuel for OpenStack   Fix Released  High        Dima Shulyak
6.0.x                Invalid       Undecided   Unassigned

Bug Description

"build_id": "2015-01-15_22-54-45",
"ostf_sha": "92ad9f8e4c509c82e07ceb093b5d579205c76014",
"build_number": "63",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "051d23b22c21eab39c968372a5d40727c2b66281",
"production": "docker",
"fuelmain_sha": "",
"astute_sha": "82125b0eef4e5a758fd4afa8917812e09a1f7dac",
"feature_groups": ["mirantis"],
"release": "6.1",
"release_versions": {"2014.2-6.1": {"VERSION": {"build_id": "2015-01-15_22-54-45", "ostf_sha": "92ad9f8e4c509c82e07ceb093b5d579205c76014", "build_number": "63", "api": "1.0", "nailgun_sha": "051d23b22c21eab39c968372a5d40727c2b66281", "production": "docker", "fuelmain_sha": "", "astute_sha": "82125b0eef4e5a758fd4afa8917812e09a1f7dac", "feature_groups": ["mirantis"], "release": "6.1", "fuellib_sha": "59af43598682f4f0c5aebf584a959ac730a4d86d"}}},
"fuellib_sha": "59af43598682f4f0c5aebf584a959ac730a4d86d"

First deployment:
1. Create new environment (CentOS)
2. Choose nova-network, vlan manager
3. Choose Ceph for images
4. Choose Sahara and Ceilometer
5. Add 1 controller+ceph, 1 compute+ceph, 1 cinder+ceph, 2 mongo
6. Start deployment. It failed with an error on the controller (node-1):

2015-01-16 12:32:36 ERR

 (/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) change from notrun to 0 failed: ceph-deploy osd prepare node-1:/dev/sdb4 node-1:/dev/sdc4 returned 1 instead of one of [0]

Second deployment:
1. Create new environment (CentOS)
2. Choose neutron GRE
3. Choose Ceph for images
4. Choose Murano and Ceilometer
5. Add 1 controller+mongo, 1 compute+ceph+cinder, 1 cinder+mongo, 1 ceph
6. Start deployment. It failed with an error on the compute node (node-7):

2015-01-16 12:42:22 ERR

 (/Stage[main]/Ceph::Conf/Exec[ceph-deploy config pull]/returns) change from notrun to 0 failed: ceph-deploy --overwrite-conf config pull node-6 returned 1 instead of one of [0]

Revision history for this message
Anastasia Palkina (apalkina) wrote :
Changed in fuel:
status: New → Confirmed
Revision history for this message
Ryan Moe (rmoe) wrote :

In the first deployment it appears that the ceph-mon role wasn't run on node-1 (I don't see it running ceph-deploy mon create or calling gatherkeys). Also, Puppet was not run on node-4 (role=primary-mongo) but it was on node-5 (role=mongo).

In the second deployment Puppet wasn't run on node-6 (primary-mongo + primary-controller) at all, which meant no ceph-mon again.

Ryan Moe (rmoe)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Ryan Moe (rmoe)
Revision history for this message
Ryan Moe (rmoe) wrote :

The issue seems to be caused by having mongo nodes in the environment. When a controller+ceph role is present and there are mongo nodes in the environment, the tasks on the controller run out of order: it attempts to run ceph-osd before the controller role (which obviously fails). Removing the mongo nodes from the environment allows the deployment to succeed.

The deployment of mongo fails (see: https://bugs.launchpad.net/fuel/+bug/1419108 ) and I'm not sure if this is influencing the order of tasks on the controller.

Revision history for this message
Ryan Moe (rmoe) wrote :

It seems that this is only a problem when you have a single HA controller. Adding additional controllers works ok in this scenario.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/154677

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Ryan Moe (rmoe) wrote :

This issue should only occur with an environment that contains only one controller AND nodes with roles that depend only on 'controller'.

With the environment configuration described here we end up with a deployment order of:
['mongo', 'compute', 'cinder', 'ceph-osd']
['primary-mongo']
['primary-controller']

instead of the correct order:
['mongo']
['primary-mongo']
['primary-controller']
['compute', 'cinder', 'ceph-osd']

I believe the root cause is incorrect dependencies on the ceph-osd, compute, and cinder groups that cause them to be run out of order.

To order the groups for deployment we start with a list of root nodes in the graph (zabbix-server and base-os) and a list of all groups that have already been processed [0]. This initial list of already-processed nodes contains the difference between all available groups and the groups in our environment. For this particular deployment scenario we start with this: ['base-os', 'controller', 'zabbix-server'] (these are the roles that are not assigned to any node in the environment).

After the priorities are processed for the current groups (zabbix-server and base-os) we get the next set of groups to process [1]. Getting the next groups involves iterating over all roles in the graph and checking [2]:
1. That the current role's predecessors are in our list of processed nodes.
2. That the current role has not already been processed.

The failure in this case is that 'controller' is in the list of already processed groups. Because ceph-osd, compute, and cinder depend only on controller they pass check number one. They also do not exist in the list of processed nodes (as shown above) so the second check is satisfied. This means that cinder, compute, and ceph-osd all get added in parallel with the mongo group (which also passes both checks at this point). The primary-controller is not added at this time because its predecessors have not been processed (mongo and primary-mongo are in the environment and therefore not in the list of already-processed nodes yet).

This is not a problem with one controller and no mongo because in this case 'mongo' and 'primary-mongo' are in the list of already processed nodes (because they're not in the environment). This means that the first time primary-controller is checked its predecessors have already been processed and primary-controller is added to the groups to process.

[0] https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/deployment_graph.py#L343-L344
[1] https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/deployment_graph.py#L95
[2] https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/deployment_graph.py#L103-L104
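
A minimal sketch of the check described above, assuming a networkx-style graph where an edge points from a dependency to the group that requires it. This is only an illustration, not the actual deployment_graph.py code; the group names and edges are modeled on the roles in this environment:

import networkx as nx

g = nx.DiGraph()
# Edge (a, b) means group b depends on group a.
g.add_edges_from([
    ('mongo', 'primary-mongo'),
    ('primary-mongo', 'primary-controller'),
    ('primary-controller', 'controller'),
    ('controller', 'cinder'),
    ('controller', 'compute'),
    ('controller', 'ceph-osd'),
])

# Roles actually assigned in this environment (single controller, so no plain 'controller').
in_env = {'mongo', 'primary-mongo', 'primary-controller', 'cinder', 'compute', 'ceph-osd'}

# Groups absent from the environment start out as "already processed".
processed = set(g.nodes()) - in_env  # {'controller'}

def next_groups_buggy(graph, processed):
    # Direct-predecessor check: schedule a group once all of its
    # immediate predecessors have already been processed.
    return [n for n in graph
            if n not in processed
            and all(p in processed for p in graph.predecessors(n))]

print(next_groups_buggy(g, processed))
# ['mongo', 'cinder', 'compute', 'ceph-osd'] -- cinder, compute and ceph-osd
# are scheduled alongside mongo, before primary-controller has run.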

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Thanks a lot, this is very helpful.
We need to change the group traversal to be based not simply on direct predecessors, but on all requirements found during the traversal.
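
A sketch of that idea, reusing the example graph above: instead of checking only direct predecessors, check every requirement reachable on the path to the group. Again, this is an illustration of the approach, not the merged nailgun code:

def next_groups_fixed(graph, processed):
    # Full-requirement check: a group is scheduled only when every group
    # it transitively depends on has been processed (or is absent from
    # the environment and therefore pre-marked as processed).
    return [n for n in graph
            if n not in processed
            and all(a in processed for a in nx.ancestors(graph, n))]

print(next_groups_fixed(g, processed))
# ['mongo'] -- cinder, compute and ceph-osd now wait until
# primary-controller (and through it 'controller') has been processed.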

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/154828

Changed in fuel:
assignee: Ryan Moe (rmoe) → Dima Shulyak (dshulyak)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Ryan Moe (<email address hidden>) on branch: master
Review: https://review.openstack.org/154677

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/154828
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=4a90555aa4a707e2c55d82a639e47b35ff2d97f6
Submitter: Jenkins
Branch: master

commit 4a90555aa4a707e2c55d82a639e47b35ff2d97f6
Author: Dmitry Shulyak <email address hidden>
Date: Wed Feb 11 12:41:32 2015 +0200

    Use all dependencies during groups traversal

    Using direct predecessors during groups traversal can lead
    to wrong ordering in cases where predecessor is skipped.

    As an example, assigning mongo, one controller, and a role dependent on
    controller, such as cinder, will result in the following order:

    mongo, cinder
    primary-controller

    However, cinder has a dependency on controller, therefore the full
    path should be taken into account.

    Closes-bug: #1411660
    related to blueprint granular-deployment-based-on-tasks

    Change-Id: I7530642ec600edaec9d23dba359a7c19e112f200

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Alex Ermolov, why do you think that this bug is Invalid for 6.0.1? Please give an explanation for every bug you are moving to Invalid state.

Revision history for this message
Dima Shulyak (dshulyak) wrote :

It should not even affect 6.0.1; this was a bug caused by granular deployment.

Revision history for this message
Anastasia Palkina (apalkina) wrote :

Verified on ISO #141

"build_id": "2015-02-24_09-36-08", "ostf_sha": "1a0b2c6618fac098473c2ed5a9af11d3a886a3bb", "build_number": "141", "release_versions": {"2014.2-6.1": {"VERSION": {"build_id": "2015-02-24_09-36-08", "ostf_sha": "1a0b2c6618fac098473c2ed5a9af11d3a886a3bb", "build_number": "141", "api": "1.0", "nailgun_sha": "3df73a3cfdea921260bb440b3f572820c76eb01b", "production": "docker", "python-fuelclient_sha": "5657dbf06fddb74adb61e9668eb579a1c57d8af8", "astute_sha": "d81ff53c2f467151ecde120d3a4d284e3b5b3dfc", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "b975019fabdb429c1869047df18dd792d2163ecc", "fuellib_sha": "f94e5a2e5c08428a80e227d8c6e545debe578dfc"}}}, "auth_required": true, "api": "1.0", "nailgun_sha": "3df73a3cfdea921260bb440b3f572820c76eb01b", "production": "docker", "python-fuelclient_sha": "5657dbf06fddb74adb61e9668eb579a1c57d8af8", "astute_sha": "d81ff53c2f467151ecde120d3a4d284e3b5b3dfc", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "b975019fabdb429c1869047df18dd792d2163ecc", "fuellib_sha": "f94e5a2e5c08428a80e227d8c6e545debe578dfc"

Changed in fuel:
status: Fix Committed → Fix Released