failure of a 'critical' role marks all role members as failed

Bug #1398221 reported by Andrew Woodward
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Invalid | High | Fuel Sustaining |
7.0.x | Won't Fix | High | Fuel Python (Deprecated) |
Mitaka | Invalid | High | Fuel Sustaining |
Newton | Invalid | High | Fuel Sustaining |
Ocata | Invalid | High | Fuel Sustaining |

Bug Description

{"build_id": "2014-11-30_11-15-26", "ostf_sha": "dc66fd39d4d035bb972e4c0225591290593c459d", "build_number": "24", "auth_required": true, "api": "1.0", "nailgun_sha": "58e5f47457a0e832c005ce350e01b75a0c01b90a", "production": "docker", "fuelmain_sha": "f324b592399c544eace2f64cb499564da01ab38c", "astute_sha": "1da516b88d1a8d0014d78ab0d796e5b08379a59b", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-11-30_11-15-26", "ostf_sha": "dc66fd39d4d035bb972e4c0225591290593c459d", "build_number": "24", "api": "1.0", "nailgun_sha": "58e5f47457a0e832c005ce350e01b75a0c01b90a", "production": "docker", "fuelmain_sha": "f324b592399c544eace2f64cb499564da01ab38c", "astute_sha": "1da516b88d1a8d0014d78ab0d796e5b08379a59b", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "bbf26b499bf47ca41302ba6f62c3ebc5a493013d"}}}, "fuellib_sha": "bbf26b499bf47ca41302ba6f62c3ebc5a493013d"}

When a critical role fails, all nodes that have the same role are also set as failed. This makes it hard to troubleshoot which node caused the failure. It is especially noticeable when one of the ceph-osd nodes fails and there are 90 other nodes.

Expected result: a critical role fails and only the node that failed is marked as failed. The deployment is stopped.

Actual result: a critical role fails and all nodes with that role are marked as failed. The deployment is stopped.
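The difference between the two results can be illustrated with a minimal Python sketch. This is not Fuel, Nailgun, or Astute code: the `Node` class, both function names, and the status strings `deploying` and `error` are hypothetical and only model the behaviour described in this report.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of a deployment node, for illustration only.
    uid: str
    roles: tuple
    status: str = "deploying"


def mark_failed_by_role(nodes, failed_uid):
    """Reported (actual) behaviour: every node that shares a role
    with the failing node is marked as failed."""
    failed_roles = set(next(n for n in nodes if n.uid == failed_uid).roles)
    for node in nodes:
        if failed_roles & set(node.roles):
            node.status = "error"
    return nodes


def mark_failed_node_only(nodes, failed_uid):
    """Expected behaviour: only the node that actually failed is marked."""
    for node in nodes:
        if node.uid == failed_uid:
            node.status = "error"
    return nodes


def make_cluster():
    # Two ceph-osd nodes and one compute node (assumed example layout).
    return [
        Node("1", ("ceph-osd",)),
        Node("2", ("ceph-osd",)),
        Node("3", ("compute",)),
    ]


actual = [n.status for n in mark_failed_by_role(make_cluster(), "1")]
expected = [n.status for n in mark_failed_node_only(make_cluster(), "1")]
```

With the first function, node 2 is reported as failed even though only node 1 failed, which is what makes the failing ceph-osd node hard to find among many nodes with the same role.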

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)
milestone: 6.0 → 6.1
importance: Undecided → Medium
status: New → Triaged
Dima Shulyak (dshulyak)
tags: removed: nailgun
Vladimir Sharshov (vsharshov) wrote :

Please provide logs. They will help to check this problem. Today I saw logs from 90+ nodes with a ceph problem (https://bugs.launchpad.net/fuel/+bug/1398096)

tags: added: astute
Andrew Woodward (xarses)
Changed in fuel:
status: Triaged → Confirmed
tags: added: module-astute
Vladimir Sharshov (vsharshov) wrote :

Moving to 7.0

Changed in fuel:
milestone: 6.1 → 7.0
assignee: Vladimir Sharshov (vsharshov) → Fuel Astute Team (fuel-astute)
tags: added: qa-agree-7.0
Dmitry Pyzhov (dpyzhov)
tags: removed: astute
Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Fuel Python Team (fuel-python)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Sylwester Brzeczkowski (sbrzeczkowski)
Sylwester Brzeczkowski (sbrzeczkowski) wrote :

I think it's an astute problem.
https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/rpc/receiver.py#L260-L286 here we set statuses for all nodes according to the data nailgun receives from astute - so there's no grouping for critical/non-critical roles.
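The point of the comment above can be sketched as follows. This is a simplified illustration, not the code in `receiver.py`: the function name `apply_reported_statuses`, the dict-based node store, and the message shape are all assumptions. It only shows that a receiver which stores whatever status it is sent cannot tell a node that really failed from one that was marked failed because of its role.

```python
def apply_reported_statuses(stored, reported):
    """Copy each reported node status into the stored state as-is.

    stored:   dict mapping node uid -> current status (hypothetical store)
    reported: list of {"uid": ..., "status": ...} entries, as assumed to be
              sent by the orchestrator
    """
    for entry in reported:
        uid = entry["uid"]
        if uid in stored:  # ignore nodes the receiver does not know about
            stored[uid] = entry["status"]
    return stored


# If the orchestrator reports every ceph-osd node as failed, the receiver
# records exactly that; there is no grouping by critical role on this side.
state = {"1": "deploying", "2": "deploying", "3": "deploying"}
message = [
    {"uid": "1", "status": "error"},
    {"uid": "2", "status": "error"},
    {"uid": "9", "status": "error"},  # unknown node, skipped
]
result = apply_reported_statuses(state, message)
```

Under this assumption, any fix would have to change what the orchestrator reports rather than how the receiver stores it.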

Changed in fuel:
assignee: Sylwester Brzeczkowski (sbrzeczkowski) → Fuel Astute Team (fuel-astute)
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0 → 8.0
Aleksandr Didenko (adidenko) wrote :

Assigning it to fuel-python, since we're going to remove fuel-astute group shortly as discussed in ML.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Fuel Python Team (fuel-python)
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/8.0.x
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 8.0 → 9.0
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/313260

Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-astute (master)

Change abandoned by Julia Varlamova (<email address hidden>) on branch: master
Review: https://review.openstack.org/313260

no longer affects: fuel/newton
Julia Varlamova (jvarlamova) wrote :

Reproduced on a 10.0 ISO build. It seems to me that the failure of all critical nodes in this case may be related to puppet task dependencies, and is not an astute problem. Please correct me if I'm wrong.

Changed in fuel:
status: In Progress → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: added: 9-1-scope
Dmitry Pyzhov (dpyzhov)
tags: removed: 9-1-scope
Vladimir Kuklin (vkuklin) wrote :

This bug should be invalid for task-based deployment, but we would need to recheck it.

Changed in fuel:
importance: Medium → High
Vladimir Kuklin (vkuklin) wrote :

I retested the fault tolerance behaviour for all the possible scenarios and can confirm that failure of one of the critical nodes marks only that node as being in an error state. Other nodes are marked as 'stopped'.
