Fault Tolerance is broken in Task-based Deployment

Bug #1435610 reported by Andrew Woodward
This bug affects 3 people
Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Committed  High        Bulat Gaifullin
Mitaka              Fix Released   High        Vladimir Sharshov

Bug Description

I broke upload_cirros while testing some code; as a result, the deployment marked all of the nodes as failed.

Expected result: only the node on which upload_cirros failed should be marked as failed.

In this case all nodes/roles are collected to run again, which is wrong. None of the nodes should be collected for re-deployment.

Andrew Woodward (xarses) wrote :
Dima Shulyak (dshulyak) wrote :

There is a comment from Vova S. exactly on this topic:

  https://github.com/stackforge/fuel-astute/blob/master/lib/astute/deployment_engine/granular_deployment.rb#L221-223

I think it can be improved in the following way: we should fail only the nodes that have tasks assigned to them after the failed one.
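The proposal can be illustrated with a short sketch. It is not Astute code: it assumes the deployment is an ordered list of tasks, each carrying a `uids` list like the hooks quoted in this thread, and the function name `nodes_to_fail` is hypothetical. Nodes are counted from the failed task onward, so the node that ran the failed task is included.

```python
from typing import Dict, List, Set


def nodes_to_fail(tasks: List[Dict], failed_index: int) -> Set[str]:
    """Collect node ids that have work in the failed task or in any later task."""
    affected: Set[str] = set()
    for task in tasks[failed_index:]:
        affected.update(task["uids"])
    return affected


# Hypothetical task order: the second task (upload_cirros) fails.
tasks = [
    {"name": "deploy", "uids": ["1", "2", "3"]},
    {"name": "upload_cirros", "uids": ["1"]},
    {"name": "post_hook_host", "uids": ["1", "3"]},
]

print(sorted(nodes_to_fail(tasks, failed_index=1)))  # prints ['1', '3']
```

Node 2 has no task after the failure in this made-up plan, so it would keep its current status.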

Dima Shulyak (dshulyak)
Changed in fuel:
importance: High → Medium
Dmitry Pyzhov (dpyzhov)
tags: added: feature-tasks
Dmitry Pyzhov (dpyzhov)
tags: added: module-tasks
removed: feature-tasks
Dima Shulyak (dshulyak)
Changed in fuel:
milestone: 6.1 → 7.0
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
importance: Medium → High
assignee: Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)
milestone: 7.0 → 6.1
Changed in fuel:
status: Confirmed → In Progress
Vladimir Sharshov (vsharshov) wrote :

Guys, at the moment we cannot change this behavior because of our reporting logic. Nodes that have already reached the ready status, except the primary-controller, are excluded from future tasks. In our case this means that post tasks that come after a failed task can run without the necessary nodes, because we have already informed Nailgun about the 'ready' status of those nodes.

Example:

Nodes: 1, 2, 3
post_hook_cirros (runs on node 1)
post_hook_host (runs on all nodes: 1, 2, 3)

If post_hook_cirros fails and we change our code, we get the following after a re-run:

post_hook_cirros (runs on node 1, because this node is marked as failed)
post_hook_host (runs on node 1 but not on nodes 2 and 3, because those nodes were already marked as ready in another task)
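The exclusion in this example can be reproduced with a small simulation. It is only a sketch of the reporting logic as described in the comment, not Astute or Nailgun code: the function name `nodes_for_rerun` and the status strings are assumptions made for illustration, and the primary-controller exception is left out.

```python
from typing import Dict, List


def nodes_for_rerun(task_uids: List[str], node_status: Dict[str, str]) -> List[str]:
    """Keep only the nodes that have not already been reported as ready."""
    return [uid for uid in task_uids if node_status[uid] != "ready"]


# Statuses after the first run, if only the node with the failed hook is marked failed.
node_status = {"1": "error", "2": "ready", "3": "ready"}

# Post hooks and the nodes they are meant to run on.
hooks = {
    "post_hook_cirros": ["1"],
    "post_hook_host": ["1", "2", "3"],
}

rerun_plan = {name: nodes_for_rerun(uids, node_status) for name, uids in hooks.items()}
print(rerun_plan)
```

In the resulting plan post_hook_host is scheduled on node 1 only, which is the incomplete re-run the comment warns about.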

All we can do to help the user is show a detailed message about the failed hook, and we already do that (sorry, I did not take a screenshot for the cirros error, but I did for another similar error):
https://www.dropbox.com/s/ybgij3zdmm1yvw2/Nailgun-info-error.png?dl=0

I suggest moving it to 7.0.

Vladimir Sharshov (vsharshov) wrote :

Potential solution: we can change Nailgun behavior and always send post hooks for all nodes in the cluster regardless of node status. However, we would need to check all post hook tasks and change them where necessary, because they currently assume that they run only on the nodes being deployed (excluding the host hook).

Vladimir Sharshov (vsharshov) wrote :

Another solution is tied to a possible future change in another bug: https://bugs.launchpad.net/fuel/+bug/1439776. The two are closely related.

Changed in fuel:
status: In Progress → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: added: feature
Changed in fuel:
milestone: 6.1 → 7.0
Mike Scherbakov (mihgen) wrote :

This is clearly not a feature. This is a bug. It was not designed to behave this way. If we can't fix it in 6.1, let's see if we need to provide some documentation for this.

tags: removed: feature
Aviram Bar-Haim (aviramb) wrote :

Upload_cirros.rb fails for us at the end of CentOS installations using ISOs 361 and 395.
Do we have an open bug for this issue?

Failure message:
Deployment has failed. Method granular_deploy. Failed to execute hook 'shell'.
---
priority: 800
fail_on_error: true
type: shell
uids:
- '4'
parameters:
  retries: 3
  cmd: ruby /etc/puppet/modules/osnailyfacter/modular/astute/upload_cirros.rb
  timeout: 180
  interval: 20
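For readers unfamiliar with hook parameters, the following sketch shows one plausible reading of `retries`, `interval` and `timeout`. It is an assumption, not Astute's actual implementation: here `retries` is taken as the total number of attempts, `interval` as the pause in seconds between attempts, and `timeout` as the per-attempt limit in seconds. The function name `run_shell_hook` is hypothetical.

```python
import subprocess
import time


def run_shell_hook(cmd: str, retries: int = 3, interval: float = 20, timeout: float = 180) -> bool:
    """Run cmd until it exits with status 0; return True on success, False otherwise."""
    for attempt in range(1, retries + 1):
        try:
            result = subprocess.run(cmd, shell=True, timeout=timeout)
            if result.returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            # Assumption: a timed-out attempt counts as a failed attempt.
            pass
        if attempt < retries:
            time.sleep(interval)
    return False
```

With `fail_on_error: true`, a `False` result from such a hook is what turns into the "Failed to execute hook 'shell'" deployment failure quoted above.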

Vladimir Sharshov (vsharshov) wrote :

Still relevant; https://bugs.launchpad.net/fuel/+bug/1435610/comments/3 is the best explanation of why we cannot change it at the moment.

tags: added: covered-by-bp
Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → nobody
assignee: nobody → Fuel Python Team (fuel-python)
Vladimir Sharshov (vsharshov) wrote :
Vladimir Sharshov (vsharshov) wrote :

Moving to 8.0. Architectural limitation. It cannot be fixed without https://blueprints.launchpad.net/fuel/+spec/progress-bar-based-on-tasks

tags: added: known-issue
Changed in fuel:
status: Confirmed → Won't Fix
tags: added: qa-agree-8.0
Ihor Kalnytskyi (ikalnytskyi) wrote :

Adding the 'feature' tag, since it's covered by a blueprint and requires changes.

tags: added: feature
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0 → 8.0
no longer affects: fuel/8.0.x
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Changed in fuel:
milestone: 8.0 → 9.0
Changed in fuel:
milestone: 9.0 → 10.0
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Fuel Toolbox (fuel-toolbox)
Changed in fuel:
assignee: Fuel Toolbox (fuel-toolbox) → Vladimir Sharshov (vsharshov)
summary: - upload_cirros failed and marked all nodes failed
+ Fault Tolerance is broken in Task-based Deployment
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/320605

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Bulat Gaifullin (bgaifullin)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/323440

Changed in fuel:
assignee: Bulat Gaifullin (bgaifullin) → Vladimir Kuklin (vkuklin)
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Bulat Gaifullin (bgaifullin)
Changed in fuel:
assignee: Bulat Gaifullin (bgaifullin) → Vladimir Kuklin (vkuklin)
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Vladimir Sharshov (vsharshov)
Alexey Shtokolov (ashtokolov) wrote :

ETA: 6/07

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/324808

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/323440
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=a7a6b04d3efa0218f5667eda02418793722b3aa0
Submitter: Jenkins
Branch: master

commit a7a6b04d3efa0218f5667eda02418793722b3aa0
Author: Vladimir Kuklin <email address hidden>
Date: Tue May 31 18:10:20 2016 +0300

    Add fault tolerance to task groups

    This commit is a part of defining fault tolerance groups
    for deployment to allow task executor to detect whether
    we should stop the deployment and exit in case of failure of
    tasks belonging to particular groups.

    It allows a user to specify critical nodes (e.g. by running

    Related to Change-Id I1969b953eca667c09248a6b67ffee37bfd20f474 and
    Ica2a4ae64b4dfa4f7fccfbc95108d1412c40dc3f

    Change-Id: Id866cd578c7c76dd5a1dfc43fb219e1c2ecd4abd
    Partial-bug: #1435610

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Bulat Gaifullin (bgaifullin)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/325886

Changed in fuel:
assignee: Bulat Gaifullin (bgaifullin) → Vladimir Kuklin (vkuklin)
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Bulat Gaifullin (bgaifullin)
Changed in fuel:
assignee: Bulat Gaifullin (bgaifullin) → Vladimir Sharshov (vsharshov)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/323183
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=ebe80dc4ef0dc9d0d79216f787d2357c9a03fd1b
Submitter: Jenkins
Branch: master

commit ebe80dc4ef0dc9d0d79216f787d2357c9a03fd1b
Author: Bulat Gaifullin <email address hidden>
Date: Mon May 30 19:51:18 2016 +0300

    Added fault_tolerance_group to deployment metadata

    This property contains a list of groups built from tasks
    of type 'group'. Each such task may contain the property
    fault_tolerance, which is to be moved from openstack.yaml
    to the deployment tasks.
    For plugins this attribute is filled from roles_metadata
    for all tasks of type group (for backward compatibility).

    DocImpact
    Partial-Bug: 1435610
    Change-Id: I1969b953eca667c09248a6b67ffee37bfd20f474
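A rough sketch of how such a list could be derived from deployment tasks follows. It is not the fuel-web implementation: the task keys `id`, `type`, `roles` and `fault_tolerance`, the output keys `name` and `node_ids`, the `nodes_by_role` mapping, and the default for groups without the property are all assumptions made for illustration.

```python
from typing import Dict, List


def build_fault_tolerance_groups(tasks: List[Dict], nodes_by_role: Dict[str, List[str]]) -> List[Dict]:
    """Build one fault tolerance entry per task of type 'group'."""
    groups = []
    for task in tasks:
        if task.get("type") != "group":
            continue  # only group tasks describe fault tolerance
        node_ids = sorted(
            {uid for role in task.get("roles", []) for uid in nodes_by_role.get(role, [])}
        )
        groups.append({
            "name": task["id"],
            "node_ids": node_ids,
            # Assumption: a group without the property tolerates any number of failures.
            "fault_tolerance": task.get("fault_tolerance", len(node_ids)),
        })
    return groups


# Hypothetical input: two group tasks and one non-group task that is ignored.
tasks = [
    {"id": "primary-controller", "type": "group", "roles": ["primary-controller"], "fault_tolerance": 0},
    {"id": "compute", "type": "group", "roles": ["compute"], "fault_tolerance": 1},
    {"id": "upload_cirros", "type": "shell"},
]
nodes_by_role = {"primary-controller": ["1"], "compute": ["2", "3"]}

fault_tolerance_groups = build_fault_tolerance_groups(tasks, nodes_by_role)
print(fault_tolerance_groups)
```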

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/326086

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Bulat Gaifullin (bgaifullin)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/326088

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/325886
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=54002b3308f44a1219e80df2d4f803cda4668df3
Submitter: Jenkins
Branch: master

commit 54002b3308f44a1219e80df2d4f803cda4668df3
Author: Vladimir Kuklin <email address hidden>
Date: Tue May 31 18:10:20 2016 +0300

    Adjust fault tolerance for task groups to zero tolerance for critical roles

    Set fault tolerance to 0 for critical deployment groups

    Related to Change-Id I1969b953eca667c09248a6b67ffee37bfd20f474 and
    Ica2a4ae64b4dfa4f7fccfbc95108d1412c40dc3f

    Change-Id: I5197adc796603dfb40cf1efa57427344b358d353
    Partial-bug: #1435610

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/326317

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/326447

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/324808
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=df1a70a4ce6af9944e8925123f5f75e0934ae201
Submitter: Jenkins
Branch: master

commit df1a70a4ce6af9944e8925123f5f75e0934ae201
Author: Alexander Kurenyshev <email address hidden>
Date: Thu Jun 2 17:56:53 2016 +0300

    Add new check for Operational cluster status

    We have a new behaviour where the deployment task
    will be in a ready state even when some
    non-important nodes are in an Error state.
    This patch adds a check that the cluster is
    in the Operational state.

    Related-Bug: #1435610
    Change-Id: I53175e4a84f2fbeedc056e39f2976c5f1a690fc1

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/320605
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=5a9f87c08062d3f0a23116b1a339da3252a69f24
Submitter: Jenkins
Branch: master

commit 5a9f87c08062d3f0a23116b1a339da3252a69f24
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Tue May 24 20:46:30 2016 +0300

    Gracefully stop if tolerance limit exceeded

    Several changes:

    - support fault tolerance groups;
    - support an internal deployment stop instead of raising in
      case of error;
    - do not show the last run summary debug report from mcollective;
    - fix detection of offline nodes before the deployment runs;
    - support fail on error behavior.

    Support fault tolerance groups

      Nailgun sends fault tolerance groups, which inform Astute about
      the number of failed nodes allowed in this deployment and the
      importance of every node in this task.

      If the number of errors exceeds the allowed number, the
      deployment will stop.

    Support an internal deployment stop instead of raising in case of error

      Before this change Astute ended processing, marked all nodes
      as error and did not wait for the puppet processes on the nodes.

      Now we use the same approach as for stop deployment:
      failed nodes are marked as error, other nodes as skipped (stopped),
      and ready nodes as ready. Astute also waits for the current
      tasks to end.

    Do not show the last run summary debug report from mcollective

      At the moment it is not very useful, but it quickly fills the log
      file and makes debugging difficult.

    Fix detection of offline nodes before the deployment runs

      Astute uses the mcollective response to detect node availability.
      If a node does not respond, it is marked as failed. This also
      respects the fault tolerance mechanism.

    Support fail on error behavior

      From now on, if a task sets fail_on_error to false, the task is
      marked as skipped instead of failed in case of error.

    Change-Id: Ica2a4ae64b4dfa4f7fccfbc95108d1412c40dc3f
    Closes-Bug: #1435610
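The stop condition and the fail_on_error rule from this commit message can be sketched as follows. This is a simplified illustration rather than the fuel-astute code (which is written in Ruby); the class `FaultToleranceGroup` and the functions `should_stop` and `task_status_on_error` are hypothetical names, and a group's limit is assumed to be a plain count of nodes allowed to fail.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FaultToleranceGroup:
    name: str
    node_ids: List[str]
    fault_tolerance: int  # number of nodes in this group that may fail
    failed: Set[str] = field(default_factory=set)

    def register_failure(self, node_id: str) -> None:
        """Record a failed node if it belongs to this group."""
        if node_id in self.node_ids:
            self.failed.add(node_id)

    def exceeded(self) -> bool:
        """True when more nodes failed than the group tolerates."""
        return len(self.failed) > self.fault_tolerance


def should_stop(groups: List[FaultToleranceGroup]) -> bool:
    """The deployment stops as soon as any group exceeds its limit."""
    return any(group.exceeded() for group in groups)


def task_status_on_error(fail_on_error: bool) -> str:
    """A task with fail_on_error set to false is skipped rather than failed."""
    return "failed" if fail_on_error else "skipped"


# Hypothetical cluster: one critical controller, two computes with one allowed failure.
groups = [
    FaultToleranceGroup("controller", ["1"], fault_tolerance=0),
    FaultToleranceGroup("compute", ["2", "3"], fault_tolerance=1),
]
for group in groups:
    group.register_failure("2")
print(should_stop(groups))  # prints False: one failed compute is tolerated
```

A failure on node 1 would make `should_stop` return `True`, since the controller group tolerates no failures.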

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/326485

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/326317
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=5e79256424fd82f39506f13a4313cc985765d9d5
Submitter: Jenkins
Branch: stable/mitaka

commit 5e79256424fd82f39506f13a4313cc985765d9d5
Author: Vladimir Kuklin <email address hidden>
Date: Tue May 31 18:10:20 2016 +0300

    Adjust fault tolerance for task groups to zero tolerance for critical roles

    Set fault tolerance to 0 for critical deployment groups

    Related to Change-Id I1969b953eca667c09248a6b67ffee37bfd20f474 and
    Ica2a4ae64b4dfa4f7fccfbc95108d1412c40dc3f

    Change-Id: I5197adc796603dfb40cf1efa57427344b358d353
    Partial-bug: #1435610

tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/mitaka)

Reviewed: https://review.openstack.org/326088
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=66b1609df3b3a80a0e9c04ec392a1f4f608601a6
Submitter: Jenkins
Branch: stable/mitaka

commit 66b1609df3b3a80a0e9c04ec392a1f4f608601a6
Author: Bulat Gaifullin <email address hidden>
Date: Mon May 30 19:51:18 2016 +0300

    Added fault_tolerance_group to deployment metadata

    This property contains a list of groups built from tasks
    of type 'group'. Each such task may contain the property
    fault_tolerance, which is to be moved from openstack.yaml
    to the deployment tasks.
    For plugins this attribute is filled from roles_metadata
    for all tasks of type group (for backward compatibility).

    DocImpact
    Partial-Bug: 1435610
    Change-Id: I1969b953eca667c09248a6b67ffee37bfd20f474

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (stable/mitaka)

Reviewed: https://review.openstack.org/326447
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=df643556114bbf767439cf3d98fa33b615987e50
Submitter: Jenkins
Branch: stable/mitaka

commit df643556114bbf767439cf3d98fa33b615987e50
Author: Alexander Kurenyshev <email address hidden>
Date: Thu Jun 2 17:56:53 2016 +0300

    Add new check for Operational cluster status

    We have a new behaviour where the deployment task
    will be in a ready state even when some
    non-important nodes are in an Error state.
    This patch adds a check that the cluster is
    in the Operational state.

    Related-Bug: #1435610
    Change-Id: I53175e4a84f2fbeedc056e39f2976c5f1a690fc1

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/326086
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=e66e5f3197ec1b3f641ace2c135c284013b26e75
Submitter: Jenkins
Branch: master

commit e66e5f3197ec1b3f641ace2c135c284013b26e75
Author: Bulat Gaifullin <email address hidden>
Date: Mon Jun 6 21:39:17 2016 +0300

    Reworked calculate_fault_tolerance to make it more clear

    Change-Id: I1a4dd0985ce0d00ef9ed39d7e3fd7895212ba012
    Partial-Bug: 1435610

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/mitaka)

Reviewed: https://review.openstack.org/326485
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=1a8e53cb5bf41b7afc2333cef91b162455ebe1f9
Submitter: Jenkins
Branch: stable/mitaka

commit 1a8e53cb5bf41b7afc2333cef91b162455ebe1f9
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Tue May 24 20:46:30 2016 +0300

    Gracefully stop if tolerance limit exceeded

    Several changes:

    - support fault tolerance groups;
    - support an internal deployment stop instead of raising in
      case of error;
    - do not show the last run summary debug report from mcollective;
    - fix detection of offline nodes before the deployment runs;
    - support fail on error behavior.

    Support fault tolerance groups

      Nailgun sends fault tolerance groups, which inform Astute about
      the number of failed nodes allowed in this deployment and the
      importance of every node in this task.

      If the number of errors exceeds the allowed number, the
      deployment will stop.

    Support an internal deployment stop instead of raising in case of error

      Before this change Astute ended processing, marked all nodes
      as error and did not wait for the puppet processes on the nodes.

      Now we use the same approach as for stop deployment:
      failed nodes are marked as error, other nodes as skipped (stopped),
      and ready nodes as ready. Astute also waits for the current
      tasks to end.

    Do not show the last run summary debug report from mcollective

      At the moment it is not very useful, but it quickly fills the log
      file and makes debugging difficult.

    Fix detection of offline nodes before the deployment runs

      Astute uses the mcollective response to detect node availability.
      If a node does not respond, it is marked as failed. This also
      respects the fault tolerance mechanism.

    Support fail on error behavior

      From now on, if a task sets fail_on_error to false, the task is
      marked as skipped instead of failed in case of error.

    Change-Id: Ica2a4ae64b4dfa4f7fccfbc95108d1412c40dc3f
    Closes-Bug: #1435610
    (cherry picked from commit 5a9f87c08062d3f0a23116b1a339da3252a69f24)

Maksym Strukov (unbelll)
tags: added: on-verification
Maksym Strukov (unbelll) wrote :

Verified as fixed in 9.0-mos-490

tags: removed: on-verification