All nodes in error state after scaling because one compute node was unreachable

Bug #1502295 reported by Mykola Grygoriev on 2015-10-02
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Maciej Kwiek
MOS Maintenance
MOS Maintenance

Bug Description

Fuel 6.1.

Short description:
A customer successfully deployed a cloud with 20+ compute nodes. A day or two later, he tried to add one more compute node. The new compute node was deployed successfully, but then the astute task failed at the 'uploadfile' step, because one compute node was unavailable at that time and the mcollective agent couldn't reach it. After this, astute marked all nodes as "error" and set the cloud status to error.

The customer plans to use a lot of compute nodes, so one of them may well be unreachable when he scales up the cloud. Moreover, the unavailability of one or two compute nodes doesn't affect the whole cloud.

Steps to reproduce:
1. Deploy cloud with 1 controller and 2 compute nodes.
2. Make 1 compute node unreachable.
3. Scale up your cloud with 1 more compute node.

Current result:
After scaling up while 1 compute node is unreachable, all nodes are in error state.

Expected result:
After scaling up while 1 compute node is unreachable, only the unreachable node is in error state.

Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Fuel Python Team (fuel-python)
milestone: none → 6.1-updates
Dmitry Pyzhov (dpyzhov) on 2015-10-06
tags: removed: critical
Dmitry Pyzhov (dpyzhov) on 2015-10-08
tags: added: tricky
no longer affects: fuel/8.0.x
Maciej Kwiek (maciej-iai) wrote :

There is a workaround for this bug: when a node goes offline, remove it (the web UI has an option for removing offline nodes). After the offline node is removed, you can deploy new changes.

Maciej Kwiek (maciej-iai) wrote :

There should be a warning in the UI (or CLI) if you run a deployment while offline nodes have not been removed.

Andrew Woodward (xarses) wrote :

I don't think this properly addresses the problem. The root issue here is that in a multi-node task, if one node fails, all nodes in the task are marked as failed. It happens when the task itself fails too. When a task is run on a production cloud, this sets the entire cloud to failed. After this, the orchestrator wants to re-run all tasks on all nodes to resolve it.

This is further compounded by the fact that the pending state is removed when a task starts, not when it completes.

Bottom line: only the node(s) that failed in a task should be marked as error, and only the tasks that did not complete should be identified to run the next time changes are deployed.
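
The behaviour proposed in this comment could look roughly like the sketch below. It is not Nailgun or Astute code: the `Node` class, the `pending_tasks` field and the `apply_task_result` function are hypothetical names made up for illustration. The idea is that only the nodes that actually failed get the "error" status, and a task stops being pending on a node only once it has completed there.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical, simplified node record (not the Nailgun model).
    uid: str
    status: str = "ready"
    pending_tasks: set = field(default_factory=set)


def apply_task_result(nodes, task_name, failed_uids):
    """Record the outcome of one multi-node task, node by node.

    Nodes listed in failed_uids are marked "error" and keep the task
    pending, so only they are retried on the next deployment.
    All other nodes keep their status and have the task cleared.
    Returns the uids of the nodes that ended up in error.
    """
    failed = set(failed_uids)
    for node in nodes:
        if node.uid in failed:
            node.status = "error"
            # The task stays in pending_tasks: it did not complete here.
        else:
            # Cleared on completion, not when the task starts.
            node.pending_tasks.discard(task_name)
    return [n.uid for n in nodes if n.status == "error"]
```

With three nodes that all have `upload_file` pending and a failure reported only for node "2", only node "2" ends up in error and still has the task pending.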

Maciej Kwiek (maciej-iai) wrote :

As I see it, my patch fixes the bug, but it doesn't resolve the root cause, which is the lack of fault tolerance in the post-deployment phase. I think this issue should be handled in a separate, more general Launchpad bug/blueprint.

Ihor Kalnytskyi (ikalnytskyi) wrote :


We can't address the issue "do not mark all nodes in error state" right now, since it's our limitation. I mean, in the post-deployment stage we have tasks that are critical for clusters (such as enable_quorum) as well as non-critical ones (upload cirros or update host).

So if a post-deployment task has failed, we mark the entire deployment as being in error state, because we can't say whether the cluster is operational or not. I think we can go with @Maciej's fix for now, and keep in mind that a general solution should be addressed as a blueprint.

It just came to mind: what do you think about also marking **offline** nodes as **error**, so the user will notice that updates weren't applied there? It's ugly, but it will notify the cluster operator that redeployment is needed for these nodes.
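
To make the critical/non-critical distinction from this comment concrete, here is a minimal sketch of how a more general solution might decide the deployment status. This is an assumption about one possible design, not existing Fuel behaviour: the `CRITICAL_TASKS` set, the `deployment_status` function and the status strings are all hypothetical, and only the task names are taken from the comment above.

```python
# Hypothetical classification: which post-deployment tasks are critical.
# Only enable_quorum is named as critical in the discussion.
CRITICAL_TASKS = {"enable_quorum"}


def deployment_status(task_results):
    """Derive a deployment status from post-deployment task outcomes.

    task_results maps a task name to True (succeeded) or False (failed).
    A failed critical task puts the deployment in error; a failed
    non-critical task only produces a warning-level status.
    """
    failed = {name for name, ok in task_results.items() if not ok}
    if failed & CRITICAL_TASKS:
        return "error"
    if failed:
        return "operational_with_warnings"
    return "operational"
```

Under this sketch, a failed cirros upload alone would not put the deployment in error, while a failed enable_quorum would.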

Change abandoned by Maciej Kwiek (<email address hidden>) on branch: master
Reason: After discussing this with loles, the change needs to be done in Astute.

Dmitry Pyzhov (dpyzhov) on 2015-10-22
tags: added: area-python

Submitter: Jenkins
Branch: master

commit d10e78b0b03e64751adf197c4a5921fb2430059c
Author: Maciej Kwiek <email address hidden>
Date: Wed Oct 14 11:15:47 2015 +0200

    All offline nodes are removed as failed nodes

    remove_failed_nodes took only newly deployed nodes uids into
    consideration for checking for offline nodes. Now all nodes in cluster
    are checked for being available.

    Change-Id: Ifbdec3d6f8cd1b2751afb45c185efd5c5316a817
    Closes-bug: #1502295
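
The change described in this commit message can be illustrated with the simplified sketch below. It is not the actual Nailgun code: `split_offline_nodes`, its `check_all` flag and the dict-based node records are hypothetical and exist only to contrast the old behaviour (only newly deployed nodes are checked for being offline) with the new one (all nodes in the cluster are checked).

```python
def split_offline_nodes(cluster_nodes, nodes_to_deploy, check_all=True):
    """Return (usable_nodes, offline_uids) for a deployment run.

    Each node is a dict with "uid" and "online" keys (hypothetical shape).
    check_all=False mimics the old behaviour: only the newly deployed
    nodes are inspected, so an offline node that was deployed earlier
    stays among the task targets.
    check_all=True mimics the fix: every node in the cluster is inspected.
    """
    candidates = cluster_nodes if check_all else nodes_to_deploy
    offline_uids = {n["uid"] for n in candidates if not n["online"]}
    usable = [n for n in cluster_nodes if n["uid"] not in offline_uids]
    return usable, offline_uids
```

For a cluster with an online controller, an offline compute node deployed earlier and one new online compute node, the old mode reports no offline nodes, while the new mode excludes the offline compute node from the task targets.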

Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verification

Fuel 8.0 has been verified on ISO #185

[root@nailgun ~]# cat /etc/fuel/version.yaml
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "185"
  build_id: "185"
  fuel-nailgun_sha: "7d7366c2ec9b46e4ac90d9c6d3c9e7b87e40ac14"
  python-fuelclient_sha: "e685d68c1c0d0fa0491a250f07d9c3a8d0f9608c"
  fuel-agent_sha: "6f3026d8c8e0927ee8fdf9d3171d506674cc7130"
  fuel-nailgun-agent_sha: "16f5c1a1575a6b482f5159dd2e4b255c03167a7e"
  astute_sha: "c8400f51b0b92254da206de55ef89d17fdf35393"
  fuel-library_sha: "9e565fa8550c78e6391e1da10c07f8be3d329dec"
  fuel-ostf_sha: "c2e1fa0ca859c163a7ff445a70f1264d6be0893b"
  fuel-createmirror_sha: "994fed9b1ed889718b61a59733275c08c2dd4c64"
  fuelmenu_sha: "d12061b1aee82f81b3d074de74ea27a6e962a686"
  shotgun_sha: "c377d163519f6d10b69a654019d6086ba5f14edc"
  network-checker_sha: "a57e1d69acb5e765eb22cab0251c589cd76f51da"
  fuel-upgrade_sha: "1e894e26d4e1423a9b0d66abd6a79505f4175ff6"
  fuelmain_sha: "cd084cf5c4372a46184fb7c2f24568da4e030be2"

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released

Guys, this patch breaks the Reduced Footprint re-installation case, where we always have an offline controller node during the compute's re-installation process.

Change abandoned by Tony Breeds (<email address hidden>) on branch: stable/6.1
Reason: This branch (stable/6.1) is at End Of Life

Change abandoned by Tony Breeds (<email address hidden>) on branch: stable/7.0
Reason: This branch (stable/7.0) is at End Of Life
