Scale down of compute node fails if nova-compute is not running properly

Bug #1860694 reported by Sai Sindhur Malleni
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Undecided
Assigned to: Sai Sindhur Malleni
Milestone: (none)

Bug Description

During an overcloud node delete that removes a compute node, we assume that
nova-compute is enabled on the node and working as expected. However, the
scale down fails when the node being removed is not behaving correctly as a
compute node (nova containers not running or not reporting to the overcloud).
This patch includes a check to only disable and stop nova services if they
are running.

When running /var/lib/mistral/overcloud/scale_steps_tasks.yaml for scale down, we see:

TASK [Disable nova-compute service] ********************************************
Thursday 23 January 2020 15:20:58 +0000 (0:00:00.087) 0:01:52.056 ******
fatal: [overcloud-r630compute-0]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: list object has no element 0\n\nThe error appears to be in '/var/lib/mistral/overcloud/scale_steps_tasks.yaml': line 24, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n - (nova_compute_service | length) > 1\n - check_mode: false\n ^ here\n"}

This is due to the compute service not running properly on the node we are trying to remove.

We need better handling of this case.
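
One way to handle this, as a rough sketch only and not necessarily what the merged change does: guard the disable task on the number of matching services, so that it is skipped when the list is empty. The variable name nova_compute_service comes from the error output above. The `Host` field, the `delegate_to: localhost` placement, and the task names are assumptions for illustration; the sketch assumes the list was built from the output of `openstack compute service list` filtered to this node.

```yaml
# Hypothetical sketch: nova_compute_service is assumed to be a list of
# nova-compute service records for this node, each with a "Host" field.

- name: Fail if more than one nova-compute service matches this host
  fail:
    msg: "Found multiple nova-compute services for {{ ansible_facts['fqdn'] }}"
  when: (nova_compute_service | length) > 1

- name: Disable nova-compute service
  command: >-
    openstack compute service set
    {{ nova_compute_service[0].Host }} nova-compute --disable
  delegate_to: localhost
  check_mode: false
  # Skip when no service is registered, so element 0 of an empty
  # list is never dereferenced.
  when: (nova_compute_service | length) == 1
```

Because Ansible evaluates `when` before templating a task's arguments, the `nova_compute_service[0]` lookup should not be reached on a node whose compute service never registered, and the scale down can continue with the remaining steps.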

description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/704037

Changed in tripleo:
assignee: nobody → Sai Sindhur Malleni (smalleni)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/704037
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=119769384f944e20b0f11c86ed68c5ffeb8385c5
Submitter: Zuul
Branch: master

commit 119769384f944e20b0f11c86ed68c5ffeb8385c5
Author: Sai Sindhur Malleni <email address hidden>
Date: Thu Jan 23 11:49:28 2020 -0500

    Check to make sure compute service is deployed before scale down

    Currently during a node scale down using openstack overcloud
    node delete, we assume that nova-compute is enabled on the
    node and is working as expected. However, the node scale down
    fails in cases where the node being scaled is not correctly
    behaving as a compute node (nova containers not running/reporting
    to the overcloud). This patch includes a check to only disable
    and stop nova services if they are running. We ran into this
    scenario when we wanted to scale down a node that did not cleanly
    deploy as a compute node due to failure in step 5 in a large scale
    environment.

    Change-Id: Ic8225af65c409b6a32d4bb2def370c7c802147fa
    Co-Authored-By: Luke Short <email address hidden>
    Closes-Bug: #1860694
    Signed-off-by: Sai Sindhur Malleni <email address hidden>

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/704220

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/704220
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=557c0c358f9b9eacc74f1bfe169af7cfa2c18cd0
Submitter: Zuul
Branch: stable/train

commit 557c0c358f9b9eacc74f1bfe169af7cfa2c18cd0
Author: Sai Sindhur Malleni <email address hidden>
Date: Thu Jan 23 11:49:28 2020 -0500

    Check to make sure compute service is deployed before scale down

    Currently during a node scale down using openstack overcloud
    node delete, we assume that nova-compute is enabled on the
    node and is working as expected. However, the node scale down
    fails in cases where the node being scaled is not correctly
    behaving as a compute node (nova containers not running/reporting
    to the overcloud). This patch includes a check to only disable
    and stop nova services if they are running. We ran into this
    scenario when we wanted to scale down a node that did not cleanly
    deploy as a compute node due to failure in step 5 in a large scale
    environment.

    Change-Id: Ic8225af65c409b6a32d4bb2def370c7c802147fa
    Co-Authored-By: Luke Short <email address hidden>
    Closes-Bug: #1860694
    Signed-off-by: Sai Sindhur Malleni <email address hidden>
    (cherry picked from commit 119769384f944e20b0f11c86ed68c5ffeb8385c5)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.1.0

This issue was fixed in the openstack/tripleo-heat-templates 12.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.
