Instances are not rescheduled after deploy fails

Bug #1671648 reported by Charles Volzka on 2017-03-09
This bug affects 2 people
Affects: OpenStack Compute (nova); Importance: High; Assigned to: Shunli Zhou
Affects: Ocata; Importance: High; Assigned to: Matt Riedemann

Bug Description

Steps to reproduce:
Pre-step: Force the deploy to fail in a way that allows it to be rescheduled. For testing, I forced the failure by adding raise nova.exception.ComputeResourcesUnavailable('forced failure') to the instance spawn on the host.
1. Make sure the environment is set to retry failed deploys.
2. Attempt to deploy a VM and wait for it to fail.

Expected result:
Failed instance is rescheduled and attempted on another host.

Actual result:
Deploy fails but is not rescheduled.

I am just beginning to experiment with an Ocata build from early March. I found that when an instance fails to deploy and throws a RescheduledException, it is not rescheduled as expected. The problem appears to be that filter_properties['retry'] is not set during the initial deploy.

On initial deploy, nova.conductor.manager.schedule_and_build_instances() schedules the build_request and creates the instance object. That method also creates the filter properties (filter_props) that are passed on to compute_rpcapi.build_and_run_instance(). The problem is that scheduler_utils.populate_retry() is not called before filter_props is passed on to the build call. When the deploy later fails on the host, nova.compute.manager._do_build_and_run_instance() catches the RescheduledException but does not try to reschedule the instance, because filter_properties.get('retry') returns None.
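The compute-side decision described above can be modelled in a few lines. This is only a simplified sketch, not nova's code: RescheduledException is a local stand-in class, and do_build_and_run_instance, spawn_always_fails and the returned status strings are hypothetical names chosen for illustration.

```python
class RescheduledException(Exception):
    """Local stand-in for nova.exception.RescheduledException."""


def spawn_always_fails():
    # Hypothetical spawn step that is forced to fail, as in the pre-step.
    raise RescheduledException("forced failure")


def do_build_and_run_instance(filter_properties, spawn):
    """Simplified model of the reschedule decision (assumption, not nova)."""
    try:
        spawn()
    except RescheduledException:
        retry = filter_properties.get("retry")
        if not retry:
            # No retry info was populated, so the build is given up on.
            return "failed"
        # Retry info exists, so the build goes back for another scheduling pass.
        return "rescheduled"
    return "active"


# Initial deploy in the broken flow: no 'retry' key in the filter properties.
result_without_retry = do_build_and_run_instance({}, spawn_always_fails)

# Same failure, but with retry info populated beforehand.
result_with_retry = do_build_and_run_instance(
    {"retry": {"num_attempts": 1, "hosts": []}}, spawn_always_fails)
```

In this model the only difference between the two calls is whether the 'retry' key is present, which matches the behaviour reported here.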

In the past, it looks like populate_retry() was called by nova.conductor.manager.build_instances() during the initial deploy. I'm not seeing build_instances() get called during the initial deploy after switching to Ocata. As an experiment, I added scheduler_utils.populate_retry(filter_props, build_request.instance_uuid) immediately after filter_props is set in schedule_and_build_instances(). Afterward, I do see the instances get rescheduled. I also noticed that nova.conductor.manager.build_instances() gets called for each attempt after the first.
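The effect of that experiment can be sketched as follows. This is a rough approximation under stated assumptions: populate_retry here is a simplified local stand-in for scheduler_utils.populate_retry, MAX_ATTEMPTS is a hypothetical constant standing in for the configured maximum number of scheduling attempts, and build_filter_props is an invented helper that is not part of nova.

```python
MAX_ATTEMPTS = 3  # hypothetical stand-in for the configured attempt limit


class MaxRetriesExceeded(Exception):
    """Local stand-in raised when the attempt limit is exceeded."""


def populate_retry(filter_properties, instance_uuid, max_attempts=MAX_ATTEMPTS):
    """Simplified stand-in: record one more scheduling attempt."""
    if max_attempts == 1:
        # Retries are effectively disabled, so nothing is recorded.
        return
    retry = filter_properties.setdefault(
        "retry", {"num_attempts": 0, "hosts": []})
    retry["num_attempts"] += 1
    if retry["num_attempts"] > max_attempts:
        raise MaxRetriesExceeded(
            "exceeded %d attempts for instance %s"
            % (max_attempts, instance_uuid))


def build_filter_props(instance_uuid, call_populate_retry):
    """Hypothetical helper modelling the conductor's initial deploy."""
    filter_props = {}
    if call_populate_retry:
        # The experimental one-line fix described in this report.
        populate_retry(filter_props, instance_uuid)
    return filter_props


broken = build_filter_props("uuid-1", call_populate_retry=False)
fixed = build_filter_props("uuid-1", call_populate_retry=True)
```

With the call in place, the filter properties carry a 'retry' entry from the first attempt onward, which is what the compute host needs in order to reschedule.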

Shunli Zhou (shunliz) on 2017-03-10
Changed in nova:
status: New → Confirmed
assignee: nobody → Shunli Zhou (shunliz)
Shunli Zhou (shunliz) wrote :

Tried on devstack.
nova / nova/conductor/manager.py:schedule_and_build_instances

filter_props = request_spec.to_legacy_filter_properties_dict()
scheduler_utils.populate_filter_properties(filter_props,
                                           host)

The filter_props here is an empty dict, so nothing is done in _add_retry_host and the retry feature is broken.

I will investigate this and work on this problem.
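The no-op described in this comment can be illustrated with a small sketch. It assumes that the helper returns early when the filter properties have no 'retry' entry. add_retry_host below is a simplified local stand-in written for illustration, not the actual _add_retry_host from nova, and the host and node names are made up.

```python
def add_retry_host(filter_properties, host, node):
    """Simplified stand-in: remember a host that was already tried."""
    retry = filter_properties.get("retry")
    if not retry:
        # Nothing to record into: this is the silent no-op from the comment.
        return
    retry["hosts"].append([host, node])


# Broken flow: filter properties without retry info stay untouched.
empty_props = {}
add_retry_host(empty_props, "host1", "node1")

# Expected flow: the attempted host is appended to the retry record.
populated_props = {"retry": {"num_attempts": 1, "hosts": []}}
add_retry_host(populated_props, "host1", "node1")
```

Because the first call changes nothing and raises nothing, the missing retry data goes unnoticed until a build fails and is not rescheduled.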

Fix proposed to branch: master
Review: https://review.openstack.org/444106

Changed in nova:
status: Confirmed → In Progress
tags: added: conductor
Changed in nova:
importance: Undecided → High
Matt Riedemann (mriedem) on 2017-03-13
tags: added: ocata-backport-potential
Changed in nova:
assignee: Shunli Zhou (shunliz) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2017-03-16
Changed in nova:
assignee: Matt Riedemann (mriedem) → Shunli Zhou (shunliz)

Reviewed: https://review.openstack.org/446209
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=72e1506101b131b51fbe77acc0af19f36899c28d
Submitter: Jenkins
Branch: master

commit 72e1506101b131b51fbe77acc0af19f36899c28d
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 15 16:58:11 2017 -0400

    Add a functional regression/recreate test for bug 1671648

    This adds a test which recreates the regression bug introduced
    in Ocata where build retries are not populated when creating
    instances in conductor for cells v2.

    The change that fixes the bug will go on top of this and modify
    the test to show the bug is fixed.

    Change-Id: Ie9e955d79b4e1441092183135b3f70b003c94db5
    Related-Bug: #1671648

Reviewed: https://review.openstack.org/444106
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cb4ce72f5f092644aa9b84fa58bcb9fd89b6bedc
Submitter: Jenkins
Branch: master

commit cb4ce72f5f092644aa9b84fa58bcb9fd89b6bedc
Author: ShunliZhou <email address hidden>
Date: Fri Mar 10 14:05:57 2017 +0800

    Add populate_retry to schedule_and_build_instances

    When boot an instance and failed on the compute node, nova will
    not retry to boot on other host.

    Since https://review.openstack.org/#/c/319379/ change the create
    instance workflow and called schedule_and_build_instances which
    not populate the retry into filter properties. So nova will not
    retry when boot on compute fail. This patch populate retry to
    instance properties when call schedule_and_build_instances.

    Change-Id: Ifdaddcd265a7fe8282499e27043936f8212610ad
    Closes-Bug: #1671648

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/446685
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b6b5438c3ddeedab6c7f83d1998d283f1bb503bc
Submitter: Jenkins
Branch: master

commit b6b5438c3ddeedab6c7f83d1998d283f1bb503bc
Author: melanie witt <email address hidden>
Date: Thu Mar 16 18:24:23 2017 +0000

    Fix functional regression/recreate test for bug 1671648

    There are a couple of issues with the test:

      1. It doesn't consider both hosts from the two compute services
         during scheduling.

      2. There is a race where sometimes claims.Claim.__init__ won't
         be called because if the RT instance_claim runs before
         update_available_resource has run, it will create a
         claims.NopClaim instead.

    This adds the RetryFilter to enabled_filters, adds set_nodes() calls
    to set the nodenames of each compute service to match its host,
    resulting in consideration of both hosts for scheduling, and stubs
    resource_tracker.ResourceTracker.instance_claim instead of
    claims.Claim.__init__.

    Related-Bug: #1671648

    Change-Id: I541c03a7960b8f135b005c43cb5c7bcb0b63b9ae

Reviewed: https://review.openstack.org/446261
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=26b3a749530cc4b56921ab9b4016a9346991b9ca
Submitter: Jenkins
Branch: stable/ocata

commit 26b3a749530cc4b56921ab9b4016a9346991b9ca
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 15 16:58:11 2017 -0400

    Add a functional regression/recreate test for bug 1671648

    This adds a test which recreates the regression bug introduced
    in Ocata where build retries are not populated when creating
    instances in conductor for cells v2.

    The change that fixes the bug will go on top of this and modify
    the test to show the bug is fixed.

    Change-Id: Ie9e955d79b4e1441092183135b3f70b003c94db5
    Related-Bug: #1671648
    (cherry picked from commit 72e1506101b131b51fbe77acc0af19f36899c28d)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/447014
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6304edf6082a8e36aabaa2a927b4a14da9df2402
Submitter: Jenkins
Branch: stable/ocata

commit 6304edf6082a8e36aabaa2a927b4a14da9df2402
Author: melanie witt <email address hidden>
Date: Thu Mar 16 18:24:23 2017 +0000

    Fix functional regression/recreate test for bug 1671648

    There are a couple of issues with the test:

      1. It doesn't consider both hosts from the two compute services
         during scheduling.

      2. There is a race where sometimes claims.Claim.__init__ won't
         be called because if the RT instance_claim runs before
         update_available_resource has run, it will create a
         claims.NopClaim instead.

    This adds the RetryFilter to enabled_filters, adds set_nodes() calls
    to set the nodenames of each compute service to match its host,
    resulting in consideration of both hosts for scheduling, and stubs
    resource_tracker.ResourceTracker.instance_claim instead of
    claims.Claim.__init__.

    Conflicts:
     nova/tests/functional/regressions/test_bug_1671648.py

    NOTE(mriedem): The conflict is due to this patch coming after
    cb4ce72f5f092644aa9b84fa58bcb9fd89b6bedc in Pike. Since this
    is a fix for the functional test that the bug fix builds on,
    we actually want this to come *before* the bug fix backport.

    Related-Bug: #1671648

    Change-Id: I541c03a7960b8f135b005c43cb5c7bcb0b63b9ae
    (cherry picked from commit b6b5438c3ddeedab6c7f83d1998d283f1bb503bc)

Reviewed: https://review.openstack.org/446262
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4e3be434bd0b6f7c0add4e210e1f674b80fe54f4
Submitter: Jenkins
Branch: stable/ocata

commit 4e3be434bd0b6f7c0add4e210e1f674b80fe54f4
Author: ShunliZhou <email address hidden>
Date: Fri Mar 10 14:05:57 2017 +0800

    Add populate_retry to schedule_and_build_instances

    When boot an instance and failed on the compute node, nova will
    not retry to boot on other host.

    Since https://review.openstack.org/#/c/319379/ change the create
    instance workflow and called schedule_and_build_instances which
    not populate the retry into filter properties. So nova will not
    retry when boot on compute fail. This patch populate retry to
    instance properties when call schedule_and_build_instances.

    Conflicts:
     nova/tests/functional/regressions/test_bug_1671648.py

    NOTE(mriedem): The conflict is due to putting the functional
    test fix before this bug fix in the backport series.

    Change-Id: Ifdaddcd265a7fe8282499e27043936f8212610ad
    Closes-Bug: #1671648
    (cherry picked from commit cb4ce72f5f092644aa9b84fa58bcb9fd89b6bedc)

This issue was fixed in the openstack/nova 15.0.2 release.

This issue was fixed in the openstack/nova 16.0.0.0b1 development milestone.
