Cancelling an update results in a Nova server with no network interface

Bug #1693495 reported by huangtianhua
Affects: OpenStack Heat
Status: In Progress
Importance: High
Assigned to: huangtianhua

Bug Description

1. Create a stack containing a Nova server attached to network1.
2. Update the stack to move the server to network2; after a few seconds, cancel the update while it is in progress.
3. The stack finally ends up in ROLLBACK_COMPLETE.
4. But the server has no network interface at all.

This happens because we only update the rsrc_defn after the update handler completes:
*****************
...
yield self.action_handler_task(action,
                               args=[after, tmpl_diff,
                                     prop_diff])
self.t = after
self.reparse()
...
*****************

If the cancel arrives before the rsrc_defn is updated, the network information ends up incorrect:
Case A: cancel right after the old interface is detached ---- the server is left with no interfaces at all
Case B: cancel right after the new interface is attached but before 'self.t = after' ---- the server has an interface in network2, but its template still says network1
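The race described above can be sketched with a plain Python generator (this is an illustrative model, not Heat's actual code; all names are hypothetical). Heat delivers cancellation by closing the task, so whichever step the generator is paused at determines what state is lost:

```python
# Illustrative sketch of the race: cancelling between the detach step and
# the template commit loses interface state. Names here are hypothetical.

def update_server(server, old_net, new_net):
    """Generator modelling the multi-step update; cancellation is delivered
    by closing the generator (GeneratorExit) at whatever step it is paused."""
    server['interfaces'].remove(old_net)   # step 1: detach old interface
    yield                                  # cancel here -> case A: no interfaces
    server['interfaces'].append(new_net)   # step 2: attach new interface
    yield                                  # cancel here -> case B: stale template
    server['template_net'] = new_net       # only now is the rsrc_defn updated

server = {'interfaces': ['network1'], 'template_net': 'network1'}
task = update_server(server, 'network1', 'network2')
next(task)     # ran step 1, paused at the first yield
task.close()   # cancel: the generator never reaches step 2

print(server)  # {'interfaces': [], 'template_net': 'network1'}  -- case A
```

Closing one `yield` later instead reproduces case B: the interface list says network2 while `template_net` still says network1.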

Changed in heat:
assignee: nobody → huangtianhua (huangtianhua)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/474838

Changed in heat:
status: New → In Progress
Revision history for this message
huangtianhua (huangtianhua) wrote :

In convergence mode there is another problem:
https://bugs.launchpad.net/heat/+bug/1698135

Changed in heat:
importance: Undecided → High
Revision history for this message
Zane Bitter (zaneb) wrote :

There's no fully correct way to solve this class of bugs short of convergence phase 2 (i.e. getting the reality of the current networks from the Nova API, rather than trying to get them from the previous template). There are many, many steps to updating a Nova server resource, and we don't keep track of which ones have and have not been performed (arguably, we should use something like taskflow for this).

That said, this specific issue should be avoided in convergence phase 1 (i.e. when the convergence_engine option is enabled), since we avoid cancelling a resource that is in-progress but rather wait for it to complete before rolling it back.

In the legacy path, we are supposed to allow a grace period (configurable as `error_wait_time` in heat.conf, but 4 minutes by default) before we cancel a resource. So we should only hit this issue if it takes longer than that to update the server resource. It'd be worth checking that that part is working correctly.

Revision history for this message
Zane Bitter (zaneb) wrote :

Actually, it's surprising that the server isn't being replaced in the case where the update is being cancelled, since that should leave it in a FAILED state?

Revision history for this message
huangtianhua (huangtianhua) wrote :

There's no fully correct way to solve this class of bugs short of convergence phase 2
--- I hope convergence phase 2 can solve the problem, but I ran into some issues and cannot test it more deeply:
1. https://bugs.launchpad.net/heat/+bug/1696897 if we enable observe-on-update, the logic of getting the real networks and comparing them with the new networks is currently incorrect
2. https://bugs.launchpad.net/heat/+bug/1698135 if convergence is enabled, the stack hangs in ROLLBACK_IN_PROGRESS after the update is cancelled

In the legacy path, we are supposed to allow a grace period (configurable as `error_wait_time` in heat.conf, but 4 minutes by default) before we cancel a resource. So we should only hit this issue if it takes longer than that to update the server resource.
--- Yes, this works if resource.status is IN_PROGRESS. In this bug's case, the server is already FAILED when the cancel message arrives.

Actually, it's surprising that the server isn't being replaced in the case where the update is being cancelled, since that should leave it in a FAILED state?
--- We introduced a mechanism in Ocata not to always replace a FAILED resource: if the definition is unchanged, we do nothing to the Nova server resource (the server is still ACTIVE in Nova).
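That Ocata-era behaviour can be sketched roughly as follows (the function name and signature here are illustrative, not Heat's actual API):

```python
# Hedged sketch of the behaviour described above: a FAILED resource is only
# replaced when its definition actually changed.

def should_replace_failed(status, old_defn, new_defn):
    """Replace a FAILED resource only if its definition changed."""
    if status != 'FAILED':
        return False
    # Definition unchanged: keep the server (it is still ACTIVE in Nova).
    return old_defn != new_defn

print(should_replace_failed('FAILED', {'net': 'network1'}, {'net': 'network1'}))  # False
print(should_replace_failed('FAILED', {'net': 'network1'}, {'net': 'network2'}))  # True
```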

Revision history for this message
huangtianhua (huangtianhua) wrote :

In the legacy path, we are supposed to allow a grace period (configurable as `error_wait_time` in heat.conf, but 4 minutes by default) before we cancel a resource. So we should only hit this issue if it takes longer than that to update the server resource.
--- Yes, this works if resource.status is IN_PROGRESS. In this bug's case the server may already be FAILED when the cancel message arrives? I am not sure about this.

Revision history for this message
huangtianhua (huangtianhua) wrote :

@Zane,
I looked at the code again; maybe there is a bug in https://github.com/openstack/heat/blob/master/heat/engine/resource.py#L1140.
We introduced a mechanism to allow individual resources to control the cancellation grace period.
When we process the new resource's update, the runner key is the new resource and its state is INIT_COMPLETE, so when we look up the grace period we get None and the task runner is closed immediately:
https://github.com/openstack/heat/blob/master/heat/engine/update.py#L229
https://github.com/openstack/heat/blob/master/heat/engine/scheduler.py#L401

So maybe we can modify the logic for getting the cancellation grace period. I will propose a patch for it; would you review it? :)
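The lookup bug described above can be modelled in a few lines (a simplified sketch; the class and method names only approximate Heat's, this is not the real implementation). The grace period is derived from the resource's state, and a resource that is still INIT_COMPLETE reports None, so its runner gets no grace at all:

```python
# Hedged sketch of the grace-period lookup bug: the runner key is the *new*
# resource, whose state is INIT_COMPLETE, so the grace period comes back as
# None and the runner is closed immediately.

IN_PROGRESS, INIT_COMPLETE = 'IN_PROGRESS', 'INIT_COMPLETE'

class FakeResource:
    def __init__(self, state):
        self.state = state

    def cancel_grace_period(self):
        # Only an in-progress resource reports a grace period.
        if self.state == IN_PROGRESS:
            return 240  # seconds, an error_wait_time-style default
        return None

def close_runner(resource):
    grace = resource.cancel_grace_period()
    if grace is None:
        return 'closed immediately'        # what happens in this bug
    return 'waited up to %ds first' % grace

# The runner key is the new resource, still INIT_COMPLETE:
print(close_runner(FakeResource(INIT_COMPLETE)))  # closed immediately
print(close_runner(FakeResource(IN_PROGRESS)))    # waited up to 240s first
```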

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/477087

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/477087
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=8c62e96947c70d577a9935a0c3a7b5a92efcbf5f
Submitter: Jenkins
Branch: master

commit 8c62e96947c70d577a9935a0c3a7b5a92efcbf5f
Author: huangtianhua <email address hidden>
Date: Sat Jun 24 18:44:59 2017 +0800

    Get cancellation grace period correctly

    This changes the logic of getting cancellation grace
    period of task runner before closing it: to move the
    liveness check into the cancel_all() method in the
    scheduler rather than ask the resource if it's IN_PROGRESS.

    Change-Id: Ia2a03de227ff15cdce1b3dbb6cd6bff6c5a50a15
    Partial-Bug: 1693495
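The idea of the merged fix can be sketched as follows (illustrative only, not the real scheduler code; `TaskRunner`, `started()` and `cancel_all()` here are simplified stand-ins): instead of asking the resource whether it is IN_PROGRESS, `cancel_all()` checks each runner's own liveness and applies the grace period to every still-running task:

```python
# Rough sketch of the fix's approach: the liveness check moves into
# cancel_all() in the scheduler, keyed off the runner itself rather than
# the resource's state.

class TaskRunner:
    def __init__(self, name, running):
        self.name = name
        self._running = running

    def started(self):
        # Liveness check lives on the runner itself.
        return self._running

    def cancel(self, grace_period=None):
        self._running = False
        return (self.name, grace_period)

def cancel_all(runners, grace_period):
    results = []
    for runner in runners:
        # Grant the grace period only to tasks that are actually live.
        gp = grace_period if runner.started() else None
        results.append(runner.cancel(grace_period=gp))
    return results

runners = [TaskRunner('old-server', running=True),
           TaskRunner('new-server', running=False)]
print(cancel_all(runners, grace_period=240))
# [('old-server', 240), ('new-server', None)]
```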

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/480034

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/ocata)

Reviewed: https://review.openstack.org/480034
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=5034bc10afb736eedca0cb4e0503652711008e23
Submitter: Jenkins
Branch: stable/ocata

commit 5034bc10afb736eedca0cb4e0503652711008e23
Author: huangtianhua <email address hidden>
Date: Sat Jun 24 18:44:59 2017 +0800

    Get cancellation grace period correctly

    This changes the logic of getting cancellation grace
    period of task runner before closing it: to move the
    liveness check into the cancel_all() method in the
    scheduler rather than ask the resource if it's IN_PROGRESS.

    Change-Id: Ia2a03de227ff15cdce1b3dbb6cd6bff6c5a50a15
    Partial-Bug: 1693495
    (cherry picked from commit 8c62e96947c70d577a9935a0c3a7b5a92efcbf5f)

tags: added: in-stable-ocata
Rico Lin (rico-lin)
Changed in heat:
milestone: none → no-priority-tag-bugs