TripleO Deployments get stuck sometimes until a timeout is reached

Bug #1488366 reported by Removed by request
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
OpenStack Heat   Fix Released   High         Steve Baker
Kilo             Fix Released   High         Unassigned
tripleo          Fix Released   High         Unassigned

Bug Description

During deployment, some jobs simply get stuck and do not progress. Once the configured timeout is reached, the stack goes to CREATE_FAILED. Updating the stack then reruns the 'stuck' jobs, and the deployment may succeed after that.

Tags: tripleo
Steve Baker (steve-stevebaker) wrote :

This has identical symptoms to bug 1477329, but the root cause is apparently not the same race.

Changed in heat:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Steve Baker (steve-stevebaker)
Changed in heat:
importance: Medium → High
Steve Baker (steve-stevebaker) wrote :

I think I've found the root cause. Looking at the logs for a failed overcloud, this issue happens when a deployment is added to a server at the same time as other deployments are signalling.

EngineService.resource_signal has logic which calls metadata_update on *every* resource in the stack after a signal. For a server resource, this results in an update query to rsrc_metadata which bypasses the stack locking, so it can overwrite the update done by SoftwareConfigService._push_metadata_software_deployments.
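
The lost update described above can be illustrated with a small, deterministic Python sketch. This is not Heat code: `FakeResourceRow`, `push_deployments` and `simulate_race` are hypothetical stand-ins, the in-memory dict only approximates the rsrc_metadata column, and the exact interleaving (stale read before the locked write, stale write after it) is an assumption made for illustration.

```python
import threading


class FakeResourceRow:
    """Hypothetical stand-in for a server resource row; not Heat's DB API."""

    def __init__(self):
        self.rsrc_metadata = {"deployments": ["deploy-1"]}
        # Stands in for the lock that the unlocked writer bypasses.
        self.lock = threading.Lock()


def push_deployments(row, new_deployment):
    """Locked writer: read, modify and write back while holding the lock."""
    with row.lock:
        deployments = list(row.rsrc_metadata["deployments"])
        deployments.append(new_deployment)
        row.rsrc_metadata = {"deployments": deployments}


def simulate_race():
    """Replay one assumed interleaving of the two writers, step by step."""
    row = FakeResourceRow()

    # 1. The signal path takes a snapshot of the metadata (no lock held).
    stale = {"deployments": list(row.rsrc_metadata["deployments"])}

    # 2. A new deployment is pushed to the server under the lock.
    push_deployments(row, "deploy-2")
    after_push = list(row.rsrc_metadata["deployments"])

    # 3. The signal path writes its snapshot back without the lock,
    #    silently discarding the deployment added in step 2.
    row.rsrc_metadata = stale

    return after_push, list(row.rsrc_metadata["deployments"])


after_push, final = simulate_race()
print(after_push)  # ['deploy-1', 'deploy-2']
print(final)       # ['deploy-1']
```

In this sketch the server never sees `deploy-2`, which would match the symptom of a deployment that never runs and a stack that waits until the timeout.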

Given that calling metadata_update on all resources is legacy behaviour to support wait conditions, the best fix may be to have a way of indicating whether a given signal should result in metadata_update calls on the other resources.
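
A rough Python sketch of that idea follows. It assumes a per-resource flag consulted by the signal handler; the class names, the `signal_needs_metadata_updates` attribute and the simplified `resource_signal` function are illustrative assumptions and may not match the change that was eventually merged.

```python
class Resource:
    """Hypothetical resource; by default a signal refreshes all metadata."""

    # Assumed flag: should a signal to this resource trigger
    # metadata_update() on every resource in the stack?
    signal_needs_metadata_updates = True

    def __init__(self, name):
        self.name = name
        self.metadata_update_calls = 0

    def metadata_update(self):
        # Only count calls; the real method would rewrite stored metadata.
        self.metadata_update_calls += 1


class WaitConditionHandle(Resource):
    """Legacy case that relies on the stack-wide metadata refresh."""


class SoftwareDeployment(Resource):
    """Deployment signals opt out of the stack-wide refresh."""

    signal_needs_metadata_updates = False


def resource_signal(stack, resource_name):
    """Simplified signal handler over a dict of name -> Resource."""
    target = stack[resource_name]
    if target.signal_needs_metadata_updates:
        for resource in stack.values():
            resource.metadata_update()


stack = {
    "server": Resource("server"),
    "handle": WaitConditionHandle("handle"),
    "deployment": SoftwareDeployment("deployment"),
}

resource_signal(stack, "deployment")
calls_after_deployment = stack["server"].metadata_update_calls

resource_signal(stack, "handle")
calls_after_handle = stack["server"].metadata_update_calls

print(calls_after_deployment, calls_after_handle)  # 0 1
```

A class-level flag keeps the legacy wait condition behaviour as the default, so only resource types that are known not to need the refresh have to opt out.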

OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/216920

Changed in heat:
status: Triaged → In Progress
Changed in heat:
milestone: none → liberty-3
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/216920
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=b8f38204148e55a1a9f335251f915cc8f47b4999
Submitter: Jenkins
Branch: master

commit b8f38204148e55a1a9f335251f915cc8f47b4999
Author: Steve Baker <email address hidden>
Date: Wed Aug 26 13:30:09 2015 +1200

    Don't metadata_update all resources for deployment signals

    This fixes a race when a call to
    SoftwareConfigService._push_metadata_software_deployments is in
    progress at the same time as other deployments are signalling heat.

    The signal triggers a metadata_update() on the server resource
    which results in rsrc_metadata being updated without the resource
    lock. This can overwrite the rsrc_metadata being written by
    SoftwareConfigService._push_metadata_software_deployments resulting in
    stale metadata.

    Change-Id: I081bc154ed7e79f4a4258c846857b3f39cc7887c
    Closes-Bug: #1488366

Changed in heat:
status: In Progress → Fix Committed
Changed in heat:
status: Fix Committed → Fix Released
Zane Bitter (zaneb)
tags: added: kilo-backport-potential
Changed in tripleo:
status: New → Fix Released
importance: Undecided → High
Angus Salkeld (asalkeld)
tags: removed: kilo-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/225512

OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/kilo)

Reviewed: https://review.openstack.org/225512
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=2bffb1273c8f8de8b9384382800a9aae2f53bad6
Submitter: Jenkins
Branch: stable/kilo

commit 2bffb1273c8f8de8b9384382800a9aae2f53bad6
Author: Angus Salkeld <email address hidden>
Date: Fri Sep 25 11:25:12 2015 +1000

    Don't metadata_update all resources for deployment signals

    This fixes a race when a call to
    SoftwareConfigService._push_metadata_software_deployments is in
    progress at the same time as other deployments are signalling heat.

    The signal triggers a metadata_update() on the server resource
    which results in rsrc_metadata being updated without the resource
    lock. This can overwrite the rsrc_metadata being written by
    SoftwareConfigService._push_metadata_software_deployments resulting in
    stale metadata.

    Note: heat/tests/engine/test_stack_snapshot.py was not getting tested
    as there was no __init__.py. This test file seems to require code
    that is not in stable/kilo.

    Change-Id: I081bc154ed7e79f4a4258c846857b3f39cc7887c
    Closes-Bug: #1488366
    (cherry picked from commit b8f38204148e55a1a9f335251f915cc8f47b4999)

Thierry Carrez (ttx)
Changed in heat:
milestone: liberty-3 → 5.0.0