CI: Deployment fails when task creation for the Controller times out in Heat

Bug #1731032 reported by Alex Schultz
Affects          Status        Importance  Assigned to
OpenStack Heat   Fix Released  Critical    Thomas Herve
tripleo          Fix Released  Critical    Unassigned

Bug Description

2017-11-08 17:48:09 | 2017-11-08 17:48:04Z [overcloud.Controller]: CREATE_FAILED CREATE aborted (Task create from ResourceGroup "Controller" Stack "overcloud" [4d2b6c7a-427c-4969-ac68-0ed51e7e568b] Timed out)
2017-11-08 17:48:09 | 2017-11-08 17:48:04Z [overcloud.Controller]: UPDATE_FAILED Stack UPDATE cancelled

http://logs.openstack.org/38/517038/4/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/f6ed2aa/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2017-11-08_17_48_09

On the controller, I noticed that the last thing logged was a notification back to Heat:
http://logs.openstack.org/38/517038/4/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/f6ed2aa/logs/subnode-2/var/log/messages.txt.gz#_Nov__8_16_35_19

On the undercloud, we see that this is the last thing in the heat-api-cfn logs:
http://logs.openstack.org/38/517038/4/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/f6ed2aa/logs/undercloud/var/log/heat/heat_api_cfn.log.txt.gz#_2017-11-08_16_35_19_237

In the heat-engine log, it seems to start looping:
http://logs.openstack.org/38/517038/4/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/f6ed2aa/logs/undercloud/var/log/heat/heat-engine.log.txt.gz#_2017-11-08_16_35_19_206

Revision history for this message
wes hayutin (weshayutin) wrote :

Adding promotion-blocker, as this is failing in the tripleo gate in addition to the check jobs.

Revision history for this message
Jiří Stránský (jistr) wrote :

In `sudo os-collect-config --print` output there's no UpdateDeployment mentioned, and in `ps` output os-collect-config has no running child processes that would block it from fetching new metadata:

http://logs.openstack.org/38/517038/4/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/f6ed2aa/logs/subnode-2/var/log/host_info.txt.gz

Postci says:

overcloud Still in progress
0 Still in progress
overcloud-Controller-btpsh5fttbpk-0-ogn43su3wg5g Still in progress
UpdateDeployment Still in progress
overcloud.Controller.0:
  resource_type: OS::TripleO::Controller
  physical_resource_id: fecba6f4-3448-48bd-b404-d90c8c94bfb8
  status: CREATE_FAILED
  status_reason: |
    resources[0]: Stack CREATE cancelled

http://logs.openstack.org/38/517038/4/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/f6ed2aa/logs/undercloud/var/log/postci.txt.gz

So it looks like there's a desync between what Heat thinks should be happening on the controller node and what os-collect-config is actually doing. Heat is waiting for UpdateDeployment to start, but os-collect-config doesn't have the metadata for it. Either there's a polling issue or an issue with publishing the metadata.

wes hayutin (weshayutin)
tags: added: promotion-blocker
Revision history for this message
Alan Pevec (apevec) wrote :

@jistr could this be the same root cause with metadata as in bug 1731063?

Revision history for this message
Alan Pevec (apevec) wrote :

<jistr> apevec: we discussed that yesterday with therve, likely they're not the same. One is Nova metadata, the other is Heat metadata.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

So from the failed logs, I noticed the following looping for the entire run.

2017-11-16 20:51:24.309 4886 DEBUG heat.engine.scheduler [req-9d6de4f1-772c-42ad-9095-be1add850c3a - - - - -] Task create from TemplateResource "0" Stack "overcloud-Controller-qlnrw2w555we" [e0638fae-2771-447b-8737-94802bc98d05] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:209
2017-11-16 20:51:24.314 4886 DEBUG heat.engine.scheduler [req-9d6de4f1-772c-42ad-9095-be1add850c3a - - - - -] Task create from TemplateResource "0" Stack "overcloud-Controller-qlnrw2w555we" [e0638fae-2771-447b-8737-94802bc98d05] sleeping _sleep /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:150

In a successful run this seems to stop with an update traversal log line.

http://logs.openstack.org/77/519577/5/check/legacy-tripleo-ci-centos-7-scenario002-multinode-oooq-container/5bbf7e7/logs/undercloud/var/log/heat/heat-engine.log.txt.gz#_2017-11-14_16_28_41_870

2017-11-14 16:28:41.870 1174 DEBUG heat.engine.sync_point [req-90ec1356-1161-42a8-9421-e185dbe0b7ea - admin - default default] [d3a6d095-e9f7-4627-80cc-2f1a86d78c0f:2ebdac61-f678-41bb-8392-02194e1952d3:True] Ready d3a6d095-e9f7-4627-80cc-2f1a86d78c0f: Got (340, True) sync /usr/lib/python2.7/site-packages/heat/engine/sync_point.py:145
2017-11-14 16:28:41.870 1174 INFO heat.engine.stack [req-90ec1356-1161-42a8-9421-e185dbe0b7ea - admin - default default] [overcloud-Controller-o4rwrff2glyq(d3a6d095-e9f7-4627-80cc-2f1a86d78c0f)] update traversal 2ebdac61-f678-41bb-8392-02194e1952d3 complete
2017-11-14 16:28:41.871 1174 INFO heat.engine.stack [req-90ec1356-1161-42a8-9421-e185dbe0b7ea - admin - default default] Stack UPDATE COMPLETE (overcloud-Controller-o4rwrff2glyq): Stack UPDATE completed successfully

Revision history for this message
Alex Schultz (alex-schultz) wrote :

So to continue from comment #10: that appears to be because ControllerUpgradeInitDeployment never finishes, but I can't find what it is still waiting on.

Revision history for this message
Thomas Herve (therve) wrote :

I'm starting to narrow it down to an area. We get up to NetworkDeployment. Once it's signaled, two deployments are triggered from the stack: ControllerUpgradeInitDeployment and UpdateDeployment (as both depend on NetworkDeployment). Those are then supposed to update the Swift temp URL with the deployment data. What seems to happen is that they step on each other, and the second one to run overrides the first. We can see it in the Swift logs:

Nov 16 19:44:10 centos-7-citycloud-lon1-0000925781 proxy-server[7670]: 192.168.24.1 192.168.24.1 16/Nov/2017/19/44/10 PUT /v1/AUTH_0b280f98fead4055bce9a4b8860998b5/ov-ntroller-leqby5yewtov-deployed-server-e6phgktmis65/ff613249-3912-453b-beb0-8710c4af0b22%3Ftemp_url_expires%3D2147483586%26temp_url_sig%3Dbeeb444483649059ba3ee9364d672101a0a9ff50 HTTP/1.0 201 - python-requests/2.14.2 - 36588 - - txad48021f81c040aeabb2a-005a0dea8a - 0.1672 - - 1510861450.004256964 1510861450.171437025 0
Nov 16 19:44:10 centos-7-citycloud-lon1-0000925781 proxy-server[7671]: 192.168.24.1 192.168.24.1 16/Nov/2017/19/44/10 PUT /v1/AUTH_0b280f98fead4055bce9a4b8860998b5/ov-ntroller-leqby5yewtov-deployed-server-e6phgktmis65/ff613249-3912-453b-beb0-8710c4af0b22%3Ftemp_url_expires%3D2147483586%26temp_url_sig%3Dbeeb444483649059ba3ee9364d672101a0a9ff50 HTTP/1.0 201 - python-requests/2.14.2 - 34811 - - txf4d5aa9a0b384d0a87ef5-005a0dea89 - 0.5486 - - 1510861449.966480970 1510861450.515069008 0

The size goes from 36588 to 34811 bytes (the size of the JSON blob delivered to the instance).

https://github.com/openstack/heat/blob/master/heat/engine/service_software_config.py#L108 is where we retrieve the deployments. We're doing a metadata update to synchronize, but something is going wrong.
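
To make the overwrite concrete, here's a toy sketch (an editor's illustration, not Heat code) of what the two PUTs above amount to: each deployment resource pushes its own snapshot of the full metadata blob to the same Swift object, so whichever write lands last silently discards the other. The deployment names come from this stack; the in-memory stand-in for the temp URL is an assumption made to keep the example runnable.

    import json
    import threading

    # Stand-in for the Swift object behind the signed temp URL; a real client
    # would PUT to the URL itself, but a shared variable is enough to show the
    # last-write-wins behaviour.
    swift_object = {"data": None}

    def push_metadata(deployments):
        # Each deployment resource serializes its *own* snapshot of all
        # deployments and overwrites the whole object; there is no
        # read-modify-write protection at this layer.
        swift_object["data"] = json.dumps({"deployments": deployments})

    # Each writer's snapshot was taken before it saw the other deployment,
    # and both race to the same object.
    t1 = threading.Thread(target=push_metadata,
                          args=([{"name": "ControllerUpgradeInitDeployment"}],))
    t2 = threading.Thread(target=push_metadata,
                          args=([{"name": "UpdateDeployment"}],))
    t1.start(); t2.start(); t1.join(); t2.join()

    # Whichever PUT landed last wins; the other deployment's entry is gone.
    print(swift_object["data"])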

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Thomas, so if we have a race condition, would it make sense to make the UpdateDeployment dependency sequential instead of concurrent? Like in the attached picture :)

Revision history for this message
Thomas Herve (therve) wrote :

Created https://review.openstack.org/521064 to get some debug logs, though I guess it needs to get through promotion first.

Bogdan: that might be a workaround yes. I hope we'll find a fix before needing to do that.

Revision history for this message
Zane Bitter (zaneb) wrote :

It would be interesting to be able to see the metadata in the Heat database (perhaps Thomas's patch will tell us this).

To me it looks like the database update code is very robust. There's a key that increments atomically every time we change the metadata, and if something else has changed it we reload and try again.
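
As a rough sketch of that reload-and-retry pattern (an in-memory simulation added for illustration, not the actual Heat schema or code):

    import threading

    # Stand-in for the metadata row: the deployments plus an atomic_key that
    # increments on every successful write.
    _lock = threading.Lock()
    _row = {"atomic_key": 0, "deployments": []}

    def read_row():
        with _lock:
            return {"atomic_key": _row["atomic_key"],
                    "deployments": list(_row["deployments"])}

    def compare_and_swap(expected_key, deployments):
        # Succeeds only if nobody wrote since we read -- the moral equivalent
        # of "UPDATE ... SET atomic_key = atomic_key + 1
        #     WHERE atomic_key = :expected".
        with _lock:
            if _row["atomic_key"] != expected_key:
                return False
            _row["atomic_key"] += 1
            _row["deployments"] = deployments
            return True

    def add_deployment(deployment):
        # A lost race just means we reload and retry on top of the other
        # writer's result, so no update is ever silently dropped.
        while True:
            row = read_row()
            if compare_and_swap(row["atomic_key"],
                                row["deployments"] + [deployment]):
                return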

What does *not* look robust is where we push the data into Swift or Zaqar. Once we've updated the metadata in the DB, we push the latest data to Swift or Zaqar with no locking whatsoever, and thus no way to check whether we're racing against a later update. If the later update gets written first, the stale data will then overwrite it.

Prior to convergence this could probably only happen when the two software deployments were in separate stacks or when the deployments were created via the REST API, because operations on resources in a single stack happen in the same thread. (The actual creation of the software deployment happens in a separate context, but the call to create it is synchronous.) With the improved concurrency in convergence, it's possible for two software deployments in the same stack to be racing to update Swift.

Changed in heat:
importance: Undecided → Critical
status: New → Triaged
milestone: none → queens-2
Revision history for this message
Zane Bitter (zaneb) wrote :

On reflection, I don't think the Zaqar transport is affected, because we don't overwrite the metadata, we just send another message. It doesn't appear to matter whether the messages arrive in order, because os-collect-config stores every deployment it receives in a message (and triggers os-refresh-config each time); it doesn't delete old deployments, nor does it look only at the most recent message.
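
A toy illustration of that point (simplified; per the above, the real agent also triggers os-refresh-config for each message): because the collector accumulates the deployments carried by every message rather than keeping only the newest blob, arrival order cannot lose data.

    # Hypothetical accumulator mirroring the described behaviour.
    seen = {}

    def on_message(msg):
        for dep in msg["deployments"]:
            seen[dep["id"]] = dep  # accumulate; ordering does not matter

    # Messages arriving in reverse order still leave the full set behind.
    for msg in [{"deployments": [{"id": "dep-2"}]},
                {"deployments": [{"id": "dep-1"}]}]:
        on_message(msg)
    assert set(seen) == {"dep-1", "dep-2"}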

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/521204

Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
status: Triaged → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I propose to use a more reliable backend instead (Redis/Mongo?), not Swift. It seems that Swift is only good for non-convergent Heat stacks.

Changed in heat:
assignee: Zane Bitter (zaneb) → Steve Baker (steve-stevebaker)
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

I think Zane's race fix will help here; see my review on the change.

Changed in heat:
assignee: Steve Baker (steve-stevebaker) → Zane Bitter (zaneb)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/521468

Changed in heat:
assignee: Zane Bitter (zaneb) → Thomas Herve (therve)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Zane Bitter (<email address hidden>) on branch: master
Review: https://review.openstack.org/521204
Reason: I started writing this before I realised that Zaqar wasn't also affected. Since it's just Swift, Thomas's patch I58d3af3d1de481b3d314313c9f39255504792634 is a tidier solution. Abandoning in favour of that.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Thomas Herve (<email address hidden>) on branch: master
Review: https://review.openstack.org/521064

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/521468
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=746a122d81eda0de4bf57a0a48f2b663a0c45140
Submitter: Zuul
Branch: master

commit 746a122d81eda0de4bf57a0a48f2b663a0c45140
Author: Thomas Herve <email address hidden>
Date: Mon Nov 20 10:51:13 2017 +0100

    Check swift deployments before update

    Before making a software deployments update in the Swift backend, check
    the data content to operate as a transaction.

    Change-Id: I58d3af3d1de481b3d314313c9f39255504792634
    Closes-Bug: #1731032
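
A hedged sketch of the check-before-write idea this commit message describes, written against python-swiftclient's Connection API; the superset-of-IDs comparison is an assumed heuristic standing in for the patch's actual check:

    import json

    def checked_put(swift, container, obj, payload):
        # Fetch what is currently stored and skip the write if it already
        # holds deployments we don't know about (i.e. a concurrent writer
        # got there with newer data).
        try:
            _headers, body = swift.get_object(container, obj)
            stored_ids = {d["id"]
                          for d in json.loads(body).get("deployments", [])}
        except Exception:
            stored_ids = set()  # object missing or unreadable: safe to write
        our_ids = {d["id"] for d in payload.get("deployments", [])}
        if not stored_ids <= our_ids:
            return False  # stale payload; don't clobber the newer object
        swift.put_object(container, obj, json.dumps(payload))
        return True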

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
Thomas Herve (therve) wrote :

The fix is merged, so I'm removing the alert. We need a promotion to see if the issue disappears.

tags: removed: alert
Changed in tripleo:
milestone: queens-2 → queens-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 10.0.0.0b2

This issue was fixed in the openstack/heat 10.0.0.0b2 development milestone.

Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Thomas Herve (therve) wrote :

Restored the bug; it's not fixed.

Changed in heat:
status: Fix Released → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/521204
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=46132999c016ed44972b9ac87a53946e9d8a7111
Submitter: Zuul
Branch: master

commit 46132999c016ed44972b9ac87a53946e9d8a7111
Author: Zane Bitter <email address hidden>
Date: Fri Nov 17 16:15:05 2017 -0500

    Resolve race in providing deployment data to Swift

    While we are careful to ensure there are no race conditions in updating
    the software deployment metadata in the Heat database, there was no such
    protection in place for the actual update of the metadata provided to
    the server via Swift. In legacy Heat, this race could only occur when
    software deployments were created from the REST API or from separate
    stacks deploying software to the same server, because resources in the
    same stack were handled by a single thread making synchronous calls to
    create the software deployments. However, with convergence, resources in
    the same stack not only can be processed concurrently but will tend to
    be synchronised if they have the same dependencies (since they'll be
    triggered by consecutive asynchronous RPC messages).

    This handles the race by retrying if a concurrent update is detected by
    the time the data has been provided to Swift.

    Note that this is only guaranteed to result in the latest data being
    available on the assumption that a later write to Swift cannot overwrite
    an earlier one that has been accepted. This is probably *not* guaranteed
    by Swift in the presence of clock skew. Nonetheless, this should
    drastically reduce the failure rate.

    Similar caveats would probably apply to Zaqar, although Zaqar nominally
    provides FIFO queues (in practice, it likely depends on the backend
    used). However, with Zaqar clients receive all updates, not just the
    latest one, and os-collect-config stores the deployments contained in
    every message. Thus, Zaqar is not affected even assuming out-of-order
    arrival.

    Change-Id: Ic9a447f27e8c51f91f47f93b0fd1e9710341ec38
    Closes-Bug: #1731032
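
One plausible shape of that retry scheme (an assumption based on the commit message, not the merged code), where read_db_snapshot is a hypothetical helper returning the current atomic key and deployment data from the Heat database:

    import json

    def push_with_retry(swift, container, obj, read_db_snapshot):
        while True:
            key, deployments = read_db_snapshot()
            swift.put_object(container, obj,
                             json.dumps({"deployments": deployments}))
            if read_db_snapshot()[0] == key:
                # Nothing changed while our PUT was in flight, so the object
                # holds the latest data (modulo the Swift ordering caveat
                # noted above).
                return
            # A concurrent update landed during the PUT; push the fresher
            # snapshot on the next pass.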

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/557311

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/queens)

Reviewed: https://review.openstack.org/557311
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=8819c6380f242d5fd28ef85bfc63ad53f074aaad
Submitter: Zuul
Branch: stable/queens

commit 8819c6380f242d5fd28ef85bfc63ad53f074aaad
Author: Zane Bitter <email address hidden>
Date: Fri Nov 17 16:15:05 2017 -0500

    Resolve race in providing deployment data to Swift

    While we are careful to ensure there are no race conditions in updating
    the software deployment metadata in the Heat database, there was no such
    protection in place for the actual update of the metadata provided to
    the server via Swift. In legacy Heat, this race could only occur when
    software deployments were created from the REST API or from separate
    stacks deploying software to the same server, because resources in the
    same stack were handled by a single thread making synchronous calls to
    create the software deployments. However, with convergence, resources in
    the same stack not only can be processed concurrently but will tend to
    be synchronised if they have the same dependencies (since they'll be
    triggered by consecutive asynchronous RPC messages).

    This handles the race by retrying if a concurrent update is detected by
    the time the data has been provided to Swift.

    Note that this is only guaranteed to result in the latest data being
    available on the assumption that a later write to Swift cannot overwrite
    an earlier one that has been accepted. This is probably *not* guaranteed
    by Swift in the presence of clock skew. Nonetheless, this should
    drastically reduce the failure rate.

    Similar caveats would probably apply to Zaqar, although Zaqar nominally
    provides FIFO queues (in practice, it likely depends on the backend
    used). However, with Zaqar clients receive all updates, not just the
    latest one, and os-collect-config stores the deployments contained in
    every message. Thus, Zaqar is not affected even assuming out-of-order
    arrival.

    Change-Id: Ic9a447f27e8c51f91f47f93b0fd1e9710341ec38
    Closes-Bug: #1731032
    (cherry picked from commit 46132999c016ed44972b9ac87a53946e9d8a7111)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 11.0.0.0b1

This issue was fixed in the openstack/heat 11.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 10.0.1

This issue was fixed in the openstack/heat 10.0.1 release.
