Concurrent DB transaction when deleting resources

Bug #1642166 reported by Jan Provaznik
This bug affects 3 people
Affects: OpenStack Heat
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

After successfully deploying the openshift-on-openstack templates (https://github.com/redhat-openstack/openshift-on-openstack) with 50 nodes, when I try to delete the stack (or scale down), it fails because of the concurrent DB transaction error below.

I tried with this patch applied:
https://review.openstack.org/#/c/321783/
But no luck.

Also tried this:
[root@overcloud-controller-2 ~]# diff ~/resource.py /usr/lib/python2.7/site-packages/heat/objects/resource.py
41,42c41,43
< wrapper = retrying.retry(stop_max_attempt_number=11,
< wait_random_min=0.0, wait_random_max=2.0,
---
> wrapper = retrying.retry(stop_max_attempt_number=21,
> wait_random_min=0.0, wait_random_max=1000,
> wait_exponential_multiplier=2,

But same error.

2016-11-16 08:14:30.428 52982 INFO heat.engine.resource [req-905b301b-9cb8-429e-9d29-c1702ceb3441 - - - - -] DELETE: TemplateResource "mz6kpmaz34kh" [d3cb75c5-9731-4898-bd11-a74f744a394f] Stack "test-openshift_nodes-fkqtfy673zzm" [4377be7b-f8af-4215-89ac-030aade77b20]
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource Traceback (most recent call last):
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 743, in _action_recorder
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource yield
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1659, in delete
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource *action_args)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 353, in wrapper
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource step = next(subtask)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 796, in action_handler_task
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource done = check(handler_data)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 535, in check_delete_complete
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource return self._check_status_complete(self.DELETE)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 404, in _check_status_complete
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource action=action)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource ResourceFailure: ConcurrentTransaction_Remote: resources.mz6kpmaz34kh.resources.deployment_bastion_node_cleanup: Concurrent transaction for deployments of server 39bb0dea-d89a-4aac-ac46-3dca0ca7c2e3
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 424, in wrapped
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource return func(self, ctx, *args, **kwargs)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 2177, in create_software_deployment
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource stack_user_project_id=stack_user_project_id)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py", line 260, in create_software_deployment
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource cnxt, server_id, stack_user_project_id)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/retrying.py", line 68, in wrapped_f
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource return Retrying(*dargs, **dkw).call(f, *args, **kw)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/retrying.py", line 229, in call
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource raise attempt.get()
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/retrying.py", line 261, in get
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource six.reraise(self.value[0], self.value[1], self.value[2])
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/retrying.py", line 217, in call
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py", line 111, in _push_metadata_software_deployments
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource raise exception.ConcurrentTransaction(action=action)
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource ConcurrentTransaction: Concurrent transaction for deployments of server 39bb0dea-d89a-4aac-ac46-3dca0ca7c2e3
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource
2016-11-16 08:14:30.667 52980 INFO heat.engine.resource [req-6a43c1e2-40bc-4f9b-aacf-22533d966c20 - 8593e1a57e40462bb3c78722e00487d3 - - -] deleting RandomString "random_hostname_suffix" [test-openshift_nodes-fkqtfy673zzm-ft2r2yigvwbw-bk34p4tkqous-random_hostname_suffix-zsbkgq5xm2pc] Stack "test-openshift_nodes-fkqtfy673zzm-ft2r2yigvwbw-bk34p4tkqous" [78cf4008-e1b6-4804-b91f-d51e94be0d4c]
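
For readers unfamiliar with the retry tuning attempted in the diff above, here is a minimal, generic sketch of retrying on a conflict with randomized exponential backoff. It is not Heat's code and does not use the `retrying` library; `retry_with_backoff`, its parameters, and the stand-in `ConcurrentTransaction` class are hypothetical names chosen for illustration.

```python
import random
import time


class ConcurrentTransaction(Exception):
    """Stand-in for the conflict error seen in the traceback (hypothetical)."""


def retry_with_backoff(func, max_attempts=11, base_delay=0.05,
                       max_delay=2.0, sleep=time.sleep, rng=random.random):
    """Call func until it succeeds or max_attempts calls have failed.

    Between attempts, wait a random time in
    [0, min(max_delay, base_delay * 2**attempt)) seconds, so that
    callers which collided once are unlikely to collide again in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except ConcurrentTransaction:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the conflict to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

Backoff of this kind only reduces how often writers collide. When many writers keep targeting the same record, some of them can still exhaust their attempts, which would be consistent with the failure reported here.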

description: updated
Zane Bitter (zaneb) wrote :

We should probably store this data in a separate table and query all the rows that apply to a particular server from the DB when we need it, rather than combine them all into a single blob that then causes contention issues. (This might also solve various issues we have had with updating metadata while a resource is locked.)
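
A rough sketch of the separate-table idea, using an in-memory SQLite database. The table name `server_deployment`, its columns, and the helper functions are all hypothetical and are not Heat's schema or API; the point is only that each deployment becomes its own row, and the per-server view is assembled by a query when needed.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE server_deployment ("
    " server_id TEXT NOT NULL,"
    " deployment_id TEXT NOT NULL,"
    " config TEXT NOT NULL,"
    " PRIMARY KEY (server_id, deployment_id))"
)


def add_deployment(conn, server_id, deployment_id, config):
    """Insert one row; other deployments of the same server are not rewritten."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO server_deployment VALUES (?, ?, ?)",
            (server_id, deployment_id, json.dumps(config)),
        )


def remove_deployment(conn, server_id, deployment_id):
    """Delete one row without touching the server's other deployments."""
    with conn:
        conn.execute(
            "DELETE FROM server_deployment"
            " WHERE server_id = ? AND deployment_id = ?",
            (server_id, deployment_id),
        )


def server_metadata(conn, server_id):
    """Build the per-server deployment list on demand from its rows."""
    rows = conn.execute(
        "SELECT deployment_id, config FROM server_deployment"
        " WHERE server_id = ? ORDER BY deployment_id",
        (server_id,),
    ).fetchall()
    return [{"id": dep_id, "config": json.loads(cfg)} for dep_id, cfg in rows]
```

With this layout, two deployments for the same server write different rows, so neither has to read, modify, and write back a shared blob.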

In the short term though, Jan came up with this workaround which was fairly effective in spreading out the thundering herd:

  delete_delay_time:
    type: OS::Heat::RandomString
    properties:
      character_classes:
        - min: 2
          class: digits
      length: 2
  delete_delay:
    type: OS::Heat::TestResource
    properties:
      action_wait_secs:
        delete: {get_attr: [delete_delay_time, value]}
    depends_on: deployment_bastion_node_cleanup

Changed in heat:
status: New → Triaged
importance: Undecided → High
Zane Bitter (zaneb)
Changed in heat:
milestone: none → pike-1
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-1 → pike-2
Zane Bitter (zaneb) wrote :

This should be considerably improved by the fix for bug 1651768.

I'll leave this open, though, because we still need to come up with a solution that works at arbitrary scale.

Rico Lin (rico-lin)
Changed in heat:
milestone: pike-2 → pike-3
Zane Bitter (zaneb)
Changed in heat:
importance: High → Medium
milestone: pike-3 → next
vaibhav ganipi (vaibhav003) wrote :

Hi,

I am facing the same issue. Is this corrected in the Pike release? Can I have an update on this?
