Race condition in subnet and segment delete: The segment is still bound with port(s)

Bug #1878632 reported by Bence Romsics
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Bence Romsics

Bug Description

The HOT template below may expose a race condition and by that make stack deletion fail. On the neutron API this means that a segment delete fails with "The segment is still bound with port(s)". The reproduction uses a HOT template but I don't think this problem is Heat specific. Rather I think it depends on quick succession of API calls, which Heat does rather well.

Configuration:

ml2_conf.ini
[ml2]
mechanism_drivers = openvswitch,linuxbridge,sriovnicswitch,l2population
tenant_network_types = vxlan,vlan
[ml2_type_vlan]
network_vlan_ranges = physnet0:200:400,physnet1:200:400,...

sriov_agent.ini
[sriov_nic]
physical_device_mappings = physnet1:ens5

ovs_agent.ini
[ovs]
bridge_mappings = physnet0:br-physnet0,...

dhcp_agent.ini
[DEFAULT]
interface_driver = openvswitch

Reproduction:

# make this config live
$ sudo systemctl restart devstack@neutron-*

# create stack
$ openstack stack create -t ~/bug-segment-delete-does-not-delete-dhcp-ports.yml s0 --wait -f value -c stack_status
2020-05-14 14:23:23Z [s0]: CREATE_IN_PROGRESS Stack CREATE started
2020-05-14 14:23:24Z [s0.net0]: CREATE_IN_PROGRESS state changed
2020-05-14 14:23:24Z [s0.net0]: CREATE_COMPLETE state changed
2020-05-14 14:23:24Z [s0.segment0]: CREATE_IN_PROGRESS state changed
2020-05-14 14:23:25Z [s0.segment0]: CREATE_COMPLETE state changed
2020-05-14 14:23:25Z [s0.subnet0]: CREATE_IN_PROGRESS state changed
2020-05-14 14:23:26Z [s0.subnet0]: CREATE_COMPLETE state changed
2020-05-14 14:23:26Z [s0]: CREATE_COMPLETE Stack CREATE completed successfully
CREATE_COMPLETE

# wait until the dhcp port is created and it becomes ACTIVE
$ openstack stack resource show s0 net0 -f value -c physical_resource_id | xargs -r openstack port list --device-owner network:dhcp --network
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------+--------+
| ID | Name | MAC Address | Fixed IP Addresses | Status |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------+--------+
| 8cf8f188-5ea4-41b0-aa3a-fb8a8802888d | | fa:16:3e:d1:59:1d | ip_address='10.0.4.2', subnet_id='376d2581-3765-4071-bb8e-41a684e235ff' | ACTIVE |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------+--------+

# the dhcp port is not created by heat of course
$ openstack stack resource list s0
+---------------+--------------------------------------+----------------------+-----------------+----------------------+
| resource_name | physical_resource_id | resource_type | resource_status | updated_time |
+---------------+--------------------------------------+----------------------+-----------------+----------------------+
| net0 | 80ad5941-f581-4477-95e8-909c6140063d | OS::Neutron::Net | CREATE_COMPLETE | 2020-05-14T14:28:03Z |
| subnet0 | 376d2581-3765-4071-bb8e-41a684e235ff | OS::Neutron::Subnet | CREATE_COMPLETE | 2020-05-14T14:28:03Z |
| segment0 | 641c8c60-59c9-4972-bf82-3637f3e0f1cb | OS::Neutron::Segment | CREATE_COMPLETE | 2020-05-14T14:28:03Z |
+---------------+--------------------------------------+----------------------+-----------------+----------------------+

# stack delete fails
$ openstack stack delete s0 --yes --wait
2020-05-14 14:37:10Z [s0]: DELETE_IN_PROGRESS Stack DELETE started
2020-05-14 14:37:10Z [s0.subnet0]: DELETE_IN_PROGRESS state changed
2020-05-14 14:37:11Z [s0.subnet0]: DELETE_COMPLETE state changed
2020-05-14 14:37:11Z [s0.segment0]: DELETE_IN_PROGRESS state changed
2020-05-14 14:37:11Z [s0.segment0]: DELETE_FAILED ConflictException: resources.segment0: ConflictException: 409: Client Error for url: http://192.168.122.246:9696/v2.0/segments/641c8c60-59c9-4972-bf82-3637f3e0f1cb, Segment '641c8c60-59c9-4972-bf82-3637f3e0f1cb' cannot be deleted: The segment is still bou
2020-05-14 14:37:11Z [s0]: DELETE_FAILED Resource DELETE failed: ConflictException: resources.segment0: ConflictException: 409: Client Error for url: http://192.168.122.246:9696/v2.0/segments/641c8c60-59c9-4972-bf82-3637f3e0f1cb, Segment '641c8c60-59c9-4972-bf82-3637f3e0f1cb' cannot be deleted:·

 Stack s0 DELETE_FAILED·

Unable to delete 1 of the 1 stacks.

# during that heat-engine logged this
$ sudo journalctl -u devstack@h-eng -f | egrep -w ERROR
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource Traceback (most recent call last):
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/opt/stack/heat/heat/engine/resource.py", line 918, in _action_recorder
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource yield
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/opt/stack/heat/heat/engine/resource.py", line 2051, in delete
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource *action_args)
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/opt/stack/heat/heat/engine/scheduler.py", line 326, in wrapper
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource step = next(subtask)
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/opt/stack/heat/heat/engine/resource.py", line 972, in action_handler_task
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource handler_data = handler(*args)
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/opt/stack/heat/heat/engine/resources/openstack/neutron/segment.py", line 146, in handle_delete
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource self.client('openstack').network.delete_segment(self.resource_id)
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/usr/local/lib/python3.6/dist-packages/openstack/network/v2/_proxy.py", line 3312, in delete_segment
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource self._delete(_segment.Segment, segment, ignore_missing=ignore_missing)
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/usr/local/lib/python3.6/dist-packages/openstack/proxy.py", line 46, in check
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource return method(self, expected, actual, *args, **kwargs)
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/usr/local/lib/python3.6/dist-packages/openstack/network/v2/_proxy.py", line 75, in _delete
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource rv = res.delete(self, if_revision=if_revision)
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/usr/local/lib/python3.6/dist-packages/openstack/resource.py", line 1615, in delete
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource self._translate_response(response, has_body=False, **kwargs)
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/usr/local/lib/python3.6/dist-packages/openstack/resource.py", line 1113, in _translate_response
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource exceptions.raise_from_response(response, error_message=error_message)
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource File "/usr/local/lib/python3.6/dist-packages/openstack/exceptions.py", line 236, in raise_from_response
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource http_status=http_status, request_id=request_id
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: http://192.168.122.246:9696/v2.0/segments/641c8c60-59c9-4972-bf82-3637f3e0f1cb, Segment '641c8c60-59c9-4972-bf82-3637f3e0f1cb' cannot be deleted: The segment is still bound with port(s) 8cf8f188-5ea4-41b0-aa3a-fb8a8802888d.
máj 14 14:37:11 devstack1 heat-engine[12508]: ERROR heat.engine.resource

# a few seconds later a second delete succeeds
$ openstack stack delete s0 --yes --wait
2020-05-14 14:24:26Z [s0]: DELETE_IN_PROGRESS Stack DELETE started

I have an idea what the root cause is. I'll describe that in a comment.

Revision history for this message
Bence Romsics (bence-romsics) wrote :

The HOT template:

heat_template_version: newton

description: >

  Reproduce a race condition bug: When deleting this template segment
  delete may fail because subnet delete triggers async deletion of
  dhcp ports, but heat may try to delete the segment earlier while
  it still has dhcp ports on it and that will fail.

parameters:

  physnet_sriov:
    type: string
    default: physnet1

  physnet_ovs:
    type: string
    default: physnet0

  vlan:
    type: number
    default: 200

resources:

  net0:
    type: OS::Neutron::Net
    properties:
      value_specs:
        provider:network_type: vlan
        provider:segmentation_id:
          get_param: vlan
        provider:physical_network:
          get_param: physnet_sriov

  segment0:
    type: OS::Neutron::Segment
    properties:
      network:
        get_resource: net0
      network_type: vlan
      physical_network:
        get_param: physnet_ovs
      segmentation_id:
        get_param: vlan

  subnet0:
    type: OS::Neutron::Subnet
    properties:
      enable_dhcp: true
      network_id:
        get_resource: net0
      cidr: 10.0.4.0/24
    depends_on:
      - segment0

Revision history for this message
Bence Romsics (bence-romsics) wrote :

neutron commit 34448578cb

Revision history for this message
Bence Romsics (bence-romsics) wrote :

So about what I think the root cause is:

During the stack deletion neutron may delete dhcp ports in two ways:

* triggered by subnet deletion
* triggered by network deletion

AFAICT subnet deletion first leads to an rpc message to the dhcp agent, which deconfigures the subnet deleted, sends another rpc message back to the server, which finally deletes the dhcp port. Please see the message back to neutron-server here:

https://opendev.org/openstack/neutron/src/commit/cb55643a0695ebc5b41f50f6edb1546bcc676b71/neutron/api/rpc/handlers/dhcp_rpc.py#L239-L249

This is usually quite quick, but it does not and cannot happen synchronously in the subnet delete API transaction. Simply polling for the subnet to be gone is not enough. Plus unless a client starts having assumptions about which ports are going to be autodeleted by neutron, that client does have no way to tell when these ports are gone.

Another place where we auto-delete dhcp ports is in network delete:

https://opendev.org/openstack/neutron/src/commit/cb55643a0695ebc5b41f50f6edb1546bcc676b71/neutron/db/db_base_plugin_v2.py#L485-L495

This of course has to be synchronous, otherwise we'd be breaking the consistency of our object model (can't have ports with dangling networks).

But we don't do the same in segment delete:

https://opendev.org/openstack/neutron/src/commit/cb55643a0695ebc5b41f50f6edb1546bcc676b71/neutron/plugins/ml2/db.py#L327-L328

Please also note the 'depends_on' in the HOT template (i.e. the subnet depends_on the segment). Some may argue that the need for this is another bug, but I'm not addressing that here.

With this depends_on heat deletes the subnet first and the segment second, but since the dhcp ports are usually deleted a bit slower, segment deletion fails with "The segment is still bound with port(s)". A second try a few seconds later usually succeeds.

Since the cause is timing dependent, the reproduction is not guaranteed to work every time, but for me practically it does.

For a fix I think we should auto-delete dhcp ports in segment delete. I'll work on a patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/728904

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/729250

Changed in neutron:
status: New → In Progress
Changed in neutron:
assignee: Bence Romsics (bence-romsics) → Lajos Katona (lajos-katona)
Changed in neutron:
assignee: Lajos Katona (lajos-katona) → Bence Romsics (bence-romsics)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/729250
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=da45bbbff4e10e3d924ac3d85c9f382623708b11
Submitter: Zuul
Branch: master

commit da45bbbff4e10e3d924ac3d85c9f382623708b11
Author: Bence Romsics <email address hidden>
Date: Tue May 19 13:21:52 2020 +0200

    Auto-delete dhcp ports on segment delete

    Subnet delete triggers dhcp port deletion but asynchronously,
    therefore in the condition described in the bug report we may
    get a conflict when deleting the segment too fast after the subnet.

    Here we follow the example of how we auto-delete ports in net delete.

    Please also find a fullstack test in Related-Change below.

    Change-Id: Iba02f5a2211b18c2deb9097daad6be5e7d21faf8
    Closes-Bug: #1878632
    Related-Change: https://review.opendev.org/728904

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/740659

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/740660

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/740660
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=96fe5d35093c2a74717127a52fa517b41dbb8d19
Submitter: Zuul
Branch: stable/train

commit 96fe5d35093c2a74717127a52fa517b41dbb8d19
Author: Bence Romsics <email address hidden>
Date: Tue May 19 13:21:52 2020 +0200

    Auto-delete dhcp ports on segment delete

    Subnet delete triggers dhcp port deletion but asynchronously,
    therefore in the condition described in the bug report we may
    get a conflict when deleting the segment too fast after the subnet.

    Here we follow the example of how we auto-delete ports in net delete.

    Please also find a fullstack test in Related-Change below.

    Change-Id: Iba02f5a2211b18c2deb9097daad6be5e7d21faf8
    Closes-Bug: #1878632
    Related-Change: https://review.opendev.org/728904
    (cherry picked from commit da45bbbff4e10e3d924ac3d85c9f382623708b11)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/740659
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b4130c0dc1787e30a70ab33486047fee720ab286
Submitter: Zuul
Branch: stable/ussuri

commit b4130c0dc1787e30a70ab33486047fee720ab286
Author: Bence Romsics <email address hidden>
Date: Tue May 19 13:21:52 2020 +0200

    Auto-delete dhcp ports on segment delete

    Subnet delete triggers dhcp port deletion but asynchronously,
    therefore in the condition described in the bug report we may
    get a conflict when deleting the segment too fast after the subnet.

    Here we follow the example of how we auto-delete ports in net delete.

    Please also find a fullstack test in Related-Change below.

    Change-Id: Iba02f5a2211b18c2deb9097daad6be5e7d21faf8
    Closes-Bug: #1878632
    Related-Change: https://review.opendev.org/728904
    (cherry picked from commit da45bbbff4e10e3d924ac3d85c9f382623708b11)

tags: added: in-stable-ussuri
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers