ML2 mech driver sometimes receives network context without provider attributes in delete_network_postcommit

Bug #1841967 reported by Mark Goddard
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Low
Rodolfo Alonso

Bug Description

When a network is deleted, sometimes the delete_network_postcommit method of my ML2 mechanism driver receives a network object in the context that has the provider attributes set to None.

I am using Rocky (13.0.4), on CentOS 7.5 + RDO, and kolla-ansible. I have three controllers running neutron-server.

Specifically, the mechanism driver is networking-generic-switch. It needs the provider information in order to configure VLANs on physical switches, and without it I am left with stale switch configuration.

In my testing I have found that reducing the number of neutron-server instances reduces the likelihood of seeing this issue. I did not see it with only one instance running, but only tested ~10 times.

I have collected logs from a broken case and a working case, and one key difference I can see is that in the working case I see two of these messages, and in the broken case I see three:

Network 3ed87da6-0b3a-455a-b813-7d069dc9e112 has no segments _extend_network_dict_provider /usr/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py:168

Indeed, _extend_network_dict_provider sets the provider attributes to None if there are no segments found in the DB.

It seems to be a race condition between segment deletion and creation of the _mech_context in the network precommit.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Logs from a working delete: http://paste.openstack.org/show/767151/
Logs from a failing delete: http://paste.openstack.org/show/767152/

Revision history for this message
Mark Goddard (mgoddard) wrote :

Looking at the code a little more, I think the problem is that the segments extension subscribes to the delete network precommit callback (_delete_segments_for_network), which deletes the segments associated with the network. I don't know how this interacts with the precommit callback for the ML2 plugin (_network_delete_precommit_handler), which sets the mech context.

tags: added: networking-generic-switc segments
tags: added: networking-generic-switch
removed: networking-generic-switc
Revision history for this message
Lajos Katona (lajos-katona) wrote :

Hi,
Thanks for the bug report.
Based on the logs you provided I am not sure I see the exact problem.
Do you think that I can reproduce the issue without networking-generic-switch?

Revision history for this message
Mark Goddard (mgoddard) wrote :

Thanks for responding. I don't believe this issue is specific to NGS.

To reproduce, try adding a log message to an ML2 driver's delete_network_postcommit, that prints the values of the provider attributes from context.current. If you create and delete a few vlan provider networks, you should see that sometimes the fields are None.

Note that I have only tested this on rocky.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Bear in mind that I needed multiple neutron servers to reproduce.

Revision history for this message
Mark Goddard (mgoddard) wrote :

After further investigation, multiple neutron server instances are not necessary. A given instance of neutron-server has a 50% chance of hitting the issue, dependent upon the ordering of precommit callbacks (segments and ML2) in a dictionary. Restarting neutron-server rolls the dice again.

Revision history for this message
Mark Goddard (mgoddard) wrote :

I have a patch for this which I'll push soon.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/679483

Changed in neutron:
assignee: nobody → Mark Goddard (mgoddard)
status: New → In Progress
Changed in neutron:
importance: Undecided → Low
Revision history for this message
Mark Goddard (mgoddard) wrote :

I think this is at least Medium - it can lead to stale network configuration, and possible subsequent failures.

Changed in neutron:
assignee: Mark Goddard (mgoddard) → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/679483
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fea2d9091f71a2ec88318121ed9a22180e1ae96f
Submitter: Zuul
Branch: master

commit fea2d9091f71a2ec88318121ed9a22180e1ae96f
Author: Mark Goddard <email address hidden>
Date: Fri Aug 30 16:58:34 2019 +0100

    Create _mech_context before delete to avoid race

    When a network is deleted, precommit handlers are notified prior to the
    deletion of the network from the database. One handler exists in the ML2
    plugin - _network_delete_precommit_handler. This handler queries the
    database for the current state of the network and uses it to create a
    NetworkContext which it saves under context._mech_context. When the
    postcommit handler _network_delete_after_delete_handler is triggered
    later, it passess the saved context._mech_context to mechanism drivers.

    A problem can occur with provider networks since the segments service
    also registers a precommit handler - _delete_segments_for_network. Both
    precommit handlers use the default priority, so the order in which they
    are called is random, and determined by dict ordering. If the segment
    precommit handler executes first, it will delete the segments associated
    with the network. When the ML2 plugin precommit handler runs it then
    sees no segments for the network and sets the provider attributes of the
    network in the NetworkContext to None.

    A mechanism driver that is passed a NetworkContext without provider
    attributes in its delete_network_postcommit method will not have the
    information to perform the necessary actions. In the case of the
    networking-generic-switch mechanism driver where this was observed, this
    resulted in the driver ignoring the event, because the network did not
    look like a VLAN.

    This change uses a priority of zero for ML2 network delete precommit
    handler, to ensure they query the network and store the NetworkContext
    before the segments service has a chance to delete segments.

    A similar change has been made for subnets, both to keep the pattern
    consistent and avoid any similar issues.

    Change-Id: I6482223ed2a479de4f5ef4cef056c311c0281408
    Closes-Bug: #1841967
    Depends-On: https://review.opendev.org/680001

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/681662

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/681663

tags: added: pike-backport-potential queens-backport-potential rocky-backport-potential stein-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/681662
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7ae3280c34acb92e0f509566354553dc96df3de7
Submitter: Zuul
Branch: stable/stein

commit 7ae3280c34acb92e0f509566354553dc96df3de7
Author: Mark Goddard <email address hidden>
Date: Fri Aug 30 16:58:34 2019 +0100

    Create _mech_context before delete to avoid race

    When a network is deleted, precommit handlers are notified prior to the
    deletion of the network from the database. One handler exists in the ML2
    plugin - _network_delete_precommit_handler. This handler queries the
    database for the current state of the network and uses it to create a
    NetworkContext which it saves under context._mech_context. When the
    postcommit handler _network_delete_after_delete_handler is triggered
    later, it passess the saved context._mech_context to mechanism drivers.

    A problem can occur with provider networks since the segments service
    also registers a precommit handler - _delete_segments_for_network. Both
    precommit handlers use the default priority, so the order in which they
    are called is random, and determined by dict ordering. If the segment
    precommit handler executes first, it will delete the segments associated
    with the network. When the ML2 plugin precommit handler runs it then
    sees no segments for the network and sets the provider attributes of the
    network in the NetworkContext to None.

    A mechanism driver that is passed a NetworkContext without provider
    attributes in its delete_network_postcommit method will not have the
    information to perform the necessary actions. In the case of the
    networking-generic-switch mechanism driver where this was observed, this
    resulted in the driver ignoring the event, because the network did not
    look like a VLAN.

    This change uses a priority of zero for ML2 network delete precommit
    handler, to ensure they query the network and store the NetworkContext
    before the segments service has a chance to delete segments.

    A similar change has been made for subnets, both to keep the pattern
    consistent and avoid any similar issues.

    Change-Id: I6482223ed2a479de4f5ef4cef056c311c0281408
    Closes-Bug: #1841967
    Depends-On: https://review.opendev.org/680001
    (cherry picked from commit fea2d9091f71a2ec88318121ed9a22180e1ae96f)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/681663
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f913394d70dede8d40d6e96485bd5333dfb6f43f
Submitter: Zuul
Branch: stable/rocky

commit f913394d70dede8d40d6e96485bd5333dfb6f43f
Author: Mark Goddard <email address hidden>
Date: Fri Aug 30 16:58:34 2019 +0100

    Create _mech_context before delete to avoid race

    When a network is deleted, precommit handlers are notified prior to the
    deletion of the network from the database. One handler exists in the ML2
    plugin - _network_delete_precommit_handler. This handler queries the
    database for the current state of the network and uses it to create a
    NetworkContext which it saves under context._mech_context. When the
    postcommit handler _network_delete_after_delete_handler is triggered
    later, it passess the saved context._mech_context to mechanism drivers.

    A problem can occur with provider networks since the segments service
    also registers a precommit handler - _delete_segments_for_network. Both
    precommit handlers use the default priority, so the order in which they
    are called is random, and determined by dict ordering. If the segment
    precommit handler executes first, it will delete the segments associated
    with the network. When the ML2 plugin precommit handler runs it then
    sees no segments for the network and sets the provider attributes of the
    network in the NetworkContext to None.

    A mechanism driver that is passed a NetworkContext without provider
    attributes in its delete_network_postcommit method will not have the
    information to perform the necessary actions. In the case of the
    networking-generic-switch mechanism driver where this was observed, this
    resulted in the driver ignoring the event, because the network did not
    look like a VLAN.

    This change uses a priority of zero for ML2 network delete precommit
    handler, to ensure they query the network and store the NetworkContext
    before the segments service has a chance to delete segments.

    A similar change has been made for subnets, both to keep the pattern
    consistent and avoid any similar issues.

    Change-Id: I6482223ed2a479de4f5ef4cef056c311c0281408
    Closes-Bug: #1841967
    Depends-On: https://review.opendev.org/680001
    (cherry picked from commit fea2d9091f71a2ec88318121ed9a22180e1ae96f)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.0.0b1

This issue was fixed in the openstack/neutron 15.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/684552

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.3

This issue was fixed in the openstack/neutron 14.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.5

This issue was fixed in the openstack/neutron 13.0.5 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/queens)

Change abandoned by Lajos Katona (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/684552
Reason: I think it is better to abandon this patch

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.