l2_pop flows missing when spawning VMs at a high rate

Bug #1789846 reported by Oleg Bondarev
This bug affects 4 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Oleg Bondarev
Milestone: (none listed)

Bug Description

version: Pike, DVR enabled, 28 compute nodes
scenario: spawn 140 VMs concurrently, with pre-created neutron ports
issue: on some compute nodes VMs cannot get an IP address because there is no reply to their DHCP broadcasts. All new VMs spawned on the same compute node in the same network have this problem.

It appears that the flood table on the affected compute nodes is missing outputs to the DHCP nodes:

 cookie=0x679aebcfbb8dc9a2, duration=296.991s, table=22, n_packets=2, n_bytes=220, priority=1,dl_vlan=4 actions=strip_vlan,load:0x14->NXM_NX_TUN_ID[],output:"vxlan-ac1ef47a",output:"vxlan-ac1ef480",output:"vxlan-ac1ef46d",output:"vxlan-ac1ef477",...
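
To check which remote tunnel endpoints a flood entry actually covers, the output ports can be pulled out of the flow dump and decoded. This is only a rough sketch: it assumes the usual OVS agent naming where a tunnel port is called "<type>-<remote IPv4 in hex>", and the helper names are hypothetical, not part of neutron.

```python
import ipaddress
import re

# Assumption: tunnel ports are named "<type>-<remote IPv4 as 8 hex digits>",
# e.g. "vxlan-ac1ef47a". The helper names below are illustrative only.
TUNNEL_PORT_RE = re.compile(r'output:"?((?:vxlan|gre|geneve)-([0-9a-f]{8}))"?')


def flood_endpoints(flow_line):
    """Return the remote IPs that a table=22 flood flow outputs to."""
    return [
        str(ipaddress.IPv4Address(int(hex_ip, 16)))
        for _name, hex_ip in TUNNEL_PORT_RE.findall(flow_line)
    ]


def missing_endpoints(flow_line, expected_ips):
    """Return expected endpoints (e.g. DHCP nodes) absent from the flow."""
    present = set(flood_endpoints(flow_line))
    return sorted(set(expected_ips) - present)


# Shortened copy of the flow above, keeping only two of its outputs.
flow = ('table=22, priority=1,dl_vlan=4 actions=strip_vlan,'
        'load:0x14->NXM_NX_TUN_ID[],output:"vxlan-ac1ef47a",'
        'output:"vxlan-ac1ef480"')
print(flood_endpoints(flow))
```

Passing the tunnel IPs of the DHCP nodes as expected_ips to missing_endpoints() would list the ones the flood entry does not reach.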

Oleg Bondarev (obondarev) wrote :

The following was seen in the openvswitch log:

 2018-08-28T17:26:10.843Z|06607|bridge|INFO|bridge br-int: added interface qvoac23292e-66 on port 435
2018-08-28T17:26:11.238Z|06608|bridge|INFO|bridge br-int: added interface qr-ac42c54b-a8 on port 436

and in the ovs agent log:

 2018-08-28 22:56:14,710.710 2189 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-9b0dc89c-d20a-4d18-b96b-3f7bae840795 - - - - -] Configuration for devices up [u'ac42c54b-a81d-498a-8f4b-f3bb8eb57d46', u'ac23292e-6612-4cb5-88b0-5ff64564e680'] and devices down [] completed.

where ac42c54b is a distributed router interface and ac23292e is a VM port.

So 2 ports are plugged and activated at the agent almost at the same time.

Oleg Bondarev (obondarev) wrote :

this code is suspected:

        if agent_active_ports == 1 or self.agent_restarted(context):
            # First port activated on current agent in this network,
            # we have to provide it with the whole list of fdb entries
            agent_fdb_entries = self._create_agent_fdb(port_context,
                                                       agent,
                                                       segment,
                                                       network_id)

            # And notify other agents to add flooding entry
            other_fdb_ports[agent_ip].append(const.FLOODING_ENTRY)

            if agent_fdb_entries[network_id]['ports'].keys():
                self.L2populationAgentNotify.add_fdb_entries(
                    self.rpc_ctx, agent_fdb_entries, agent_host)

In our case agent_active_ports == 2, so the condition is false and the agent is not provided with the whole list of fdb entries.

Changing the check to "if agent_active_ports in (1,2)" fixed the issue.
Patch to follow.
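
The effect of the change can be sketched in isolation. The two functions below are hypothetical stand-ins for the condition in the snippet above, not the actual neutron code; they only model the boolean check, with the restart flag passed in as a plain argument.

```python
def needs_full_fdb_old(agent_active_ports, agent_restarted=False):
    # Original check: only the very first active port triggers a full sync.
    return agent_active_ports == 1 or agent_restarted


def needs_full_fdb_new(agent_active_ports, agent_restarted=False):
    # Relaxed check: tolerate two ports becoming active together.
    return agent_active_ports in (1, 2) or agent_restarted


# A VM port and a DVR router port become active almost at the same time,
# so the count observed when the port update is processed is already 2.
count_seen = 2
print("old check sends full fdb:", needs_full_fdb_old(count_seen))
print("new check sends full fdb:", needs_full_fdb_new(count_seen))
```

With the old check a count of 2 never matches, so the agent is never sent the full list of fdb entries; the relaxed check covers that case.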

Oleg Bondarev (obondarev) wrote :

It was also seen that even when a DVR port is activated first and a VM port some time later, we may still face the issue. This might be related to the fact that VMs are spawned with pre-created ports, which is not a common case. Extending the check to also accept "agent_active_ports == 2" fixes this as well.

Given that the condition also contains "or self.agent_restarted(context)", which makes it True for the first 180 sec (by default) after agent restart, I believe the downside of changing 1 to 2 should be negligible.
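
To illustrate why the extra cost should be small, here is a minimal sketch of how the two parts of the condition combine. It assumes the restart check simply compares the agent's uptime with a boot-time window of 180 sec; the function and parameter names are illustrative and not taken from neutron.

```python
DEFAULT_AGENT_BOOT_TIME = 180  # seconds; the default window mentioned above


def agent_restarted(agent_uptime, boot_time=DEFAULT_AGENT_BOOT_TIME):
    # Assumption: an agent counts as "restarted" while its uptime is
    # shorter than the boot-time window.
    return agent_uptime < boot_time


def full_fdb_sent(agent_active_ports, agent_uptime):
    # Relaxed condition: first or second active port, or a fresh agent.
    return agent_active_ports in (1, 2) or agent_restarted(agent_uptime)


# Inside the window every port activation already triggers a full sync,
# whatever the port count; outside it only counts 1 and 2 do.
for uptime in (30, 600):
    print(uptime, [full_fdb_sent(n, uptime) for n in (1, 2, 3)])
```

Under these assumptions the change only adds one extra full sync per agent and network outside the restart window, namely the one for the second active port.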

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/598087

Changed in neutron:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/598087
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b32db30874a7729c4e9209cfc18e106a7e9fc697
Submitter: Zuul
Branch: master

commit b32db30874a7729c4e9209cfc18e106a7e9fc697
Author: Oleg Bondarev <email address hidden>
Date: Thu Aug 30 12:25:07 2018 +0400

    l2 pop: check for more than 1 first active port on a node

    With high concurrency more than 1 port may be activated on an
    OVS agent at the same time (like VM port + a DVR port),
    so the patch mitigates the condition by checking for 1 or 2
    first active ports.

    Given that the condition also contains "or self.agent_restarted(context)"
    which makes it True first 180 sec (by default) after agent restart,
    I believe the downside of changing 1 to 2 should be negligible.

    Please see bug for more details on the issue.

    Closes-Bug: #1789846
    Change-Id: Ieab0186cbe05185d47bbf5a31141563cf923f66f

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/600130

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/600131

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/600151

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/600151
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1ba5e6964b5eadd660f2f333e774ce174c4e6f06
Submitter: Zuul
Branch: stable/pike

commit 1ba5e6964b5eadd660f2f333e774ce174c4e6f06
Author: Oleg Bondarev <email address hidden>
Date: Thu Aug 30 12:25:07 2018 +0400

    l2 pop: check for more than 1 first active port on a node

    With high concurrency more than 1 port may be activated on an
    OVS agent at the same time (like VM port + a DVR port),
    so the patch mitigates the condition by checking for 1 or 2
    first active ports.

    Given that the condition also contains "or self.agent_restarted(context)"
    which makes it True first 180 sec (by default) after agent restart,
    I believe the downside of changing 1 to 2 should be negligible.

    Please see bug for more details on the issue.

    Closes-Bug: #1789846
    Change-Id: Ieab0186cbe05185d47bbf5a31141563cf923f66f
    (cherry picked from commit b32db30874a7729c4e9209cfc18e106a7e9fc697)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/600131
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ff7923f48d41ceccfcc3ab62558db0905b50cf06
Submitter: Zuul
Branch: stable/queens

commit ff7923f48d41ceccfcc3ab62558db0905b50cf06
Author: Oleg Bondarev <email address hidden>
Date: Thu Aug 30 12:25:07 2018 +0400

    l2 pop: check for more than 1 first active port on a node

    With high concurrency more than 1 port may be activated on an
    OVS agent at the same time (like VM port + a DVR port),
    so the patch mitigates the condition by checking for 1 or 2
    first active ports.

    Given that the condition also contains "or self.agent_restarted(context)"
    which makes it True first 180 sec (by default) after agent restart,
    I believe the downside of changing 1 to 2 should be negligible.

    Please see bug for more details on the issue.

    Closes-Bug: #1789846
    Change-Id: Ieab0186cbe05185d47bbf5a31141563cf923f66f
    (cherry picked from commit b32db30874a7729c4e9209cfc18e106a7e9fc697)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/600130
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dbc9441de3a7544a0980957e03c0e2cb0fd75954
Submitter: Zuul
Branch: stable/rocky

commit dbc9441de3a7544a0980957e03c0e2cb0fd75954
Author: Oleg Bondarev <email address hidden>
Date: Thu Aug 30 12:25:07 2018 +0400

    l2 pop: check for more than 1 first active port on a node

    With high concurrency more than 1 port may be activated on an
    OVS agent at the same time (like VM port + a DVR port),
    so the patch mitigates the condition by checking for 1 or 2
    first active ports.

    Given that the condition also contains "or self.agent_restarted(context)"
    which makes it True first 180 sec (by default) after agent restart,
    I believe the downside of changing 1 to 2 should be negligible.

    Please see bug for more details on the issue.

    Closes-Bug: #1789846
    Change-Id: Ieab0186cbe05185d47bbf5a31141563cf923f66f
    (cherry picked from commit b32db30874a7729c4e9209cfc18e106a7e9fc697)

tags: added: in-stable-rocky
tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.2

This issue was fixed in the openstack/neutron 13.0.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.6

This issue was fixed in the openstack/neutron 11.0.6 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.5

This issue was fixed in the openstack/neutron 12.0.5 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0b1

This issue was fixed in the openstack/neutron 14.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/649417

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/649417
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e8033b74c1caf6af6428cd814bac56cda2a8e600
Submitter: Zuul
Branch: stable/ocata

commit e8033b74c1caf6af6428cd814bac56cda2a8e600
Author: Oleg Bondarev <email address hidden>
Date: Thu Aug 30 12:25:07 2018 +0400

    l2 pop: check for more than 1 first active port on a node

    With high concurrency more than 1 port may be activated on an
    OVS agent at the same time (like VM port + a DVR port),
    so the patch mitigates the condition by checking for 1 or 2
    first active ports.

    Given that the condition also contains "or self.agent_restarted(context)"
    which makes it True first 180 sec (by default) after agent restart,
    I believe the downside of changing 1 to 2 should be negligible.

    Please see bug for more details on the issue.

    Closes-Bug: #1789846
    Change-Id: Ieab0186cbe05185d47bbf5a31141563cf923f66f
    (cherry picked from commit b32db30874a7729c4e9209cfc18e106a7e9fc697)
    (cherry picked from commit 1ba5e6964b5eadd660f2f333e774ce174c4e6f06)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.
