Bug #1813715 “[L2][scale issue] ovs-agent meets unexpected tunne...” : Bugs : neutron

Swaminathan Vasudevan (swaminathan-vasudevan) on 2019-01-29

tags:

added: ovs

Swaminathan Vasudevan (swaminathan-vasudevan) wrote on 2019-01-29:

#1

I think the root cause is that the ovs-agent and ovs-vswitchd are not able to keep up with the connections under heavy load.
So these issues are probably side effects and probably can't be fixed on their own.
So I would say that we should focus on the efficiency of the ovs-agent in handling the port info.
Also, you mentioned that this happens when there are just 200 VM ports on the compute node, so the ovs-agent on the compute side may be able to handle more than 200 VM ports, but the server side may not be able to cope if the number of ports is 2000 or more (right?).
Not sure if this is an issue with the ML2 mechanism driver, the DB, or RPC.

LIU Yulong (dragon889) wrote on 2019-01-30:

#2

Currently we do not have a clue about the cause of this issue, but it can indeed be seen in our environment. Maybe, as you said, the heavy load is one problem.
No, we did not see many exceptions on the neutron server side. As for the compute node, although it only hosts 150-200 compute ports, it tries to get all the subnet port info during restart/sync/SG install. This costs too much. And maybe some partial action or data causes the tunnel to go down or be lost.

Swaminathan Vasudevan (swaminathan-vasudevan) on 2019-02-07

Changed in neutron:
status:	New → Confirmed

LIU Yulong (dragon889) on 2019-02-22

summary:

- [L2][scale issue] ovs-agent meet unexpected tunnel lost
+ [L2][scale issue] ovs-agent meets unexpected tunnel lost

OpenStack Infra (hudson-openstack) wrote on 2019-03-04: Fix proposed to neutron (master)

#3

Fix proposed to branch: master
Review: https://review.openstack.org/640797

Changed in neutron:
assignee:	nobody → LIU Yulong (dragon889)
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) on 2019-03-07

Changed in neutron:
assignee:	LIU Yulong (dragon889) → Brian Haley (brian-haley)

OpenStack Infra (hudson-openstack) on 2019-03-07

Changed in neutron:
assignee:	Brian Haley (brian-haley) → LIU Yulong (dragon889)

OpenStack Infra (hudson-openstack) wrote on 2019-03-22: Fix proposed to neutron (stable/rocky)

#4

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/645405

OpenStack Infra (hudson-openstack) wrote on 2019-03-22: Fix proposed to neutron (stable/queens)

#5

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/645406

OpenStack Infra (hudson-openstack) wrote on 2019-03-22: Fix proposed to neutron (stable/pike)

#6

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/645408

OpenStack Infra (hudson-openstack) wrote on 2019-03-23: Fix merged to neutron (master)

#7

Reviewed: https://review.openstack.org/640797
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch: master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

More accurate agent restart state transfer

    The ovs-agent can take a very long time to handle a large number
    of ports. By then, the ovs-agent status report may have
    exceeded the configured timeout value, so some flow update operations
    will not be triggered. This results in flow loss during agent
    restart, especially for host-to-host vxlan tunnel flows.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
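
To make the idea in the commit message concrete, here is a minimal sketch of the decision it describes. It is not the neutron code: the names `needs_full_fdb_sync` and `AgentLoop` and the 180-second threshold are assumptions made up for illustration. It only shows why a purely time-based guess can miss a slow agent restart, and how an explicit "restarted" flag sent in the first rpc loop avoids that.

```python
from typing import Optional

# Hypothetical threshold for this sketch: how long after start-up the
# server still treats an agent as "just restarted".
AGENT_BOOT_TIME = 180  # seconds


def needs_full_fdb_sync(agent_uptime: float,
                        agent_restarted: Optional[bool] = None,
                        boot_time: float = AGENT_BOOT_TIME) -> bool:
    """Decide whether the server should resend all fdb entries to an agent."""
    if agent_restarted is not None:
        # An explicit flag from the agent wins over the time-based guess.
        return agent_restarted
    # Fallback heuristic: the agent counts as restarted only while it is young.
    return agent_uptime < boot_time


class AgentLoop:
    """Toy agent that reports 'restarted' only in its first rpc loop."""

    def __init__(self) -> None:
        self.iter_num = 0

    def run_once(self) -> bool:
        restarted = self.iter_num == 0
        self.iter_num += 1
        return restarted
```

In this sketch, an agent that needed 600 seconds to process its ports is no longer considered restarted by the time-based check alone (`needs_full_fdb_sync(600)` is `False`), while passing the flag from its first loop (`needs_full_fdb_sync(600, agent_restarted=True)`) still requests the full fdb update.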

Changed in neutron:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2019-03-24: Fix included in openstack/neutron 14.0.0.0rc1

#8

This issue was fixed in the openstack/neutron 14.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote on 2019-04-03: Fix proposed to neutron (stable/ocata)

#9

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/649729

OpenStack Infra (hudson-openstack) wrote on 2019-04-06: Fix merged to neutron (stable/pike)

#10

Reviewed: https://review.openstack.org/645408
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=62fe7852bbd70a24174853997096c52ee015e269
Submitter: Zuul
Branch: stable/pike

commit 62fe7852bbd70a24174853997096c52ee015e269
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

More accurate agent restart state transfer

    The ovs-agent can take a very long time to handle a large number
    of ports. By then, the ovs-agent status report may have
    exceeded the configured timeout value, so some flow update operations
    will not be triggered. This results in flow loss during agent
    restart, especially for host-to-host vxlan tunnel flows.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
        neutron/plugins/ml2/rpc.py

    Conflicts:
        neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)
    (cherry picked from commit cc49ab550179bdc76d79f48be67886681cb43d4e)
    (cherry picked from commit 5ffca4966877454c605442e9e429aa83ea7d7348)

tags:

added: in-stable-pike

OpenStack Infra (hudson-openstack) wrote on 2019-04-07: Fix merged to neutron (stable/rocky)

#11

Reviewed: https://review.openstack.org/645405
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cc49ab550179bdc76d79f48be67886681cb43d4e
Submitter: Zuul
Branch: stable/rocky

commit cc49ab550179bdc76d79f48be67886681cb43d4e
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

More accurate agent restart state transfer

    The ovs-agent can take a very long time to handle a large number
    of ports. By then, the ovs-agent status report may have
    exceeded the configured timeout value, so some flow update operations
    will not be triggered. This results in flow loss during agent
    restart, especially for host-to-host vxlan tunnel flows.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
        neutron/plugins/ml2/rpc.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)

tags:

added: in-stable-rocky

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Fix included in openstack/neutron 11.0.7

#12

This issue was fixed in the openstack/neutron 11.0.7 release.

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Fix included in openstack/neutron 13.0.3

#13

This issue was fixed in the openstack/neutron 13.0.3 release.

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Fix included in openstack/neutron 12.0.6

#14

This issue was fixed in the openstack/neutron 12.0.6 release.

OpenStack Infra (hudson-openstack) wrote on 2019-06-15: Fix merged to neutron (stable/ocata)

#15

Reviewed: https://review.opendev.org/649729
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9583dc0549da2b4529a59b5862ba42aebc5ae15f
Submitter: Zuul
Branch: stable/ocata

commit 9583dc0549da2b4529a59b5862ba42aebc5ae15f
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

More accurate agent restart state transfer

    The ovs-agent can take a very long time to handle a large number
    of ports. By then, the ovs-agent status report may have
    exceeded the configured timeout value, so some flow update operations
    will not be triggered. This results in flow loss during agent
    restart, especially for host-to-host vxlan tunnel flows.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
        neutron/plugins/ml2/rpc.py

    Conflicts:
        neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)
    (cherry picked from commit cc49ab550179bdc76d79f48be67886681cb43d4e)
    (cherry picked from commit 5ffca4966877454c605442e9e429aa83ea7d7348)

tags:

added: in-stable-ocata

Bernard Cafarelli (bcafarel) on 2019-06-19

tags:

added: neutron-proactive-backport-potential

OpenStack Infra (hudson-openstack) wrote on 2021-05-14: Fix included in openstack/neutron ocata-eol

#16

This issue was fixed in the openstack/neutron ocata-eol release.
