[L3] DVR router in compute node was not up but nova port needs its functionality

Bug #1813787 reported by LIU Yulong on 2019-01-29
This bug affects 1 person
Affects: neutron
Importance: High
Assigned to: LIU Yulong

Bug Description

There is a race condition between nova-compute booting an instance and the l3-agent processing the DVR (local) router on the compute node.
This issue can be seen when a large number of instances are booted on the same host, with the instances spread across different DVR routers,
so the l3-agent has to process all of these DVR routers on that host concurrently.
Although we have a green pool for the router ResourceProcessingQueue with 8 greenlets,
https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L642
some of these routers can still be left waiting; even worse, there are time-consuming actions during the router processing procedure,
for instance installing ARP entries, iptables rules, route rules, etc.
So when the VM is up, it tries to fetch metadata via the local proxy hosted by the DVR router, but the router is not ready yet on that host.
As a result, those instances fail to set up some of their configuration in the guest OS.

Some potential solutions:
(1) increase the green pool size
(2) use a provisioning block to keep the VM port from being set to ACTIVE until the DVR router is up on that host
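
The bottleneck behind option (1) can be sketched with a minimal stand-in. Neutron's l3-agent uses an eventlet green pool, but the same bounded-worker behaviour can be shown with the stdlib ThreadPoolExecutor; the names and the trivial worker body here are illustrative, not neutron code:

```python
from concurrent.futures import ThreadPoolExecutor

# Minimal stand-in for the l3-agent's ResourceProcessingQueue worker
# pool. With only 8 workers, a burst of N router updates leaves N - 8
# updates waiting while each in-flight router performs its slow setup
# (ARP entries, iptables rules, route rules).

POOL_SIZE = 8  # the original hard-coded green pool size

def process_router(router_id, done):
    # Placeholder for the time-consuming per-router work.
    done.append(router_id)

def drain(updates):
    """Process all queued router updates through the bounded pool."""
    done = []
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        for router_id in updates:
            pool.submit(process_router, router_id, done)
    return done
```

A VM that boots while its router's update is still queued behind the 8 in-flight workers finds no local metadata proxy yet, which is exactly the race described above.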

Brian Haley (brian-haley) wrote :

I think (1) and (2) will both help, with the provisioning block probably covering more cases.

The other change is batching the DVR ARP entry processing, for which there is a separate bug; patches have been proposed a couple of times, but nothing has merged yet. I believe that is one of the longer operations when creating the router.

Maybe figuring out where most time is spent is a good first step so we can prioritize what area to address first.

LIU Yulong (dragon889) on 2019-01-30
Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: New → In Progress

Related fix proposed to branch: master
Review: https://review.openstack.org/633871

Reviewed: https://review.openstack.org/633869
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=837c9283abd4ccb56d5b4ad0eb1ca435cd2fdf3b
Submitter: Zuul
Branch: master

commit 837c9283abd4ccb56d5b4ad0eb1ca435cd2fdf3b
Author: LIU Yulong <email address hidden>
Date: Wed Jan 30 09:54:52 2019 +0800

    Dynamically increase l3 router process queue green pool size

    There is a race condition between nova-compute boots instance and
    l3-agent processes DVR (local) router in compute node. This issue
    can be seen when a large number of instances were booted to one
    same host, and instances are under different DVR router. So the
    l3-agent will concurrently process all these dvr routers in this
    host at the same time.
    For now we have a green pool for the router ResourceProcessingQueue
    with 8 greenlets, but some of these routers can still be left
    waiting; even worse, there are time-consuming actions during the
    router processing procedure. For instance, installing ARP entries,
    iptables rules, route rules etc.
    So when the VM is up, it will try to get meta via the local proxy
    hosting by the dvr router. But the router is not ready yet in that
    host. And finally those instances will not be able to setup some
    config in the guest OS.

    This patch adds a new measurement based on the router quantity to
    determine the L3 router process queue green pool size. The pool size
    will be limited between 8 (the original value) and 32, because we do
    not want the L3 agent to cost too much host resource on processing
    routers in the compute node.

    Related-Bug: #1813787
    Change-Id: I62393864a103d666d5d9d379073f5fc23ac7d114
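
The sizing rule the commit message describes can be sketched as follows; the function name and the exact growth formula are assumptions for illustration, not the merged code:

```python
# Sketch of the sizing rule: grow the green pool with the number of
# routers hosted on this node, but clamp the result between the original
# size (8) and an upper bound (32) so the L3 agent does not consume too
# much of the compute node's resources.

DEFAULT_POOL_SIZE = 8   # the original hard-coded green pool size
MAX_POOL_SIZE = 32      # cap to limit host resource usage

def calculate_pool_size(router_count):
    """Return a green pool size bounded to [8, 32]."""
    return max(DEFAULT_POOL_SIZE, min(router_count, MAX_POOL_SIZE))
```

With few routers the pool stays at 8; a host with 20 routers gets 20 workers; anything beyond 32 routers is still capped at 32.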

Changed in neutron:
importance: Undecided → Medium
Miguel Lavalle (minsel) on 2019-03-18
Changed in neutron:
milestone: none → stein-rc1

Reviewed: https://review.openstack.org/641490
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7369b69e2ef5b1b3c30b237885c2648c63f1dffb
Submitter: Zuul
Branch: master

commit 7369b69e2ef5b1b3c30b237885c2648c63f1dffb
Author: Brian Haley <email address hidden>
Date: Wed Mar 6 16:47:27 2019 -0500

    Dynamically increase DHCP process queue green pool size

    As done for the l3-agent in 837c9283abd4ccb56d5b4ad0eb1ca435cd2fdf3b,
    dynamically resize the DHCP process queue green pool.

    This patch adds a new measurement based on the network quantity to
    indicate the DHCP process queue green pool size. The pool size
    will be limited from 8 (original value) to 32, because we do not want
    to increase the DHCP agent processing cost on the node.

    Change-Id: Ic0e7bc15f138273c7a6ad41f228c9f315e6c7a91
    Related-Bug: #1813787

Reviewed: https://review.opendev.org/654815
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9d60716cf1d61286a684f20ef8e05c77a0df5aa3
Submitter: Zuul
Branch: master

commit 9d60716cf1d61286a684f20ef8e05c77a0df5aa3
Author: LIU Yulong <email address hidden>
Date: Tue Apr 23 15:27:02 2019 +0800

    Add update_id for ResourceUpdate

    Add a unique id for resource update, then we can calculate
    the resource processing time and track it.

    Related-Bug: #1825152
    Related-Bug: #1824911
    Related-Bug: #1821912
    Related-Bug: #1813787

    Change-Id: Ib4d197c6c180c32860964440882393794aabb6ef
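
The idea behind this commit can be sketched like so; the class layout, field names, and the `elapsed` helper are assumptions for illustration, not the merged neutron code:

```python
import time
import uuid

# Give each queued resource update a unique id plus a creation
# timestamp, so the agent can correlate log lines for one update and
# measure how long it waited in the queue and how long processing took.

class ResourceUpdate:
    def __init__(self, resource_id):
        self.id = resource_id
        self.update_id = uuid.uuid4().hex  # unique per queued update
        self.create_time = time.monotonic()

    def elapsed(self):
        """Seconds since this update entered the queue."""
        return time.monotonic() - self.create_time
```

Logging `update_id` at enqueue and at completion is what makes per-update processing-time tracking possible across the related bugs listed above.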

Reviewed: https://review.opendev.org/660758
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fe4fc33f1c1bbb655e99733edf762e3e9debdd3e
Submitter: Zuul
Branch: stable/stein

commit fe4fc33f1c1bbb655e99733edf762e3e9debdd3e
Author: LIU Yulong <email address hidden>
Date: Tue Apr 23 15:27:02 2019 +0800

    Add update_id for ResourceUpdate

    Add a unique id for resource update, then we can calculate
    the resource processing time and track it.

    Related-Bug: #1825152
    Related-Bug: #1824911
    Related-Bug: #1821912
    Related-Bug: #1813787

    Change-Id: Ib4d197c6c180c32860964440882393794aabb6ef
    (cherry picked from commit 9d60716cf1d61286a684f20ef8e05c77a0df5aa3)

tags: added: in-stable-stein
LIU Yulong (dragon889) wrote :

Raising the bug importance, because this issue has been open for a long time.

Changed in neutron:
importance: Medium → High