[L3] DVR router in compute node was not up but nova port needs its functionality

Bug #1813787 reported by LIU Yulong
Affects: neutron
Status: In Progress
Importance: Wishlist
Assigned to: Unassigned
Milestone: none

Bug Description

There is a race condition between nova-compute booting an instance and the l3-agent processing the DVR (local) router on the compute node.
This issue can be seen when a large number of instances are booted on the same host and the instances are attached to different DVR routers.
The l3-agent then has to process all of these DVR routers on that host at the same time.
Although we have a green pool for the router ResourceProcessingQueue with 8 greenlets,
https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L642
some of these routers can still be left waiting. Even worse, the router processing procedure includes time-consuming actions,
for instance installing ARP entries, iptables rules, route rules, etc.
So when the VM is up, it will try to fetch its metadata via the local proxy hosted by the DVR router, but the router is not yet ready on that host.
As a result, those instances are not able to set up some configuration in the guest OS.

Some potential solutions:
(1) increase the size of that green pool
(2) keep a (provisioning) block on the VM port so that it is not set to ACTIVE until the DVR router is up on that host, at least for the first port on that host.
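
To illustrate the idea behind solution (2), here is a minimal sketch of a provisioning block: the port stays DOWN until every registered component has reported completion. It does not use Neutron's real provisioning-block code; the class name `PortProvisioning` and the component names `"L2"` and `"DVR_ROUTER"` are hypothetical, chosen only for this example.

```python
class PortProvisioning:
    """Track which components must finish before a port may go ACTIVE."""

    def __init__(self):
        # port_id -> set of component names that have not completed yet
        self._pending = {}

    def add_block(self, port_id, component):
        """Register a component that must complete for this port."""
        self._pending.setdefault(port_id, set()).add(component)

    def complete(self, port_id, component):
        """Mark one component done; return True once nothing is pending."""
        blocks = self._pending.get(port_id, set())
        blocks.discard(component)
        if not blocks:
            self._pending.pop(port_id, None)
            return True
        return False

    def status(self, port_id):
        """A port with any pending block is reported as DOWN."""
        return "DOWN" if self._pending.get(port_id) else "ACTIVE"


tracker = PortProvisioning()
tracker.add_block("port-1", "L2")
tracker.add_block("port-1", "DVR_ROUTER")   # hypothetical extra block
tracker.complete("port-1", "L2")            # L2 wiring done, router still pending
print(tracker.status("port-1"))
tracker.complete("port-1", "DVR_ROUTER")    # local DVR router is now up
print(tracker.status("port-1"))
```

With such a block in place, nova would not see the port as ACTIVE (and so would not start the guest) before the local router and its metadata proxy are ready.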

Revision history for this message
Brian Haley (brian-haley) wrote :

I think (1) and (2) will both help, with the provisioning block probably working in more cases.

The other change is batching the DVR ARP entry processing. There is another bug for that, and it has been proposed a couple of times, but nothing has merged yet. I believe that is one of the longer operations when creating the router.

Maybe figuring out where most time is spent is a good first step so we can prioritize what area to address first.
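
One way to find where the time goes is to wrap each step of router processing in a timer and sort the totals. The sketch below is only an illustration under assumptions: `StepTimer` and the step names are hypothetical and are not part of the l3-agent, and the `time.sleep` calls stand in for the real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulate wall-clock time per named processing step."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def step(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            # Record the elapsed time even if the step raised.
            self.totals[name] += time.monotonic() - start

    def slowest_first(self):
        """Return (step, seconds) pairs, most expensive step first."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


timer = StepTimer()
with timer.step("arp_entries"):
    time.sleep(0.02)     # placeholder for installing ARP entries
with timer.step("iptables_rules"):
    time.sleep(0.01)     # placeholder for applying iptables rules
for name, seconds in timer.slowest_first():
    print(f"{name}: {seconds:.3f}s")
```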

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/633869

LIU Yulong (dragon889)
Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/633871

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/633869
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=837c9283abd4ccb56d5b4ad0eb1ca435cd2fdf3b
Submitter: Zuul
Branch: master

commit 837c9283abd4ccb56d5b4ad0eb1ca435cd2fdf3b
Author: LIU Yulong <email address hidden>
Date: Wed Jan 30 09:54:52 2019 +0800

    Dynamically increase l3 router process queue green pool size

    There is a race condition between nova-compute boots instance and
    l3-agent processes DVR (local) router in compute node. This issue
    can be seen when a large number of instances were booted to one
    same host, and instances are under different DVR router. So the
    l3-agent will concurrently process all these dvr routers in this
    host at the same time.
    For now we have a green pool for the router ResourceProcessingQueue
    with 8 greenlets, but some of these routers can still be waiting, even
    worse thing is that there are time-consuming actions during the router
    processing procedure. For instance, installing arp entries, iptables
    rules, route rules etc.
    So when the VM is up, it will try to get meta via the local proxy
    hosting by the dvr router. But the router is not ready yet in that
    host. And finally those instances will not be able to setup some
    config in the guest OS.

    This patch adds a new measurement based on the router quantity to
    indicate the L3 router process queue green pool size. The pool size
    will be limited from 8 (original value) to 32, because we do not want
    the L3 agent cost too much host resource on processing router in the
    compute node.

    Related-Bug: #1813787
    Change-Id: I62393864a103d666d5d9d379073f5fc23ac7d114
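
The sizing rule described in this commit message can be sketched as follows. This is an illustration, not the merged patch: it assumes the pool size simply follows the number of routers on the host, clamped to the 8..32 range stated above, and the names `desired_pool_size`, `ResizablePool` and `resize_if_needed` are hypothetical. `ResizablePool` only stands in for a green pool that supports resizing.

```python
POOL_MIN = 8    # original pool size, per the commit message
POOL_MAX = 32   # upper bound, per the commit message


def desired_pool_size(router_count, minimum=POOL_MIN, maximum=POOL_MAX):
    """Clamp the router count into the [minimum, maximum] range."""
    return max(minimum, min(maximum, router_count))


class ResizablePool:
    """Stand-in for a green pool whose worker count can be changed."""

    def __init__(self, size=POOL_MIN):
        self.size = size

    def resize(self, new_size):
        self.size = new_size


def resize_if_needed(pool, router_count):
    """Resize only when the target size differs; return True if resized."""
    new_size = desired_pool_size(router_count)
    if new_size == pool.size:
        return False
    pool.resize(new_size)
    return True


pool = ResizablePool()
for count in (3, 20, 100):
    resize_if_needed(pool, count)
    print(count, "routers ->", pool.size, "greenlets")
```

Skipping the resize when the size is unchanged keeps the common case (router count stable) free of extra work.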

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/641490

Changed in neutron:
importance: Undecided → Medium
Miguel Lavalle (minsel)
Changed in neutron:
milestone: none → stein-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/641490
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7369b69e2ef5b1b3c30b237885c2648c63f1dffb
Submitter: Zuul
Branch: master

commit 7369b69e2ef5b1b3c30b237885c2648c63f1dffb
Author: Brian Haley <email address hidden>
Date: Wed Mar 6 16:47:27 2019 -0500

    Dynamically increase DHCP process queue green pool size

    As done for the l3-agent in 837c9283abd4ccb56d5b4ad0eb1ca435cd2fdf3b,
    dynamically resize the DHCP process queue green pool.

    This patch adds a new measurement based on the network quantity to
    indicate the DHCP process queue green pool size. The pool size
    will be limited from 8 (original value) to 32, because we do not want
    to increase the DHCP agent processing cost on the node.

    Change-Id: Ic0e7bc15f138273c7a6ad41f228c9f315e6c7a91
    Related-Bug: #1813787

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/654815

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/654815
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9d60716cf1d61286a684f20ef8e05c77a0df5aa3
Submitter: Zuul
Branch: master

commit 9d60716cf1d61286a684f20ef8e05c77a0df5aa3
Author: LIU Yulong <email address hidden>
Date: Tue Apr 23 15:27:02 2019 +0800

    Add update_id for ResourceUpdate

    Add a unique id for resource update, then we can calculate
    the resource processing time and track it.

    Related-Bug: #1825152
    Related-Bug: #1824911
    Related-Bug: #1821912
    Related-Bug: #1813787

    Change-Id: Ib4d197c6c180c32860964440882393794aabb6ef
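
As a rough illustration of what a unique update id makes possible, the sketch below tags each update with an id and timestamps so that queue time and processing time can be computed and logged per update. It is not the code from this commit: `TrackedUpdate` and its method names are hypothetical.

```python
import time
import uuid


class TrackedUpdate:
    """A resource update carrying a unique id and timing marks."""

    def __init__(self, resource_id):
        self.resource_id = resource_id
        self.update_id = str(uuid.uuid4())   # unique per update, not per resource
        self.created_at = time.monotonic()
        self.started_at = None
        self.finished_at = None

    def mark_started(self):
        self.started_at = time.monotonic()

    def mark_finished(self):
        self.finished_at = time.monotonic()

    def queue_time(self):
        """Seconds spent waiting in the queue, or None if not started."""
        if self.started_at is None:
            return None
        return self.started_at - self.created_at

    def processing_time(self):
        """Seconds spent being processed, or None if not finished."""
        if self.started_at is None or self.finished_at is None:
            return None
        return self.finished_at - self.started_at


update = TrackedUpdate("router-1")
update.mark_started()
update.mark_finished()
print(update.update_id, update.queue_time(), update.processing_time())
```

Two updates for the same router get different ids, so their log lines can be told apart even when they are processed back to back.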

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/660758

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/660758
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fe4fc33f1c1bbb655e99733edf762e3e9debdd3e
Submitter: Zuul
Branch: stable/stein

commit fe4fc33f1c1bbb655e99733edf762e3e9debdd3e
Author: LIU Yulong <email address hidden>
Date: Tue Apr 23 15:27:02 2019 +0800

    Add update_id for ResourceUpdate

    Add a unique id for resource update, then we can calculate
    the resource processing time and track it.

    Related-Bug: #1825152
    Related-Bug: #1824911
    Related-Bug: #1821912
    Related-Bug: #1813787

    Change-Id: Ib4d197c6c180c32860964440882393794aabb6ef
    (cherry picked from commit 9d60716cf1d61286a684f20ef8e05c77a0df5aa3)

tags: added: in-stable-stein
Revision history for this message
LIU Yulong (dragon889) wrote :

Raising the bug importance, because this issue has been open for a long time.

Changed in neutron:
importance: Medium → High
Changed in neutron:
milestone: stein-rc1 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/728287

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/728288

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/728288
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=90496824c0253d2534f299ebcf5dc00774f70fe7
Submitter: Zuul
Branch: stable/queens

commit 90496824c0253d2534f299ebcf5dc00774f70fe7
Author: LIU Yulong <email address hidden>
Date: Wed Jan 30 09:54:52 2019 +0800

    Dynamically increase l3 router process queue green pool size

    There is a race condition between nova-compute boots instance and
    l3-agent processes DVR (local) router in compute node. This issue
    can be seen when a large number of instances were booted to one
    same host, and instances are under different DVR router. So the
    l3-agent will concurrently process all these dvr routers in this
    host at the same time.
    For now we have a green pool for the router ResourceProcessingQueue
    with 8 greenlets, but some of these routers can still be waiting, even
    worse thing is that there are time-consuming actions during the router
    processing procedure. For instance, installing arp entries, iptables
    rules, route rules etc.
    So when the VM is up, it will try to get meta via the local proxy
    hosting by the dvr router. But the router is not ready yet in that
    host. And finally those instances will not be able to setup some
    config in the guest OS.

    This patch adds a new measurement based on the router quantity to
    indicate the L3 router process queue green pool size. The pool size
    will be limited from 8 (original value) to 32, because we do not want
    the L3 agent cost too much host resource on processing router in the
    compute node.

    Related-Bug: #1813787
    Change-Id: I62393864a103d666d5d9d379073f5fc23ac7d114
    (cherry picked from commit 837c9283abd4ccb56d5b4ad0eb1ca435cd2fdf3b)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/728287
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0b9f4f275c681feb9dcbf4bef0ff29d5344f0fdc
Submitter: Zuul
Branch: stable/rocky

commit 0b9f4f275c681feb9dcbf4bef0ff29d5344f0fdc
Author: LIU Yulong <email address hidden>
Date: Wed Jan 30 09:54:52 2019 +0800

    Dynamically increase l3 router process queue green pool size

    There is a race condition between nova-compute boots instance and
    l3-agent processes DVR (local) router in compute node. This issue
    can be seen when a large number of instances were booted to one
    same host, and instances are under different DVR router. So the
    l3-agent will concurrently process all these dvr routers in this
    host at the same time.
    For now we have a green pool for the router ResourceProcessingQueue
    with 8 greenlets, but some of these routers can still be waiting, even
    worse thing is that there are time-consuming actions during the router
    processing procedure. For instance, installing arp entries, iptables
    rules, route rules etc.
    So when the VM is up, it will try to get meta via the local proxy
    hosting by the dvr router. But the router is not ready yet in that
    host. And finally those instances will not be able to setup some
    config in the guest OS.

    This patch adds a new measurement based on the router quantity to
    indicate the L3 router process queue green pool size. The pool size
    will be limited from 8 (original value) to 32, because we do not want
    the L3 agent cost too much host resource on processing router in the
    compute node.

    Conflicts:
        neutron/tests/functional/agent/l3/test_legacy_router.py

    Related-Bug: #1813787
    Change-Id: I62393864a103d666d5d9d379073f5fc23ac7d114
    (cherry picked from commit 837c9283abd4ccb56d5b4ad0eb1ca435cd2fdf3b)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/757662

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/757662
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=297f4c20c4c914f7fc8528e70d572adbeff68b0f
Submitter: Zuul
Branch: stable/rocky

commit 297f4c20c4c914f7fc8528e70d572adbeff68b0f
Author: LIU Yulong <email address hidden>
Date: Tue Apr 23 15:27:02 2019 +0800

    Add update_id for ResourceUpdate

    Add a unique id for resource update, then we can calculate
    the resource processing time and track it.

    Related-Bug: #1825152
    Related-Bug: #1824911
    Related-Bug: #1821912
    Related-Bug: #1813787

    Conflicts:
            neutron/agent/l3/agent.py

    Change-Id: Ib4d197c6c180c32860964440882393794aabb6ef
    (cherry picked from commit 9d60716cf1d61286a684f20ef8e05c77a0df5aa3)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "liuyulong <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/633871
Reason: No response for a long time; restore if someone wants to address this issue.

Revision history for this message
LIU Yulong (dragon889) wrote :

There has been no response to the patch https://review.opendev.org/c/openstack/neutron/+/633871 for a long time; restore it if someone wants to address this issue.

Changed in neutron:
assignee: LIU Yulong (dragon889) → nobody
status: In Progress → New
importance: High → Wishlist
Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "liuyulong <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/633871
Reason: Restore if someday we want this.
