l3 agent avoid unnecessary full_sync

Bug #1494682 reported by Sudhakar Gariganti
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Sudhakar Gariganti
Milestone: (none listed)

Bug Description

In the _process_router_update method, we set full_sync to True in a couple of places where it can be avoided.

There is even a TODO from Carl saying so.

# TODO(Carl) Stop this fullsync non-sense. Just retry this
# one router by sticking the update at the end of the queue
# at a lower priority.
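
A minimal sketch of what the TODO describes, assuming a simple priority queue. The names here (UpdateQueue, RouterUpdate, PRIORITY_RETRY, process_next) are hypothetical stand-ins for illustration and are not the neutron classes or the merged fix. On failure, only the failed router's update is put back on the queue at a lower priority, instead of flagging a full sync of every router.

```python
import heapq
import itertools

PRIORITY_RPC = 0    # assumed priority for normal updates
PRIORITY_RETRY = 1  # assumed lower priority for retried updates


class RouterUpdate:
    """Hypothetical update record for a single router."""

    def __init__(self, router_id, priority=PRIORITY_RPC):
        self.router_id = router_id
        self.priority = priority


class UpdateQueue:
    """Priority queue; lower numbers are served first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, update):
        heapq.heappush(self._heap, (update.priority, next(self._counter), update))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def process_next(queue, configure_router):
    """Process one update; on failure, re-queue only that router."""
    update = queue.pop()
    try:
        configure_router(update.router_id)
        return True
    except Exception:
        # No full sync: retry this one router after the pending updates.
        update.priority = PRIORITY_RETRY
        queue.add(update)
        return False
```

With two queued routers where the first fails once, the second router is processed before the first is retried, and no other router is touched. A real agent would likely also bound the number of retries; that is left out of this sketch.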

Changed in neutron:
assignee: nobody → Sudhakar Gariganti (sudhakar-gariganti)
status: New → In Progress
tags: added: l3-ipam-dhcp
removed: l3-dvr-backlog
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/224019

Changed in neutron:
importance: Undecided → Low
Sudhakar Gariganti (sudhakar-gariganti) wrote :

From a functionality point of view, I agree it is Low. But from a scale point of view, the impact is significant.

A single random RPC timeout at scale will put the L3 agent into an indefinite cycle of full syncs, with a terrible impact on the DB and controller operations, which will eventually degrade the performance of other agents as well.

At a scale of fewer than 1000 networks, it was taking multiple hours for the cloud to get back into shape. Imagine the situation at a higher scale.

I agree it's late in the cycle, but if there is a chance, I feel it would be good to have this for Liberty.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/224019
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4957b5b43521a61873a041fe3e8989ed399903d9
Submitter: Jenkins
Branch: master

commit 4957b5b43521a61873a041fe3e8989ed399903d9
Author: Sudhakar Babu Gariganti <email address hidden>
Date: Wed Sep 16 15:53:57 2015 +0530

    Avoid full_sync in l3_agent for router updates

    While processing a router update in _process_router_update method,
    if an exception occurs, we try to do a full_sync.

    We only need to re-sync the router whose update failed.

    Addressed a TODO in the same method, which falls in similar lines.

    Change-Id: I7c43a508adf46d8524f1cc48b83f1e1c276a2de0
    Closes-Bug: #1494682

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/259510

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/259708

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/259708
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=822ad5f06bcef8f95f36032d4fd4709975cecc31
Submitter: Jenkins
Branch: master

commit 822ad5f06bcef8f95f36032d4fd4709975cecc31
Author: Assaf Muller <email address hidden>
Date: Sat Dec 19 14:13:43 2015 -0500

    Force L3 agent to resync router it could not configure

    If the L3 agent fails to configure a router, commit:
    4957b5b43521a61873a041fe3e8989ed399903d9 changed it so
    that instead of performing an expensive full sync, only that
    router is reconfigured. However, it tries to reconfigure the
    cached router. This is a change of behavior from the fullsync
    days. The retry is more likely to succeed if the
    router is retrieved from the server, instead of using
    the locally cached version, in case the user or operator
    fixed bad input, or if the router was retrieved in a bad
    state due to a server-side race condition.

    Note that this is only relevant to full syncs, as those retrieve
    routers from the server and queue updates with the router object.
    Incremental updates queue up updates without router objects,
    so if one of those fails it would always be resynced on a
    second attempt.

    Related-Bug: #1494682
    Change-Id: Id0565e11b3023a639589f2734488029f194e2f9d
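
A minimal sketch of the behavior this commit message describes, under stated assumptions. The names (QueuedUpdate, resolve_router, process_update, and the fetch_from_server, configure_router and requeue callables) are hypothetical and not the neutron API. An update queued by a full sync carries a cached router object; if configuring it fails, the cached copy is dropped so the retry has to fetch the router from the server again.

```python
class QueuedUpdate:
    """Hypothetical queued update; `router` is set only by a full sync."""

    def __init__(self, router_id, router=None):
        self.router_id = router_id
        self.router = router  # cached router object, or None


def resolve_router(update, fetch_from_server):
    """Use the cached router if present, otherwise ask the server."""
    if update.router is None:
        update.router = fetch_from_server(update.router_id)
    return update.router


def process_update(update, fetch_from_server, configure_router, requeue):
    """Configure one router; on failure, force a fresh fetch on retry."""
    router = resolve_router(update, fetch_from_server)
    try:
        configure_router(router)
        return True
    except Exception:
        # Drop the cached copy: the retry should see whatever the server
        # holds now, in case the bad input was fixed in the meantime.
        update.router = None
        requeue(update)
        return False
```

Updates that were queued without a router object (the incremental case in the message above) already go to the server on every attempt, so clearing the cache only changes the outcome for updates that came from a full sync.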

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/261044

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/259510
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=430892ab60480ea084429bab2379f378c5b7c5c8
Submitter: Jenkins
Branch: stable/liberty

commit 430892ab60480ea084429bab2379f378c5b7c5c8
Author: Sudhakar Babu Gariganti <email address hidden>
Date: Wed Sep 16 15:53:57 2015 +0530

    Avoid full_sync in l3_agent for router updates

    While processing a router update in _process_router_update method,
    if an exception occurs, we try to do a full_sync.

    We only need to re-sync the router whose update failed.

    Addressed a TODO in the same method, which falls in similar lines.

    Change-Id: I7c43a508adf46d8524f1cc48b83f1e1c276a2de0
    Closes-Bug: #1494682
    (cherry picked from commit 4957b5b43521a61873a041fe3e8989ed399903d9)

tags: added: in-stable-liberty
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/261044
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d11e9cb550f7b80d2d81327cd17d540352dce43d
Submitter: Jenkins
Branch: stable/liberty

commit d11e9cb550f7b80d2d81327cd17d540352dce43d
Author: Assaf Muller <email address hidden>
Date: Sat Dec 19 14:13:43 2015 -0500

    Force L3 agent to resync router it could not configure

    If the L3 agent fails to configure a router, commit:
    4957b5b43521a61873a041fe3e8989ed399903d9 changed it so
    that instead of performing an expensive full sync, only that
    router is reconfigured. However, it tries to reconfigure the
    cached router. This is a change of behavior from the fullsync
    days. The retry is more likely to succeed if the
    router is retrieved from the server, instead of using
    the locally cached version, in case the user or operator
    fixed bad input, or if the router was retrieved in a bad
    state due to a server-side race condition.

    Note that this is only relevant to full syncs, as those retrieve
    routers from the server and queue updates with the router object.
    Incremental updates queue up updates without router objects,
    so if one of those fails it would always be resynced on a
    second attempt.

    Related-Bug: #1494682
    Change-Id: Id0565e11b3023a639589f2734488029f194e2f9d
    (cherry picked from commit 822ad5f06bcef8f95f36032d4fd4709975cecc31)

Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b2

This issue was fixed in the openstack/neutron 8.0.0.0b2 development milestone.

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.0.2

This issue was fixed in the openstack/neutron 7.0.2 release.

Ihar Hrachyshka (ihar-hrachyshka) wrote :

It's at least Medium, maybe High since it hits our scalability really hard.

Changed in neutron:
importance: Low → Medium