Router migrations failing in the gate

Bug #1838449 reported by Miguel Lavalle
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Miguel Lavalle

Bug Description

As of the reporting of this bug, router migrations are the largest contributor of failures of the neutron-tempest-plugin-dvr-multinode-scenario job. Over a 7 days period these are the failures observed:

test_qos_basic_and_update 48
test_from_dvr_to_dvr_ha 39
test_from_dvr_to_ha 38
test_from_dvr_to_legacy 23
test_connectivity_through_2_routers 17
test_snat_external_ip 17
test_vm_reachable_through_compute 10
test_trunk_subport_lifecycle 9
test_qos 8

Specifically, the migrations from dvr to something else seem to be the problem

Miguel Lavalle (minsel)
Changed in neutron:
importance: Undecided → Medium
assignee: nobody → Miguel Lavalle (minsel)
description: updated
Changed in neutron:
status: New → Confirmed
Revision history for this message
Miguel Lavalle (minsel) wrote :

Looking at the logs of this failure I notice the following:

1) The problem happens when the test sets the router admin_state_up to False and waits for the service port to go down.

2) The network:router_interface_distributed port doesn't change status to DOWN and the test fails

3) I can see (in ovs agent) the network:router_interface_distributed port removed from the node where the router was scheduled, but the neutron server never changes it to DOWN

Revision history for this message
Miguel Lavalle (minsel) wrote :
Revision history for this message
Miguel Lavalle (minsel) wrote :
Revision history for this message
Miguel Lavalle (minsel) wrote :

Running successfully test_from_dvr_to_dvr_ha locally in my dev environment I can see the following sequence of events when the L3 agents remove the router interfaces as result of setting the router admin_state_up to False: http://paste.openstack.org/show/755500/

Next step is to check how this sequence of events takes place in the check queue when test_from_dvr_to_dvr_ha is executed

tags: added: gate-failure l3-dvr-backlog
tags: removed: gate-failure
tags: added: gate-failure
Revision history for this message
Miguel Lavalle (minsel) wrote :

Looking at this failure http://logs.openstack.org/97/660797/5/check/neutron-tempest-plugin-dvr-multinode-scenario/7f25585/testr_results.html.gz, we can see that it is caused because the test timed out waiting for the status of "network:router_interface_distributed" port to be set to DOWN. This is the test code involved is https://github.com/openstack/neutron-tempest-plugin/blob/99684ac97eb1ad1e0be6fd824ba00ee707421e0f/neutron_tempest_plugin/scenario/test_migration.py#L70-L81

Looking at the sequence of events in the L3 and L2 agents involved, we can see that the L3 agent took more than five minutes to unplug the "network:router_interface_distributed" interface: http://paste.openstack.org/show/756780/

Nest step is to see why L3 agent is taking so long

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/677098

Revision history for this message
Miguel Lavalle (minsel) wrote :
Revision history for this message
Miguel Lavalle (minsel) wrote :

Describing first what "SHOULD HAPPEN" in a successful run of test_from_dvr_to_dvr_ha (see http://paste.openstack.org/show/769794/):

1) Test sets router "admin_state_up": false

2) L3 agents in controller and compute1 receive the router update notification and delete the router locally

In a failed execution, this is the observed sequence of events (see http://paste.openstack.org/show/769795/):

1) Test sets router "admin_state_up": false

2) L3 agents in compute1 (in this case) receives the router update notification and delete the router locally

3) L3 agent in controller receives an update router notification with related routers. The agent doesn't delete locally the router and queues for processing the related routers. As a consequence, the router ports are not set to status DOWN and the test case times out

I have observed this pattern several times.

My main suspect at this point in time is https://github.com/openstack/neutron/blob/78aae12a88e8b3cc0609c830527533b8a8a92d60/neutron/db/l3_dvrscheduler_db.py#L141-L153. This code was last modified in rapid succession by patches https://review.opendev.org/#/c/664525 and https://review.opendev.org/#/c/661522. It was originally created by https://review.opendev.org/#/c/597567

Revision history for this message
Miguel Lavalle (minsel) wrote :

After many re-checks, I haven't been enable to reproduce the bug with: https://review.opendev.org/#/c/677098. In the meantime, I analyzed a correct execution in this annotated trace: http://paste.openstack.org/show/780174/

Revision history for this message
Miguel Lavalle (minsel) wrote :

In regards to previous comment, this is how the router synch data is retrieved:

Sep 22 23:23:03.184754 ubuntu-bionic-fortnebula-regionone-0011677831 neutron-server[16342]: DEBUG neutron.db.l3_dvr_db [None req-2cb213b2-ce5b-472c-8f98-fc3675707464 None None] neutron.services.l3_router.l3_router_plugin.L3RouterPlugin method _get_dvr_sync_data called with arguments (<neutron_lib.context.Context object at 0x7f7bd1117438>, 'ubuntu-bionic-fortnebula-regionone-0011677831', Agent(admin_state_up=True,agent_type='L3 agent',availability_zone='nova',binary='neutron-l3-agent',configurations={"agent_mode": "dvr_snat", "ex_gw_ports": 7, "floating_ips": 4, "handle_internal_only_routers": true, "interface_driver": "openvswitch", "interfaces": 10, "log_agent_heartbeats": false, "routers": 7},created_at=2019-09-22T23:00:42Z,description=<?>,heartbeat_timestamp=2019-09-22T23:22:42Z,host='ubuntu-bionic-fortnebula-regionone-0011677831',id=21b82ccc-b613-4266-abad-fe3c55314b48,load=0,resource_versions=<?>,resources_synced=<?>,started_at=2019-09-22T23:00:42Z,topic='l3_agent'), ['47a50800-c88d-454b-bd42-4607da097fb9'], True)

Revision history for this message
Miguel Lavalle (minsel) wrote :

More detailed debugging notes:

http://paste.openstack.org/show/780568/

Revision history for this message
Miguel Lavalle (minsel) wrote :

Notes in #11 are for the successful case.

Revision history for this message
Miguel Lavalle (minsel) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/677098
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=12760c94c89f654885d0432a4da37110c112be1a
Submitter: Zuul
Branch: master

commit 12760c94c89f654885d0432a4da37110c112be1a
Author: Miguel Lavalle <email address hidden>
Date: Sun Aug 18 20:27:15 2019 -0500

    Adding LOG statements to debug 1838449

    Adding LOG statements to debug 1838449

    Change-Id: I6f9cbe2e6b4ea0f122f5f8318dbbc31fce6b61f4
    Related-Bug: #1838449

Revision history for this message
Miguel Lavalle (minsel) wrote :

The sequence of events that triggers this bug is the following:

1) A router R is under migration and resides in the same L3 agent that has two connected routers created by https://github.com/openstack/neutron-tempest-plugin/blob/0749b65aebf9a320f271f213f621a8888e17070d/neutron_tempest_plugin/scenario/test_connectivity.py#L66. R is not connected to those routers.

2) The migration scenario test sets R's admin_state_up to False prior to the migration: https://github.com/openstack/neutron-tempest-plugin/blob/0749b65aebf9a320f271f213f621a8888e17070d/neutron_tempest_plugin/scenario/test_migration.py#L134-L135

3) Previous point triggers a router update message being sent to the L3 agent, which sends a router R synch request to the Neutron server.

4) As part of the router R synch request processing, the Neutron server calls https://github.com/openstack/neutron-tempest-plugin/blob/0749b65aebf9a320f271f213f621a8888e17070d/neutron_tempest_plugin/scenario/test_migration.py#L134-L135. Note that these lines are going to add to the set of routers to be returned the two connected routers, because variable dvr_routers contains the ids of ALL THE ROUTERS in the agent due to these lines: https://github.com/openstack/neutron/blob/b7c3a99473349f5bcc678a35615e68f21a294208/neutron/db/l3_dvrscheduler_db.py#L429-L430

5) As a consequence, the L3 agent never deletes router R, because it receives data for two "connected" routers. This is an example:

Oct 01 16:03:00.263987 ubuntu-bionic-rax-iad-0012068845 neutron-l3-agent[18672]: DEBUG neutron.agent.l3.agent [-] Queued a router update for ff598bb6-0333-410a-b9e0-590acbe0e83a (related router e35280fd-2b22-4a3c-84b4-016fad2cd905). Original event action 3, priority 1. New event action 4, priority 0 {{(pid=18672) _process_routers_if_compatible /opt/stack/neutron/neutron/agent/l3/agent.py:759}}

Oct 01 16:03:00.263987 ubuntu-bionic-rax-iad-0012068845 neutron-l3-agent[18672]: DEBUG neutron.agent.l3.agent [-] Queued a router update for ffe2ade3-565b-4150-ab33-9fd549e15cea (related router e35280fd-2b22-4a3c-84b4-016fad2cd905). Original event action 3, priority 1. New event action 4, priority 0 {{(pid=18672) _process_routers_if_compatible /opt/stack/neutron/neutron/agent/l3/agent.py:759}}

Oct 01 16:03:00.263987 ubuntu-bionic-rax-iad-0012068845 neutron-l3-agent[18672]: INFO neutron.agent.l3.agent [-] Finished a router update for e35280fd-2b22-4a3c-84b4-016fad2cd905, update_id 75295435-080f-4112-b73b-aec711733bb8. Time elapsed: 22.606

6) The migration scenario test timeouts waiting for the router R's ports to be set to status DOWN: https://github.com/openstack/neutron-tempest-plugin/blob/0749b65aebf9a320f271f213f621a8888e17070d/neutron_tempest_plugin/scenario/test_migration.py#L136

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/691498

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/691498
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=48ea7da6c52ee14f0e9cc244fbc834283a8e74a7
Submitter: Zuul
Branch: master

commit 48ea7da6c52ee14f0e9cc244fbc834283a8e74a7
Author: Miguel Lavalle <email address hidden>
Date: Sat Oct 26 18:43:56 2019 -0500

    Router synch shouldn't return unrelated routers

    [0] introduced the concept of connected routers: routers that are
    connected to the same subnets. When a L3 agent is synching a router
    with connected routers, the data of the entire set should be returned
    to the agent by the Neutron server.

    However, if an agent tries to synch a router with
    no connected routers when the same agent has other routers that are
    connected among them, the Neutron server returns the former and the
    latter. For details of how this bug can manifest itself, please see [1].

    This change prevents this situation: only the synched router is
    returned.

    [0] https://review.opendev.org/#/c/597567
    [1] https://bugs.launchpad.net/neutron/+bug/1838449/comments/15

    Change-Id: Ibbf35d0f4a0bf9281f0bc8c411e8527eed75361d
    Closes-Bug: #1838449

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.0.0.0b1

This issue was fixed in the openstack/neutron 16.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/726297

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/726298

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/726300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/726301

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/726302

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/726303

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/726305

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/726306

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/726297
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c19b9d8ad6f3574f7b6657db52a5782b28e8739d
Submitter: Zuul
Branch: stable/train

commit c19b9d8ad6f3574f7b6657db52a5782b28e8739d
Author: Miguel Lavalle <email address hidden>
Date: Sun Aug 18 20:27:15 2019 -0500

    Adding LOG statements to debug 1838449

    Adding LOG statements to debug 1838449

    Change-Id: I6f9cbe2e6b4ea0f122f5f8318dbbc31fce6b61f4
    Related-Bug: #1838449
    (cherry picked from commit 12760c94c89f654885d0432a4da37110c112be1a)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/726300
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=90d17e4e857c62a788d862da3f5f32fd65806df8
Submitter: Zuul
Branch: stable/stein

commit 90d17e4e857c62a788d862da3f5f32fd65806df8
Author: Miguel Lavalle <email address hidden>
Date: Sun Aug 18 20:27:15 2019 -0500

    Adding LOG statements to debug 1838449

    Adding LOG statements to debug 1838449

    Change-Id: I6f9cbe2e6b4ea0f122f5f8318dbbc31fce6b61f4
    Related-Bug: #1838449
    (cherry picked from commit 12760c94c89f654885d0432a4da37110c112be1a)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/726301
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3ed429019b904aad32b03c262e642effb659351c
Submitter: Zuul
Branch: stable/stein

commit 3ed429019b904aad32b03c262e642effb659351c
Author: Miguel Lavalle <email address hidden>
Date: Sat Oct 26 18:43:56 2019 -0500

    Router synch shouldn't return unrelated routers

    [0] introduced the concept of connected routers: routers that are
    connected to the same subnets. When a L3 agent is synching a router
    with connected routers, the data of the entire set should be returned
    to the agent by the Neutron server.

    However, if an agent tries to synch a router with
    no connected routers when the same agent has other routers that are
    connected among them, the Neutron server returns the former and the
    latter. For details of how this bug can manifest itself, please see [1].

    This change prevents this situation: only the synched router is
    returned.

    [0] https://review.opendev.org/#/c/597567
    [1] https://bugs.launchpad.net/neutron/+bug/1838449/comments/15

    Change-Id: Ibbf35d0f4a0bf9281f0bc8c411e8527eed75361d
    Closes-Bug: #1838449
    (cherry picked from commit 48ea7da6c52ee14f0e9cc244fbc834283a8e74a7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/726302
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fc6b8fc40f0dde7d9f1ffdf19863815eda509e93
Submitter: Zuul
Branch: stable/rocky

commit fc6b8fc40f0dde7d9f1ffdf19863815eda509e93
Author: Miguel Lavalle <email address hidden>
Date: Sun Aug 18 20:27:15 2019 -0500

    Adding LOG statements to debug 1838449

    Adding LOG statements to debug 1838449

    Conflicts:
        neutron/db/l3_dvrscheduler_db.py

    Change-Id: I6f9cbe2e6b4ea0f122f5f8318dbbc31fce6b61f4
    Related-Bug: #1838449
    (cherry picked from commit 12760c94c89f654885d0432a4da37110c112be1a)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/726298
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=25efa34a8a939156694c4bb37ea2400105b21a6b
Submitter: Zuul
Branch: stable/train

commit 25efa34a8a939156694c4bb37ea2400105b21a6b
Author: Miguel Lavalle <email address hidden>
Date: Sat Oct 26 18:43:56 2019 -0500

    Router synch shouldn't return unrelated routers

    [0] introduced the concept of connected routers: routers that are
    connected to the same subnets. When a L3 agent is synching a router
    with connected routers, the data of the entire set should be returned
    to the agent by the Neutron server.

    However, if an agent tries to synch a router with
    no connected routers when the same agent has other routers that are
    connected among them, the Neutron server returns the former and the
    latter. For details of how this bug can manifest itself, please see [1].

    This change prevents this situation: only the synched router is
    returned.

    [0] https://review.opendev.org/#/c/597567
    [1] https://bugs.launchpad.net/neutron/+bug/1838449/comments/15

    Change-Id: Ibbf35d0f4a0bf9281f0bc8c411e8527eed75361d
    Closes-Bug: #1838449
    (cherry picked from commit 48ea7da6c52ee14f0e9cc244fbc834283a8e74a7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/726305
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=72909dffb1c089b751d87c1594fbe8eb75725849
Submitter: Zuul
Branch: stable/queens

commit 72909dffb1c089b751d87c1594fbe8eb75725849
Author: Miguel Lavalle <email address hidden>
Date: Sun Aug 18 20:27:15 2019 -0500

    Adding LOG statements to debug 1838449

    Adding LOG statements to debug 1838449

    Conflicts:
        neutron/api/rpc/handlers/l3_rpc.py
        neutron/db/l3_dvrscheduler_db.py

    Change-Id: I6f9cbe2e6b4ea0f122f5f8318dbbc31fce6b61f4
    Related-Bug: #1838449
    (cherry picked from commit 12760c94c89f654885d0432a4da37110c112be1a)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/726306
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=86efdd03ae14a5a187d79b14e07690e6741917f2
Submitter: Zuul
Branch: stable/queens

commit 86efdd03ae14a5a187d79b14e07690e6741917f2
Author: Miguel Lavalle <email address hidden>
Date: Sat Oct 26 18:43:56 2019 -0500

    Router synch shouldn't return unrelated routers

    [0] introduced the concept of connected routers: routers that are
    connected to the same subnets. When a L3 agent is synching a router
    with connected routers, the data of the entire set should be returned
    to the agent by the Neutron server.

    However, if an agent tries to synch a router with
    no connected routers when the same agent has other routers that are
    connected among them, the Neutron server returns the former and the
    latter. For details of how this bug can manifest itself, please see [1].

    This change prevents this situation: only the synched router is
    returned.

    [0] https://review.opendev.org/#/c/597567
    [1] https://bugs.launchpad.net/neutron/+bug/1838449/comments/15

    Change-Id: Ibbf35d0f4a0bf9281f0bc8c411e8527eed75361d
    Closes-Bug: #1838449
    (cherry picked from commit 48ea7da6c52ee14f0e9cc244fbc834283a8e74a7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/726303
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=57c86fc34fc92cb51eada86c73f66607fd9229ed
Submitter: Zuul
Branch: stable/rocky

commit 57c86fc34fc92cb51eada86c73f66607fd9229ed
Author: Miguel Lavalle <email address hidden>
Date: Sat Oct 26 18:43:56 2019 -0500

    Router synch shouldn't return unrelated routers

    [0] introduced the concept of connected routers: routers that are
    connected to the same subnets. When a L3 agent is synching a router
    with connected routers, the data of the entire set should be returned
    to the agent by the Neutron server.

    However, if an agent tries to synch a router with
    no connected routers when the same agent has other routers that are
    connected among them, the Neutron server returns the former and the
    latter. For details of how this bug can manifest itself, please see [1].

    This change prevents this situation: only the synched router is
    returned.

    [0] https://review.opendev.org/#/c/597567
    [1] https://bugs.launchpad.net/neutron/+bug/1838449/comments/15

    Change-Id: Ibbf35d0f4a0bf9281f0bc8c411e8527eed75361d
    Closes-Bug: #1838449
    (cherry picked from commit 48ea7da6c52ee14f0e9cc244fbc834283a8e74a7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron queens-eol

This issue was fixed in the openstack/neutron queens-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron rocky-eol

This issue was fixed in the openstack/neutron rocky-eol release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.