Connectivity to instance after L3 router migration from Legacy to HA fails

Bug #1785582 reported by Slawek Kaplonski
This bug affects 1 person
Affects   Status         Importance   Assigned to        Milestone
neutron   Fix Released   Medium       Slawek Kaplonski

Bug Description

The scenario test neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromLegacy.test_from_legacy_to_ha
fails because there is no connectivity to the VM after migration.
We observed it mostly on Pike, but I think the same issue may also exist in newer versions.

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_migration.py", line 68, in test_from_legacy_to_ha
    after_dvr=False, after_ha=True)
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_migration.py", line 55, in _test_migration
    self._check_connectivity()
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_dvr.py", line 29, in _check_connectivity
    self.keypair['private_key'])
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/base.py", line 204, in check_connectivity
    ssh_client.test_connection_auth()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.224 via SSH timed out.
User: cirros, Password: None

From my investigation, it looks like this is caused by a race between two different operations on the router:

1. The router is switched to admin_state down, so the port is also set to DOWN.
2. neutron-server gets info from the OVS agent that the port is down.
3. Meanwhile, another thread changes the router from legacy to HA, so the owner of this port changes from DEVICE_OWNER_ROUTER_INTF to DEVICE_OWNER_HA_REPLICATED_INT. The router is also still "on" this host (it is now a backup node for the router), so in https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/l2pop/mech_driver.py#L258 l2pop decides not to send remove_fdb_entries for this MAC address on this port, and the old entries stay on the other nodes. Later, when this port comes up on a different host (the new master node), add_fdb_entries is not sent to the hosts either, because of https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/l2pop/mech_driver.py#L307 which was added in https://github.com/openstack/neutron/commit/26d8702b9d7cc5a4293b97bc435fa85983be9f01
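The ordering problem in the steps above can be reproduced with a small toy model. This is only a sketch under simplifying assumptions: the function and variable names are made up for illustration and are not neutron APIs, the "router is still on this host" condition is assumed to hold for every HA interface, and the FDB is reduced to a single MAC-to-host table as seen by the other nodes.

```python
# Toy model of the race. All names are illustrative, not neutron APIs.
LEGACY_OWNER = "DEVICE_OWNER_ROUTER_INTF"
HA_OWNER = "DEVICE_OWNER_HA_REPLICATED_INT"
MAC = "fa:16:3e:00:00:01"


def simulate(migrate_before_down_is_processed):
    """Return the MAC -> host table that the other nodes end up with."""
    remote_fdb = {MAC: "host-1"}  # the legacy router port lives on host-1
    port = {"mac": MAC, "device_owner": LEGACY_OWNER, "host": "host-1"}

    def process_port_down():
        # Pre-fix behaviour as described in the report: for an HA interface
        # whose router is still on the host, no removal is sent.
        if port["device_owner"] == HA_OWNER:
            return
        remote_fdb.pop(port["mac"], None)

    def migrate_to_ha():
        port["device_owner"] = HA_OWNER

    def process_port_up(new_host):
        port["host"] = new_host
        # Per the report, add_fdb_entries is not sent for this port either.
        if port["device_owner"] == HA_OWNER:
            return
        remote_fdb[port["mac"]] = new_host

    if migrate_before_down_is_processed:
        migrate_to_ha()        # owner changes first (the race)
        process_port_down()    # removal is skipped
    else:
        process_port_down()    # removal is sent while still a legacy port
        migrate_to_ha()
    process_port_up("host-2")  # host-2 becomes the new master
    return remote_fdb


print(simulate(True))   # stale entry still points at host-1
print(simulate(False))  # no stale unicast entry is left
```

In the racy ordering the other nodes keep a unicast entry pointing at the old host, which matches the SSH timeout seen in the test.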

I tried running this test with a wait until the router's port is really DOWN before calling the migration to HA, and then it passed 151 times for me. So it clearly shows that this race is the issue here.
I think it should be fixed in neutron's code rather than in the test, as this isn't a test-only issue.
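For reference, the kind of wait used in that experiment can be sketched as a simple polling loop. This is a hypothetical helper, not the code from the test change: `get_port_statuses` stands in for whatever call lists the router's port statuses, and the timeout and interval values are arbitrary.

```python
import time


def wait_for_ports_down(get_port_statuses, timeout=60.0, interval=1.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until every status returned by get_port_statuses() is DOWN.

    get_port_statuses is a hypothetical callable that returns an iterable
    of status strings, one per router port. Returns True once all ports
    are DOWN, or False if the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        statuses = list(get_port_statuses())
        if all(status == "DOWN" for status in statuses):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would call this after setting admin_state_up=False and only start the migration if it returns True.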

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/589412

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/589885

Changed in neutron:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/589885
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6c300b1a9b3f0db82b4edd84eda74600d28b7185
Submitter: Zuul
Branch: master

commit 6c300b1a9b3f0db82b4edd84eda74600d28b7185
Author: Slawek Kaplonski <email address hidden>
Date: Wed Aug 8 14:52:06 2018 +0200

    Remove fdb entries for ha router interfaces when going DOWN

    When HA router's interface on host is going DOWN but router
    is still available on this host, L2 population
    mechanism driver will now send to other hosts info to remove
    fdb unicast entries to this port on host.

    It will not send FLOODING_ENTRY because this port is still on
    host but in standby mode and might be transformed to master
    in future.

    This solves issue with migration router from Legacy to HA.
    In such case, port which was originally attached to legacy
    router is transformed to be HA backup port before changing
    its status to DOWN.
    Now in such case unicast entries to this port and backup
    node will be removed properly so packets to HA router will
    be really send to host which is master node for router.

    Closes-Bug: #1785582

    Change-Id: Icc14e5f5d40fc6fbb49e0f7b18cc3b15ebec8508
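The behaviour described in this commit message can be illustrated with a simplified selection function. This is a sketch, not the merged patch: the function name and its arguments are hypothetical, FDB entries are reduced to (MAC, IP) tuples, and a non-HA port is assumed to be the last port on its host so that its flooding entry is removed as well.

```python
# Placeholder (MAC, IP) pair standing in for the flooding entry.
FLOODING_ENTRY = ("00:00:00:00:00:00", "0.0.0.0")


def entries_to_remove(port_entries, is_ha_interface, router_still_on_host):
    """Pick the FDB entries other hosts should drop when a port goes DOWN."""
    unicast = [entry for entry in port_entries if entry != FLOODING_ENTRY]
    if is_ha_interface and router_still_on_host:
        # The port stays on the host in standby mode and might become
        # master later, so only the unicast entries are removed.
        return unicast
    # Simplification: treat the port as the last one on its host, so the
    # flooding entry is removed together with the unicast entries.
    return list(port_entries)


entries = [FLOODING_ENTRY, ("fa:16:3e:00:00:01", "10.0.0.1")]
print(entries_to_remove(entries, True, True))
print(entries_to_remove(entries, False, False))
```

For the HA standby case only the unicast entry is selected, so the flooding entry for the host stays in place.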

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/596680

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/596681

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/596682

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/596685

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/596681
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=23154202276385d2386cd6c4d113104ec422c9a1
Submitter: Zuul
Branch: stable/queens

commit 23154202276385d2386cd6c4d113104ec422c9a1
Author: Slawek Kaplonski <email address hidden>
Date: Wed Aug 8 14:52:06 2018 +0200

    Remove fdb entries for ha router interfaces when going DOWN

    When HA router's interface on host is going DOWN but router
    is still available on this host, L2 population
    mechanism driver will now send to other hosts info to remove
    fdb unicast entries to this port on host.

    It will not send FLOODING_ENTRY because this port is still on
    host but in standby mode and might be transformed to master
    in future.

    This solves issue with migration router from Legacy to HA.
    In such case, port which was originally attached to legacy
    router is transformed to be HA backup port before changing
    its status to DOWN.
    Now in such case unicast entries to this port and backup
    node will be removed properly so packets to HA router will
    be really send to host which is master node for router.

    Closes-Bug: #1785582

    Change-Id: Icc14e5f5d40fc6fbb49e0f7b18cc3b15ebec8508
    (cherry picked from commit 6c300b1a9b3f0db82b4edd84eda74600d28b7185)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/596680
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7908ca6c5872b2d4d5b7bbdc6033ecbf97c3d30c
Submitter: Zuul
Branch: stable/rocky

commit 7908ca6c5872b2d4d5b7bbdc6033ecbf97c3d30c
Author: Slawek Kaplonski <email address hidden>
Date: Wed Aug 8 14:52:06 2018 +0200

    Remove fdb entries for ha router interfaces when going DOWN

    When HA router's interface on host is going DOWN but router
    is still available on this host, L2 population
    mechanism driver will now send to other hosts info to remove
    fdb unicast entries to this port on host.

    It will not send FLOODING_ENTRY because this port is still on
    host but in standby mode and might be transformed to master
    in future.

    This solves issue with migration router from Legacy to HA.
    In such case, port which was originally attached to legacy
    router is transformed to be HA backup port before changing
    its status to DOWN.
    Now in such case unicast entries to this port and backup
    node will be removed properly so packets to HA router will
    be really send to host which is master node for router.

    Closes-Bug: #1785582

    Change-Id: Icc14e5f5d40fc6fbb49e0f7b18cc3b15ebec8508
    (cherry picked from commit 6c300b1a9b3f0db82b4edd84eda74600d28b7185)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/596685
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5c71b598daf1872859a12755eb97576dc3a81ff5
Submitter: Zuul
Branch: stable/ocata

commit 5c71b598daf1872859a12755eb97576dc3a81ff5
Author: Slawek Kaplonski <email address hidden>
Date: Wed Aug 8 14:52:06 2018 +0200

    Remove fdb entries for ha router interfaces when going DOWN

    When HA router's interface on host is going DOWN but router
    is still available on this host, L2 population
    mechanism driver will now send to other hosts info to remove
    fdb unicast entries to this port on host.

    It will not send FLOODING_ENTRY because this port is still on
    host but in standby mode and might be transformed to master
    in future.

    This solves issue with migration router from Legacy to HA.
    In such case, port which was originally attached to legacy
    router is transformed to be HA backup port before changing
    its status to DOWN.
    Now in such case unicast entries to this port and backup
    node will be removed properly so packets to HA router will
    be really send to host which is master node for router.

    Closes-Bug: #1785582

    Change-Id: Icc14e5f5d40fc6fbb49e0f7b18cc3b15ebec8508
    (cherry picked from commit 6c300b1a9b3f0db82b4edd84eda74600d28b7185)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/589412
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f41287b6c895fd1bc666e87701007a882a691662
Submitter: Zuul
Branch: stable/pike

commit f41287b6c895fd1bc666e87701007a882a691662
Author: Slawek Kaplonski <email address hidden>
Date: Tue Aug 7 11:47:11 2018 +0200

    Wait until all router ports are DOWN before migration

    In router migration tests, before migration is started, router
    is set to admin_state_up=False. This cause that status of all
    router ports should be set to DOWN.
    This patch adds check (and wait) that all ports are really set
    to DOWN state before migration of router is started.

    Change-Id: I93db6f67a74f753eaad0900e8045d4676dd1337c
    Related-Bug: #1785582

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.1

This issue was fixed in the openstack/neutron 13.0.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/596682
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5be39509655198e152be835218deef6e9dde05db
Submitter: Zuul
Branch: stable/pike

commit 5be39509655198e152be835218deef6e9dde05db
Author: Slawek Kaplonski <email address hidden>
Date: Wed Aug 8 14:52:06 2018 +0200

    Remove fdb entries for ha router interfaces when going DOWN

    When HA router's interface on host is going DOWN but router
    is still available on this host, L2 population
    mechanism driver will now send to other hosts info to remove
    fdb unicast entries to this port on host.

    It will not send FLOODING_ENTRY because this port is still on
    host but in standby mode and might be transformed to master
    in future.

    This solves issue with migration router from Legacy to HA.
    In such case, port which was originally attached to legacy
    router is transformed to be HA backup port before changing
    its status to DOWN.
    Now in such case unicast entries to this port and backup
    node will be removed properly so packets to HA router will
    be really send to host which is master node for router.

    Closes-Bug: #1785582

    Change-Id: Icc14e5f5d40fc6fbb49e0f7b18cc3b15ebec8508
    (cherry picked from commit 6c300b1a9b3f0db82b4edd84eda74600d28b7185)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.4

This issue was fixed in the openstack/neutron 12.0.4 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.6

This issue was fixed in the openstack/neutron 11.0.6 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0b1

This issue was fixed in the openstack/neutron 14.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.
