addFloatingIp action can fail if re-associating a floating IP to an instance in another cell

Bug #1826472 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Matt Riedemann

Bug Description

Ran into this with the new nova-multi-cell job:

http://logs.openstack.org/22/655222/3/check/nova-multi-cell/c1ba7e0/controller/logs/screen-n-api.txt.gz#_Apr_25_20_35_15_249786

Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips [None req-3b6da167-cd81-4f88-a27b-05876cf4e6f4 tempest-FloatingIPsAssociationTestJSON-110201909 tempest-FloatingIPsAssociationTestJSON-110201909] Unable to associate floating IP 172.24.5.15 to fixed IP 10.1.0.10 for instance be0ea845-25df-4528-8fcf-c27835f41636. Error: Instance 85ce6664-7b76-460a-88fe-5fd05b7c34bc could not be found.: nova.exception.InstanceNotFound: Instance 85ce6664-7b76-460a-88fe-5fd05b7c34bc could not be found.
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips Traceback (most recent call last):
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/api/openstack/compute/floating_ips.py", line 266, in _add_floating_ip
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips fixed_address=fixed_address)
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/network/base_api.py", line 83, in wrapper
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips res = f(self, context, *args, **kwargs)
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/network/neutronv2/api.py", line 2401, in associate_floating_ip
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips orig_instance_uuid)
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/usr/local/lib/python3.6/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips result = fn(cls, context, *args, **kwargs)
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/objects/instance.py", line 505, in get_by_uuid
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips use_slave=use_slave)
Apr 25 20:35:15.249786 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 213, in wrapper
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips return f(*args, **kwargs)
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/objects/instance.py", line 497, in _db_instance_get_by_uuid
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips columns_to_join=columns_to_join)
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/db/api.py", line 758, in instance_get_by_uuid
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips return IMPL.instance_get_by_uuid(context, uuid, columns_to_join)
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 171, in wrapper
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips return f(*args, **kwargs)
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 258, in wrapped
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips return f(context, *args, **kwargs)
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1827, in instance_get_by_uuid
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips columns_to_join=columns_to_join)
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1836, in _instance_get_by_uuid
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips raise exception.InstanceNotFound(instance_id=uuid)
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips nova.exception.InstanceNotFound: Instance 85ce6664-7b76-460a-88fe-5fd05b7c34bc could not be found.
Apr 25 20:35:15.251179 ubuntu-bionic-rax-ord-0005726254 <email address hidden>[19571]: ERROR nova.api.openstack.compute.floating_ips

Due to this old code:

https://github.com/openstack/nova/blob/a991980863f056323c1ee9fd6a46dbc4cb899eca/nova/network/neutronv2/api.py#L2400

https://review.opendev.org/#/c/33054/

The idea there is that if you're re-associating a floating IP from one server to another, we try to refresh the network info cache on the instance that previously had the floating IP. Otherwise that cache is stale, and you have 2 instances reporting they have the same floating IP until the cache is cleaned up.

The problem is that in a multi-cell environment the servers could be in different cells. The context is targeted at the cell of the instance we're assigning the floating IP *to*, so it will fail to look up the instance in the other cell that the floating IP is coming *from* (the original instance).

This results in a 400 error in the API (I'm surprised it's not returning a 404 or 500 actually):

https://github.com/openstack/nova/blob/a991980863f056323c1ee9fd6a46dbc4cb899eca/nova/api/openstack/compute/floating_ips.py#L282
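
The failure mode can be illustrated with a small stand-alone model. This is only a sketch, not nova code: the `CELLS` dict, the `RequestContext` class, the cell names, and the `get_instance_by_uuid` and `lookup_fails` helpers are all hypothetical stand-ins for per-cell databases and a cell-targeted context. Only the two instance UUIDs are taken from the log above, and which cell each one lives in is an assumption.

```python
class InstanceNotFound(Exception):
    """Stand-in for nova.exception.InstanceNotFound."""


# Hypothetical model: each cell has its own instance table, keyed by UUID.
CELLS = {
    "cell1": {"be0ea845-25df-4528-8fcf-c27835f41636": {"name": "new-server"}},
    "cell2": {"85ce6664-7b76-460a-88fe-5fd05b7c34bc": {"name": "orig-server"}},
}


class RequestContext:
    """Hypothetical request context that is targeted at exactly one cell."""

    def __init__(self, cell_name):
        self.cell_name = cell_name


def get_instance_by_uuid(context, uuid):
    """Look up an instance only in the cell the context is targeted at."""
    try:
        return CELLS[context.cell_name][uuid]
    except KeyError:
        raise InstanceNotFound(f"Instance {uuid} could not be found.")


def lookup_fails(context, uuid):
    """Return True if the lookup raises InstanceNotFound for this context."""
    try:
        get_instance_by_uuid(context, uuid)
    except InstanceNotFound:
        return True
    return False
```

With a context targeted at the cell of the new instance, the lookup of the original instance fails even though that instance exists in the other cell.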

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm not really sure what we should do in this case. We could capture the InstanceNotFound here:

https://github.com/openstack/nova/blob/a991980863f056323c1ee9fd6a46dbc4cb899eca/nova/network/neutronv2/api.py#L2400

And arguably should have been doing that anyway, but then do we ignore it as a best-effort kind of thing? That would mean the info cache on the original instance (if it does exist in another cell) would be stale until the _heal_instance_info_cache periodic runs in that other cell.

We could handle the InstanceNotFound and attempt to look up the InstanceMapping for that instance and, if found, target the cell it's in to get the instance and refresh its cache. It looks like that should work in the neutron case because associate_floating_ip is only called from the API, so it would have access to the API DB to look up the InstanceMapping.
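
A rough, self-contained sketch of that second option follows. It does not use nova itself: `CELL_DBS`, `INSTANCE_MAPPINGS`, `Context`, `target_cell`, `get_instance`, and `find_instance_any_cell` are hypothetical names that only model the idea of per-cell databases plus an API-level mapping from instance UUID to cell. The actual fix may be structured differently.

```python
from contextlib import contextmanager


class InstanceNotFound(Exception):
    """Stand-in for nova.exception.InstanceNotFound."""


# Hypothetical per-cell instance tables.
CELL_DBS = {
    "cell1": {"inst-a": {"uuid": "inst-a", "cell": "cell1"}},
    "cell2": {"inst-b": {"uuid": "inst-b", "cell": "cell2"}},
}

# Hypothetical API-level mapping of instance UUID -> cell name.
# "inst-gone" models an instance whose mapping exists but whose
# record has been deleted from the cell.
INSTANCE_MAPPINGS = {"inst-a": "cell1", "inst-b": "cell2", "inst-gone": "cell2"}


class Context:
    """Hypothetical request context targeted at one cell."""

    def __init__(self, cell):
        self.cell = cell


@contextmanager
def target_cell(context, cell):
    """Yield a new context targeted at another cell; the original is untouched."""
    yield Context(cell)


def get_instance(context, uuid):
    """Look up an instance only in the cell the context is targeted at."""
    try:
        return CELL_DBS[context.cell][uuid]
    except KeyError:
        raise InstanceNotFound(uuid)


def find_instance_any_cell(context, uuid):
    """Try the current cell first, then fall back to the mapped cell.

    Returns None if the instance cannot be found anywhere, so the caller
    can skip the cache refresh instead of failing the association.
    """
    try:
        return get_instance(context, uuid)
    except InstanceNotFound:
        cell = INSTANCE_MAPPINGS.get(uuid)
        if cell is None:
            return None  # no mapping: nothing to refresh
        with target_cell(context, cell) as cctxt:
            try:
                return get_instance(cctxt, uuid)
            except InstanceNotFound:
                return None  # mapping exists but the instance is gone
```

Returning `None` rather than raising is a design choice in this sketch: refreshing the original instance's cache is secondary to the association itself, so a missing instance is treated as "nothing to do".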

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/656594

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/678301

Changed in nova:
assignee: Matt Riedemann (mriedem) → Dan Smith (danms)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/678301
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9cbedea0bf5535fc45d78c80e6d7499512e4c5f5
Submitter: Zuul
Branch: master

commit 9cbedea0bf5535fc45d78c80e6d7499512e4c5f5
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 23 16:03:41 2019 -0400

    Trap and log errors from _update_inst_info_cache_for_disassociated_fip

    When associating a floating IP to instance A and the floating IP is
    already associated to instance B, the associate_floating_ip method
    updates the floating IP to be associated with instance A then tries
    to update the network info cache for instance B. That network info
    cache update is best effort and could fail in different ways, e.g.
    the original associated port or instance could be gone. Failing to
    refresh the cache for instance B should not affect the association
    operation for instance A, so this change traps any errors during the
    refresh and just logs them as a warning.

    Change-Id: Ib5a44e4fd2ec2bf43b761db29403810d8b730429
    Related-Bug: #1826472
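
The best-effort behaviour described in this commit message can be sketched as below. This is an illustration under stated assumptions, not the merged code: the floating IP is modelled as a plain dict, and `associate_floating_ip` here takes two hypothetical callables, `update_fip` and `refresh_cache`, in place of the real neutron and info cache calls.

```python
import logging

LOG = logging.getLogger(__name__)


def associate_floating_ip(fip, new_instance, update_fip, refresh_cache):
    """Associate fip with new_instance, then refresh the old owner's cache.

    The refresh is best effort: any error is logged as a warning and does
    not affect the association that has already been made.
    """
    orig_instance = fip.get("instance")
    # The association itself is not wrapped; its errors still propagate.
    update_fip(fip, new_instance)
    if orig_instance and orig_instance != new_instance:
        try:
            refresh_cache(orig_instance)
        except Exception as exc:
            LOG.warning(
                "Failed to refresh network info cache for instance %s "
                "after disassociating floating IP %s: %s",
                orig_instance, fip.get("address"), exc)
    return fip
```

Only the cache refresh sits inside the `try` block, so a failure to refresh instance B cannot turn a successful association for instance A into an API error.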

Changed in nova:
assignee: Dan Smith (danms) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/656594
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=481cb5ce04e220c26e5772f4253d63e212adca45
Submitter: Zuul
Branch: master

commit 481cb5ce04e220c26e5772f4253d63e212adca45
Author: Matt Riedemann <email address hidden>
Date: Tue Apr 30 17:45:32 2019 -0400

    Find instance in another cell during floating IP re-association

    When associating a floating IP to instance A but it was already
    associated with instance B, we try to refresh the info cache on
    instance B. The problem is the context is targeted to the cell
    for instance A and instance B might be in another cell, so we'll
    get an InstanceNotFound error trying to lookup instance B.

    This change tries to find the instance in another cell using its
    instance mapping, and makes the code a bit more graceful if
    instance B is deleted.

    Change-Id: I71790afd0784d98050ccd7cc0e046321da249cbe
    Closes-Bug: #1826472

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/682182

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/682183

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/682182
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=38fc7f6f1602640b8e8c7b9cfa3e48766a5c5b95
Submitter: Zuul
Branch: stable/stein

commit 38fc7f6f1602640b8e8c7b9cfa3e48766a5c5b95
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 23 16:03:41 2019 -0400

    Trap and log errors from _update_inst_info_cache_for_disassociated_fip

    When associating a floating IP to instance A and the floating IP is
    already associated to instance B, the associate_floating_ip method
    updates the floating IP to be associated with instance A then tries
    to update the network info cache for instance B. That network info
    cache update is best effort and could fail in different ways, e.g.
    the original associated port or instance could be gone. Failing to
    refresh the cache for instance B should not affect the association
    operation for instance A, so this change traps any errors during the
    refresh and just logs them as a warning.

    Change-Id: Ib5a44e4fd2ec2bf43b761db29403810d8b730429
    Related-Bug: #1826472
    (cherry picked from commit 9cbedea0bf5535fc45d78c80e6d7499512e4c5f5)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/682183
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4adf563da0d7d8421df074b51773e3d7c65ae4a7
Submitter: Zuul
Branch: stable/stein

commit 4adf563da0d7d8421df074b51773e3d7c65ae4a7
Author: Matt Riedemann <email address hidden>
Date: Tue Apr 30 17:45:32 2019 -0400

    Find instance in another cell during floating IP re-association

    When associating a floating IP to instance A but it was already
    associated with instance B, we try to refresh the info cache on
    instance B. The problem is the context is targeted to the cell
    for instance A and instance B might be in another cell, so we'll
    get an InstanceNotFound error trying to lookup instance B.

    This change tries to find the instance in another cell using its
    instance mapping, and makes the code a bit more graceful if
    instance B is deleted.

    Conflicts:
          devstack/nova-multi-cell-blacklist.txt

    NOTE(mriedem): The conflict is because the nova-multi-cell
    job does not exist in Stein.

    Change-Id: I71790afd0784d98050ccd7cc0e046321da249cbe
    Closes-Bug: #1826472
    (cherry picked from commit 481cb5ce04e220c26e5772f4253d63e212adca45)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.3

This issue was fixed in the openstack/nova 19.0.3 release.
