Port device owner isn't updated with new host availability zone during unshelve

Bug #1759924 reported by Mike Lowe on 2018-03-29
This bug affects 2 people
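+--------------------------+--------+------------+----------------+-----------+
| Affects                  | Status | Importance | Assigned to    | Milestone |
+--------------------------+--------+------------+----------------+-----------+
| OpenStack Compute (nova) |        | Medium     | Matt Riedemann |           |
| Ocata                    |        | Low        | Matt Riedemann |           |
| Pike                     |        | Low        | Matt Riedemann |           |
| Queens                   |        | Low        | Matt Riedemann |           |
| Rocky                    |        | Low        | Matt Riedemann |           |
+--------------------------+--------+------------+----------------+-----------+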

Bug Description

During an unshelve, the host for an instance, and therefore its availability zone, may change, but the change does not seem to be reflected in the port's device_owner. This causes problems with, for example, the add fixed IP server action.

In nova/network/neutronv2/api.py, _update_port_binding_for_instance should probably update the port's device_owner the same way that _update_ports_for_instance does.

+-----------------------+--------------------------------------+
| Field | Value |
+-----------------------+--------------------------------------+
| admin_state_up | UP |
| allowed_address_pairs | |
| binding_host_id | r02c4b15 |
| binding_profile | |
| binding_vif_details | port_filter='True' |
| binding_vif_type | bridge |
| binding_vnic_type | normal |
| created_at | 2018-03-05T13:25:48Z |
| data_plane_status | None |
| description | |
| device_id | 53f04bf3-eb1f-4c64-a70f-fd16d6c1a5af |
| device_owner | compute:zone-r7 |
| dns_assignment | |
| dns_name | instance-w-volume-shelving-test |
| extra_dhcp_opts | |
| fixed_ips | |
| id | 327b891f-1820-4aa9-bbc3-fe9cc619eac3 |
| ip_address | None |
| mac_address | fa:16:3e:14:21:d1 |
| name | |
| network_id | e73b1699-0129-4c12-b722-e6ce52604824 |
| option_name | None |
| option_value | None |
| port_security_enabled | False |
| project_id | ecf32b152563403bbde297f58f4637d4 |
| qos_policy_id | None |
| revision_number | 19 |
| security_group_ids | bb25a73a-a62e-4015-9595-16add6b7d3a0 |
| status | ACTIVE |
| subnet_id | None |
| tags | |
| trunk_details | None |
| updated_at | 2018-03-28T20:03:23Z |
+-----------------------+--------------------------------------+

nova show 53f04bf3-eb1f-4c64-a70f-fd16d6c1a5af
+--------------------------------------+----------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | zone-r2 |
| OS-EXT-SRV-ATTR:host | r02c4b15 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | r02c4b15

Matt Riedemann (mriedem) on 2018-03-29
Changed in nova:
status: New → Confirmed
status: Confirmed → Triaged
importance: Undecided → Medium
tags: added: neutr shelve
tags: added: neutron
removed: neutr
Matt Riedemann (mriedem) wrote :

I think this was the scenario described in IRC for this bug:

1. launch instance, it's in az zone-r7
2. shelve the instance
3. unshelve the instance - the scheduler now puts it in zone-r2
4. add a fixed IP to the instance

This fails because the addFixedIP flow looks up ports for the instance by its current zone and finds none, since the ports still have the old zone in their device_owner field.
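A minimal sketch of why that lookup comes back empty, assuming the ports are matched on a device_owner of the form compute:<az> (the format seen in the port output above). The helper name find_instance_ports and the port dictionaries are illustrative only and are not nova's actual code; the IDs and zone names are the ones from this report.

```python
def find_instance_ports(ports, instance_id, availability_zone):
    """Return ports attached to the instance whose device_owner matches its AZ.

    Hypothetical helper: it mirrors the kind of filter described above,
    not nova's real implementation.
    """
    expected_owner = "compute:%s" % availability_zone
    return [
        port for port in ports
        if port["device_id"] == instance_id
        and port["device_owner"] == expected_owner
    ]


# Values taken from the report: device_owner still names the old zone.
instance_id = "53f04bf3-eb1f-4c64-a70f-fd16d6c1a5af"
ports = [{
    "id": "327b891f-1820-4aa9-bbc3-fe9cc619eac3",
    "device_id": instance_id,
    "device_owner": "compute:zone-r7",
}]

# Looking up by the zone the instance was created in finds the port.
found_in_old_zone = find_instance_ports(ports, instance_id, "zone-r7")
# Looking up by the zone the instance was unshelved into finds nothing.
found_in_new_zone = find_instance_ports(ports, instance_id, "zone-r2")
print(len(found_in_old_zone), len(found_in_new_zone))
```

The port is still attached to the instance (device_id matches), so the empty result comes only from the stale zone in device_owner.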

Looking at shelve offload, we could clean up the device_owner and port binding information here:

https://github.com/openstack/nova/blob/144d621397c6a4065dec9773dc7441d9badc8f03/nova/compute/manager.py#L4837

That cleanup_instance_network_on_host method is currently just a no-op when using neutron.

If we look at unshelve, we call this method to update the port bindings for the host to point at the new host:

https://github.com/openstack/nova/blob/144d621397c6a4065dec9773dc7441d9badc8f03/nova/compute/manager.py#L4951

That gets into this code to change the port's binding:host_id value to the new host after the unshelve:

https://github.com/openstack/nova/blob/144d621397c6a4065dec9773dc7441d9badc8f03/nova/network/neutronv2/api.py#L2576

At that point, we likely should also change the device_owner on the port, like is done here:

https://github.com/openstack/nova/blob/144d621397c6a4065dec9773dc7441d9badc8f03/nova/network/neutronv2/api.py#L1031-L1033
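Roughly, the idea is to set device_owner in the same port update that changes binding:host_id. The sketch below is an assumption about the shape of that change and not the actual patch: build_port_binding_update is a made-up helper, and "old-host" is a made-up host name standing in for wherever the instance ran before it was shelved.

```python
BINDING_HOST_ID = "binding:host_id"


def build_port_binding_update(port, host, availability_zone):
    """Return the fields to change on a port after the instance moved.

    Hypothetical helper, not the merged patch: if the binding host is
    stale, the AZ may have changed too, so device_owner is refreshed in
    the same update.
    """
    updates = {}
    if port.get(BINDING_HOST_ID) != host:
        updates[BINDING_HOST_ID] = host
        updates["device_owner"] = "compute:%s" % availability_zone
    return updates


# "old-host" is a placeholder; the zone and new host come from this report.
port = {BINDING_HOST_ID: "old-host", "device_owner": "compute:zone-r7"}
updates = build_port_binding_update(port, "r02c4b15", "zone-r2")
print(updates)
```

Tying the device_owner change to the host change keeps the update a no-op when the port is already bound to the right host.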

Matt Riedemann (mriedem) wrote :

Also note that when we shelve offload, we aren't doing any of this cleanup:

https://github.com/openstack/nova/blob/144d621397c6a4065dec9773dc7441d9badc8f03/nova/network/neutronv2/api.py#L553-L561

So if the instance has PCI devices associated with the port, or a DNS name in Designate, that's not getting cleaned up. We don't have to make the shelve offload cleanup changes in the same patch that fixes this bug on unshelve, but it's likely something we should do in relation to this bug. I haven't yet gotten an answer to this question from the neutron team:

(12:53:38 PM) mriedem: random question: would anything bad happen if nova leaves a port attached to an instance (attached = port.device_id == server.id), but clears the port's binding:host_id and device_owner?
(12:54:06 PM) mriedem: nova would also be unplugging vifs from the host when doing this

Matt Riedemann (mriedem) wrote :

I think I should be able to reproduce this with a single-host devstack by doing something like:

1. create a host aggregate / AZ with the single host in it
2. create VM, assert the port has the device_owner with the AZ name in it and it matches the instance AZ
3. shelve (offload) the VM, assert the port's device_owner isn't changed even though the port's host binding details are cleared
4. change the host aggregate / AZ name to something new
5. unshelve the VM
6. assert the port's host binding information is updated but the device_owner still points at the old AZ name (but the instance AZ is updated)
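The assertions in steps 2 and 6 could be written along these lines. This is only a sketch: stale_device_owner_ports is a hypothetical helper that works on plain dictionaries shaped like the "nova show" and port output earlier in this report, and it does not call any OpenStack API.

```python
def stale_device_owner_ports(server, ports):
    """Return IDs of the server's ports whose device_owner names another AZ.

    Hypothetical check: only compute-owned ports attached to this server
    are compared against the server's current availability zone.
    """
    expected = "compute:%s" % server["OS-EXT-AZ:availability_zone"]
    return [
        port["id"] for port in ports
        if port["device_id"] == server["id"]
        and port["device_owner"].startswith("compute:")
        and port["device_owner"] != expected
    ]


# State after unshelve, using the values from the bug description.
server = {
    "id": "53f04bf3-eb1f-4c64-a70f-fd16d6c1a5af",
    "OS-EXT-AZ:availability_zone": "zone-r2",
}
port = {
    "id": "327b891f-1820-4aa9-bbc3-fe9cc619eac3",
    "device_id": "53f04bf3-eb1f-4c64-a70f-fd16d6c1a5af",
    "device_owner": "compute:zone-r7",
}

stale = stale_device_owner_ports(server, [port])
print(stale)
```

An empty list would correspond to the state expected in step 2; a non-empty list is the condition step 6 is meant to catch.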

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) wrote :

Based on the steps in comment 3, I have a recreate with steps and output here:

http://paste.openstack.org/show/718769/

Matt Riedemann (mriedem) wrote :

Per comment 2 on cleaning up the port when we shelve offload, it sounds like that should be OK to do:

(2:30:17 PM) mriedem: mlavalle: while working on bug 1759924 - i'm trying to determine if neutron would have any issues with a port attached to a vm where the binding host and device_owner are unset?
(2:30:19 PM) openstack: bug 1759924 in OpenStack Compute (nova) "Port device owner isn't updated with new host availability zone during unshelve" [Medium,Triaged] https://launchpad.net/bugs/1759924 - Assigned to Matt Riedemann (mriedem)
(2:30:44 PM) mriedem: can you think of anything off the top of your head that might break there?
(2:30:58 PM) mriedem: this is when an instance is shelved, meaning there is no actual guest running in a hypervisor
(2:34:07 PM) mlavalle: mriedem: in that situation the port is not bound so it doesn't exist in a host. Am I understanding correctly?
(2:34:44 PM) mriedem: correct, nova would unplug the vif
(2:34:45 PM) mriedem: during shelve
(2:34:53 PM) mriedem: the port is still logically attached to the instance in the db
(2:34:58 PM) mriedem: but there is nothing in the data plane
(2:35:17 PM) mlavalle: I don't think it would be a problem. it's just an unbound port, a row in a DB table
(2:35:19 PM) openstackgerrit: Merged openstack/networking-bagpipe master: bagpipe-bgp: IPVPN MPLS OVS driver, silently ignore re-removal https://review.openstack.org/559658
(2:35:25 PM) mriedem: mlavalle: ok, cool, thanks

But it could be done separately from the unshelve fix to update the device_owner.
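For reference, the shelve offload cleanup discussed above would amount to a port update that clears binding:host_id and device_owner while leaving device_id alone. The sketch below is an assumption about what that could look like, not existing nova code: both function names are made up, and using None and an empty string as the cleared values is a guess.

```python
def build_shelve_offload_port_update():
    """Return a port update body that unbinds the port but keeps it attached.

    Hypothetical sketch: device_id is deliberately not included, so the
    port stays logically attached to the instance.
    """
    return {"port": {"binding:host_id": None, "device_owner": ""}}


def apply_port_update(port, body):
    """Apply an update body to a local copy of a port dictionary."""
    updated = dict(port)
    updated.update(body["port"])
    return updated


# Port values based on the output in the bug description.
port = {
    "device_id": "53f04bf3-eb1f-4c64-a70f-fd16d6c1a5af",
    "binding:host_id": "r02c4b15",
    "device_owner": "compute:zone-r2",
}
unbound = apply_port_update(port, build_shelve_offload_port_update())
print(unbound)
```

This matches the state described in the IRC exchange: an unbound port that is still a row in the DB with the instance's ID on it.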

Fix proposed to branch: master
Review: https://review.openstack.org/559828

Changed in nova:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/559828
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=93802619adde69bf84d26d7231480abb4da07c91
Submitter: Zuul
Branch: master

commit 93802619adde69bf84d26d7231480abb4da07c91
Author: Matt Riedemann <email address hidden>
Date: Mon Apr 9 16:12:19 2018 -0400

    Update port device_owner when unshelving

    When we shelve offload an instance, we unplug VIFs, delete
    the guest from the compute host, etc. The instance is still
    logically attached to its ports but they aren't connected
    on any host.

    When we unshelve an instance, it is scheduled and created
    on a potentially new host, in a potentially new availability
    zone. During unshelve, the compute manager will call the
    setup_instance_network_on_host() method to update the port
    host binding information for the new host, but was not
    updating the device_owner, which reflects the availability
    zone that the instance is in. Because of this, an instance
    can be created in az1, shelved, and then unshelved in az2
    but the port device_owner still says az1 even though the
    port host binding is for a compute host in az2.

    This change simply updates the port device_owner when
    updating the port binding host during unshelve.

    A TODO is left in the cleanup_instance_network_on_host()
    method which is called during shelve offload but is currently
    not implemented. We should unbind ports when shelve offloading,
    but that is a bit of a bigger change and left for a separate
    patch since it is not technically needed for this bug fix.

    Change-Id: Ibd1cbe0e9b5cf3ede542dbf62b1a7d503ba7ea06
    Closes-Bug: #1759924

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/626407
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3e38d1cf16ca29948be499aa37e2494ffd001f12
Submitter: Zuul
Branch: stable/rocky

commit 3e38d1cf16ca29948be499aa37e2494ffd001f12
Author: Matt Riedemann <email address hidden>
Date: Mon Apr 9 16:12:19 2018 -0400

    Update port device_owner when unshelving

    Change-Id: Ibd1cbe0e9b5cf3ede542dbf62b1a7d503ba7ea06
    Closes-Bug: #1759924
    (cherry picked from commit 93802619adde69bf84d26d7231480abb4da07c91)

Reviewed: https://review.openstack.org/626408
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=245364ece1689853bf33ac25d7319618d064d909
Submitter: Zuul
Branch: stable/queens

commit 245364ece1689853bf33ac25d7319618d064d909
Author: Matt Riedemann <email address hidden>
Date: Mon Apr 9 16:12:19 2018 -0400

    Update port device_owner when unshelving

    Change-Id: Ibd1cbe0e9b5cf3ede542dbf62b1a7d503ba7ea06
    Closes-Bug: #1759924
    (cherry picked from commit 93802619adde69bf84d26d7231480abb4da07c91)
    (cherry picked from commit 3e38d1cf16ca29948be499aa37e2494ffd001f12)

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

This issue was fixed in the openstack/nova 17.0.10 release.

This issue was fixed in the openstack/nova 18.2.0 release.

Reviewed: https://review.openstack.org/626409
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c7bb9b1652b6df3e0a353a5c9f4cf70299c4e5e7
Submitter: Zuul
Branch: stable/pike

commit c7bb9b1652b6df3e0a353a5c9f4cf70299c4e5e7
Author: Matt Riedemann <email address hidden>
Date: Mon Apr 9 16:12:19 2018 -0400

    Update port device_owner when unshelving

    Change-Id: Ibd1cbe0e9b5cf3ede542dbf62b1a7d503ba7ea06
    Closes-Bug: #1759924
    (cherry picked from commit 93802619adde69bf84d26d7231480abb4da07c91)
    (cherry picked from commit 3e38d1cf16ca29948be499aa37e2494ffd001f12)
    (cherry picked from commit 245364ece1689853bf33ac25d7319618d064d909)

Reviewed: https://review.openstack.org/626413
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0b3e45387ce1979b394b38d6e124e001e2898c1a
Submitter: Zuul
Branch: stable/ocata

commit 0b3e45387ce1979b394b38d6e124e001e2898c1a
Author: Matt Riedemann <email address hidden>
Date: Mon Apr 9 16:12:19 2018 -0400

    Update port device_owner when unshelving

    Change-Id: Ibd1cbe0e9b5cf3ede542dbf62b1a7d503ba7ea06
    Closes-Bug: #1759924
    (cherry picked from commit 93802619adde69bf84d26d7231480abb4da07c91)
    (cherry picked from commit 3e38d1cf16ca29948be499aa37e2494ffd001f12)
    (cherry picked from commit 245364ece1689853bf33ac25d7319618d064d909)
    (cherry picked from commit c7bb9b1652b6df3e0a353a5c9f4cf70299c4e5e7)

This issue was fixed in the openstack/nova 16.1.8 release.
