[SRU]_heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server

Bug #1751923 reported by Maciej Jozefczyk
This bug affects 12 people
| Affects                  | Status       | Importance | Assigned to      | Milestone |
| OpenStack Compute (nova) | Fix Released | Medium     | Maciej Jozefczyk |           |
| Ubuntu Cloud Archive     | Fix Released | Undecided  | Unassigned       |           |
|   Queens                 | Fix Released | Undecided  | Jorge Niedbalski |           |
|   Rocky                  | Fix Released | Undecided  | Unassigned       |           |
|   Stein                  | Fix Released | Undecided  | Unassigned       |           |
| nova (Ubuntu)            | Fix Released | Medium     | Unassigned       |           |
|   Bionic                 | Fix Released | Medium     | Jorge Niedbalski |           |
|   Disco                  | Fix Released | Medium     | Unassigned       |           |

Bug Description

[Impact]

* During the periodic task _heal_instance_info_cache, the instance_info_caches are not updated using instance port_ids taken from neutron, but from the nova db.
* This causes existing VMs to lose their network interfaces after a reboot.

[Test Plan]

* This bug is reproducible on Bionic/Queens clouds.

1) Deploy the following Juju bundle: https://paste.ubuntu.com/p/HgsqZfsDGh/
2) Run the following script: https://paste.ubuntu.com/p/c4VDkqyR2z/
3) If the script finishes with "Port not found", the bug is still present.

[Where problems could occur]

Instances created prior to the OpenStack Newton release that have more than one interface will not have the associated information in the virtual_interfaces table that is required to repopulate the cache with interfaces in the same order in which they were originally attached. In the unlikely event that this occurs and you are using OpenStack release Queens or Rocky, it will be necessary to populate this table manually. OpenStack Stein has a patch that adds support for generating this data. As things stand, the guest will be unable to identify its network information at all if the cache gets purged, and the risk that a VM was created prior to Newton is hopefully low, so we expect the potential for this regression to be very low.

[Discussion]
SRU team, please review the most recent version of nova 2:17.0.13-0ubuntu3 in the unapproved queue. The older version can be rejected.

------------------------------------------------------------------------------

Description
===========

During periodic task _heal_instance_info_cache the
instance_info_caches are not updated using instance port_ids taken
from neutron, but from nova db.

Sometimes, perhaps because of a race condition, it's possible to
lose some ports from instance_info_caches. The periodic task
_heal_instance_info_cache should clean this up (add the missing records),
but in fact it does not work this way.

How does it look now?
=====================

The _heal_instance_info_cache periodic task:

https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525

is using network_api to get instance_nw_info (instance_info_caches):

          try:
              # Call to network API to get instance info.. this will
              # force an update to the instance's info_cache
              self.network_api.get_instance_nw_info(context, instance)

self.network_api.get_instance_nw_info() is listed below:

https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377

and it uses _build_network_info_model() without networks and port_ids
parameters (because we're not adding any new interface to instance):

https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356

Next: _gather_port_ids_and_networks() generates the list of instance
networks and port_ids:

    networks, port_ids = self._gather_port_ids_and_networks(
              context, instance, networks, port_ids, client)

https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390

https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393

As we can see, _gather_port_ids_and_networks() takes the port list
from the DB:

https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176

And that's it. When we lose a port, it's not possible to add it again with this periodic task.
The only way is to clear the device_id field in the neutron port object and re-attach the interface using `nova interface-attach`.

When the interface is missing and there is no port configured on the
compute host (for example after a compute reboot), the interface is not
added to the instance, and from neutron's point of view the port state is DOWN.

When the interface is missing in the cache and we hard-reboot the instance,
it is not added as a tap interface in the XML file, so we don't have the
network on the host.
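
The behaviour described in this section can be illustrated with a minimal sketch. It is not nova's code: `gather_port_ids` and the `port-*` IDs are hypothetical, and the logic is reduced to the one point that matters here, namely that the port list is seeded from the cached entries, so a port that has dropped out of the cache is never considered again, whatever neutron reports.

```python
def gather_port_ids(cached_port_ids, neutron_port_ids):
    """Hypothetical model of a cache-driven refresh (not nova's code).

    Only ports already present in the cache are kept; the neutron list
    is used to confirm them, never to discover additional ones.
    """
    known_to_neutron = set(neutron_port_ids)
    return [pid for pid in cached_port_ids if pid in known_to_neutron]


# Made-up port IDs: neutron knows four ports, the cache has lost "port-a".
neutron_ports = ["port-a", "port-b", "port-c", "port-d"]
cached_ports = ["port-b", "port-c", "port-d"]

refreshed = gather_port_ids(cached_ports, neutron_ports)
print(refreshed)
```

Under these assumptions the refreshed list is identical to the cached one, which is why repeated runs of the task do not bring the lost port back.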

Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresses
6. See that one port is missing in instance xml files
7. In theory _heal_instance_info_cache should fix this, but it relies on the port list from the nova DB, not on a fresh list of instance ports taken from neutron.

Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |

So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$

We can also see 4 tap interfaces in xml file:

stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
    <target dev='tap6c230305-43'/>
    <target dev='tapb89d6863-fb'/>
    <target dev='tapa74c9ee8-c4'/>
    <target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$

3. Now let's 'corrupt' the instance_info_caches for this specific VM.
We also noticed a race condition that causes the same problem, but
we're unable to reproduce it in a development environment.

Original one:

----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
 created_at: 2018-02-26 21:25:31
 updated_at: 2018-02-26 21:29:17
 deleted_at: NULL
         id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": 
"10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": 
"2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
    deleted: 0
1 row in set (0.00 sec)
----

Modified one (I removed the first port from the list):
tap6c230305-43

----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
 created_at: 2018-02-26 21:25:31
 updated_at: 2018-02-26 21:29:17
 deleted_at: NULL
         id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": 
"9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
    deleted: 0
----
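
The manual edit shown above amounts to removing one entry from the JSON list stored in network_info. A minimal sketch of that transformation follows; `drop_vif` is a hypothetical helper, and the sample value is a heavily trimmed stand-in for the real column, keeping only the "id" and "devname" keys.

```python
import json


def drop_vif(network_info_json, devname):
    """Return the network_info JSON without the entry matching devname."""
    vifs = json.loads(network_info_json)
    kept = [vif for vif in vifs if vif.get("devname") != devname]
    return json.dumps(kept)


# Trimmed stand-in for the network_info column (real entries carry many
# more keys, as the SELECT output shows).
original = json.dumps([
    {"id": "6c230305-43f8-42ec-9936-61fe67551168", "devname": "tap6c230305-43"},
    {"id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "devname": "tapb89d6863-fb"},
])

modified = drop_vif(original, "tap6c230305-43")
print([vif["devname"] for vif in json.loads(modified)])
```

The sketch only produces the new column value; writing it back to the database is left out on purpose.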

4. Now let's take a look at `nova list`:

stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$

So, as you can see, one interface (private) is missing.

Nova interface-list shows it (because it calls neutron instead of nova
itself):

stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
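
The mismatch between the two views can also be computed rather than read off by eye. The sketch below is an assumption-laden illustration: `missing_from_cache` is a hypothetical helper, the cached entries are reduced to their "id" key, and the port IDs are the ones from the outputs in this report.

```python
def missing_from_cache(cached_vifs, neutron_port_ids):
    """Return neutron port IDs that have no entry in the cached VIF list."""
    cached_ids = {vif["id"] for vif in cached_vifs}
    return [pid for pid in neutron_port_ids if pid not in cached_ids]


# Port IDs as listed by `nova interface-list` (neutron's view).
neutron_port_ids = [
    "6c230305-43f8-42ec-9936-61fe67551168",
    "71e6c6ad-8016-450f-93f2-75e7e014084d",
    "a74c9ee8-c426-48ef-890f-3988ecbe95ff",
    "b89d6863-fb4c-405c-89f9-698bd9773ad6",
]
# Entries left in the modified network_info, trimmed to the "id" key.
cached_vifs = [
    {"id": "b89d6863-fb4c-405c-89f9-698bd9773ad6"},
    {"id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff"},
    {"id": "71e6c6ad-8016-450f-93f2-75e7e014084d"},
]

lost = missing_from_cache(cached_vifs, neutron_port_ids)
print(lost)
```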

5. In the meantime, check the logs. Yes, _heal_instance_info_cache
has been running for a while, but without success: the port is still
missing from the instance_info_caches table:

Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}

6. OK, so let's pretend that the customer restarts the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.

7. And now check the connected interfaces. WOOPS, there is no
`tap6c230305-43` on the list ;(

stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
    <target dev='tapb89d6863-fb'/>
    <target dev='tapa74c9ee8-c4'/>
    <target dev='tap71e6c6ad-80'/>
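
The same check can be scripted. The sketch below assumes, based only on the outputs in this report, that each tap device is named "tap" followed by the first 11 characters of the port ID; `tap_name` and `missing_taps` are hypothetical helpers, and the XML fragment is the grep output above rather than a full domain definition.

```python
import re


def tap_name(port_id):
    # Assumed naming rule, inferred from the virsh output in this report.
    return "tap" + port_id[:11]


def missing_taps(port_ids, domain_xml):
    """Return expected tap device names that do not appear in the XML."""
    present = set(re.findall(r"<target dev='(tap[^']+)'/>", domain_xml))
    return [tap_name(pid) for pid in port_ids if tap_name(pid) not in present]


port_ids = [
    "6c230305-43f8-42ec-9936-61fe67551168",
    "71e6c6ad-8016-450f-93f2-75e7e014084d",
    "a74c9ee8-c426-48ef-890f-3988ecbe95ff",
    "b89d6863-fb4c-405c-89f9-698bd9773ad6",
]
xml_after_reboot = """
    <target dev='tapb89d6863-fb'/>
    <target dev='tapa74c9ee8-c4'/>
    <target dev='tap71e6c6ad-80'/>
"""

absent = missing_taps(port_ids, xml_after_reboot)
print(absent)
```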

Environment
===========
Nova master branch, devstack

Changed in nova:
assignee: nobody → Maciej Jozefczyk (maciej.jozefczyk)
summary: - _heal_instance_info_cache base on cache not on ports from neutron side
+ _heal_instance_info_cache periodic task bases on port list from memory,
+ not from neutron server
summary: - _heal_instance_info_cache periodic task bases on port list from memory,
+ _heal_instance_info_cache periodic task bases on port list from nova db,
not from neutron server
description: updated
Matt Riedemann (mriedem) wrote : Re: _heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server

Can we pass some kind of force_refresh parameter (which defaults to existing behavior) which will do the full refresh and then the _heal_instance_info_cache would be the only thing that passes True for that?

I worry about all of the spaghetti code in there and existing usage by different callers, but the heal instance info cache periodic task is meant to be a full refresh based on latest information from neutron, so it seems reasonable to do it in that case.
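
A rough sketch of what such a flag could look like, reduced to plain lists. This is not nova's implementation: `refresh_port_ids` and its arguments are hypothetical, and only the `force_refresh` name and its default come from the proposal above.

```python
def refresh_port_ids(cached_port_ids, neutron_port_ids, force_refresh=False):
    """Hypothetical illustration of the proposed force_refresh flag."""
    if force_refresh:
        # Full refresh: take whatever neutron currently reports.
        return list(neutron_port_ids)
    # Default (existing behaviour): only keep ports the cache already has.
    known_to_neutron = set(neutron_port_ids)
    return [pid for pid in cached_port_ids if pid in known_to_neutron]


# Made-up IDs: the cache has lost "port-a".
cached = ["port-b"]
from_neutron = ["port-a", "port-b"]

default_result = refresh_port_ids(cached, from_neutron)
forced_result = refresh_port_ids(cached, from_neutron, force_refresh=True)
print(default_result, forced_result)
```

Keeping the default at False means existing callers are unaffected, and only the periodic heal task would opt in to the full refresh.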

Maciej Jozefczyk (maciejjozefczyk) wrote :

For me it's okay to add force_refresh=False and use it only in _heal_instance_info_cache. I'll propose a fix doing that. Thanks!

Maciej Jozefczyk (maciejjozefczyk) wrote :

This fix is in conflict with bugfix: 46922068ac167f492dd303efb359d0c649d69118.
We need to think twice about how to fix it.

commit 46922068ac167f492dd303efb359d0c649d69118
Author: Aaron Rosen <email address hidden>
Date: Thu Dec 5 17:28:17 2013 -0800

    Make network_cache more robust with neutron

    Currently, nova treats neutron as the source of truth for which ports are
    attached to an instance which is a false assumption. Because of this
    if someone creates a port in neutron with a device_id that matches one
    of their existing instance_ids that port will eventually show up in
    nova list (through the periodic heal task).

    This problem usually manifests it's self when nova-compute
    calls to neutron to create a port and the request times out (though
    the port is actually created in neutron). When this occurs the instance
    can be rescheduled on another compute node which it will call out to
    neutron again to create a port. In this case two ports will show
    up in the network_cache table (since they have the same instance_id) though
    only one port is attached to the instance.

    This patch addresses this issue by only adding ports to network_cache
    if nova successfully allocated the port (or it was passed in). This
    way these ghost ports are avoided. A follow up patch will come later
    that garbage collects these ports.

    Closes-bug: #1258620
    Closes-bug: #1272195

    Change-Id: I961c224d95291727c8614174de07805a0d0a9e46

melanie witt (melwitt)
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
tags: added: compute neutron
Boris Bobrov (bbobrov) wrote :

We ran into this issue too.

We tried to fix bug https://bugs.launchpad.net/keystone/+bug/968696 by changing policy.json, and at some point our service users had incorrect permissions. We noticed that network information disappeared from "openstack server list". But even after we fixed the service user permissions, the network information was not restored.

Investigation revealed that the periodic job queries the list of ports from neutron. Neutron returned an empty list because of the bad service user permissions, and nova successfully deleted the ports from info_cache. We noticed that and restored the permissions. Neutron started to return a non-empty list again, but nova did not consume it.

We managed to fix it by changing the code and forcing nova to record the ports from the list.

s10 (vlad-esten) wrote :

This bug might be caused by commit https://github.com/openstack/nova/commit/8694c1619d774bb8a6c23ed4c0f33df2084849bc
Nova never repopulates instance_info_cache if it is empty.

Matt Riedemann (mriedem) wrote :

That commit from arosen is from 2013, and this is fixed I think since then:

"""
This problem usually manifests it's self when nova-compute
     calls to neutron to create a port and the request times out (though
     the port is actually created in neutron). When this occurs the instance
     can be rescheduled on another compute node which it will call out to
     neutron again to create a port. In this case two ports will show
     up in the network_cache table (since they have the same instance_id) though
     only one port is attached to the instance.
"""

via this change:

https://review.openstack.org/#/c/520248/

So nova will clean up ports created during a failed build prior to rescheduling.

So I think we should add a force_refresh flag to the _heal_instance_info_cache flow so that we refresh from neutron rather than the nova db.

Matt Riedemann (mriedem) wrote :

FWIW, our public cloud team (Huawei) reported the exact same issue as from comment 4 where the policy changed on the neutron side which resulted in returning no ports for the instance, so nova wiped out the entries from the cache and the heal periodic task didn't fix it.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/591607

Changed in nova:
assignee: Maciej Jozefczyk (maciej.jozefczyk) → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Changed in nova:
assignee: Matt Riedemann (mriedem) → Maciej Jozefczyk (maciej.jozefczyk)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/614167

sean mooney (sean-k-mooney) wrote : Re: _heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server

Hi, I have a customer that filed a downstream bug for this on Newton.

I have been able to reproduce this without any DB surgery; however, I did have to use
curl to send a raw command to the neutron API.

I could try to create a functional test for this if that would help.

Currently the only workaround I have found is to delete the neutron ports
and create new ones. When the instance is in this broken state you cannot use
add/remove port to fix it.

Pastebin log of the script execution:
http://paste.openstack.org/show/735818/

Repro script below.
------------------------------------------------------------------------
#!/bin/bash

set -x

IMAGE="cirros-0.3.5-x86_64-disk"
NETWORK="private"
FLAVOR="m1.nano"
TEMP_TOKEN=$(openstack token issue -c id -f value)
SERVER=$(openstack server create --image ${IMAGE} --flavor ${FLAVOR} --network ${NETWORK} -c id -f value --wait repro-bug)
openstack server show ${SERVER}
PORT_ID=$(openstack port list --device-id ${SERVER} -f value -c id)
openstack port show ${PORT_ID}

#wait for it to be fully up
sleep 10

NEUTRON_ENDPOINT=$(openstack endpoint list --service network -f value -c URL)
curl -X PUT -H "X-Auth-Token:${TEMP_TOKEN}" -d '{ "port":{"device_id":"","device_owner":"", "binding:host_id":"" }}' "${NEUTRON_ENDPOINT}v2.0/ports/${PORT_ID}" | python -mjson.tool
# after this curl command nova and neutron will disagree as to the state of the port.

openstack server reboot --hard --wait ${SERVER}
# after the vm is rebooted the vm will not have an interface attached
openstack server show ${SERVER}
openstack port show ${PORT_ID}

#try to fix the issue by attaching the port again
openstack server add port ${SERVER} ${PORT_ID}
# note this will result in fixing the port in neutron but it will be broken on the nova side
# as a result the vm will still not have an interface attached but neutron will say it is.
openstack server show ${SERVER}
openstack port show ${PORT_ID}

# wait for nova to have time to try and attach the interface
sleep 30
openstack server reboot --hard --wait ${SERVER}

set +x
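
For readers who prefer Python over curl, the key request in the script can be sketched as below. Only the URL layout, the X-Auth-Token header and the JSON body are taken from the script; `build_port_reset_request`, the endpoint, the port ID and the token are hypothetical placeholders, and the Content-Type header is an addition that the curl command does not set. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_port_reset_request(neutron_endpoint, port_id, token):
    """Build (but do not send) the PUT that clears the port's device fields."""
    body = json.dumps(
        {"port": {"device_id": "", "device_owner": "", "binding:host_id": ""}}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{neutron_endpoint}v2.0/ports/{port_id}",
        data=body,
        method="PUT",
        # Content-Type is an assumption added here; the script omits it.
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
    )


request = build_port_reset_request(
    "http://neutron.example:9696/",          # hypothetical endpoint
    "11111111-2222-3333-4444-555555555555",  # hypothetical port ID
    "dummy-token",                           # placeholder, not a real token
)
print(request.get_method(), request.full_url)
```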

Matt Riedemann (mriedem) wrote :

I saw this over IRC last night:

(5:12:51 PM) pacharya_: Hi need some help with nova instance info cache table. Due to some network connectivity issues nova received empty list during the heal instance info cache periodic task and instance cache table got updated with same.
(5:13:12 PM) pacharya_: now the list and get API for that instance does not have any IPs listed
(5:13:20 PM) pacharya_: Does anyone know how to fix this?

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/614167
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3534471c578eda6236e79f43153788c4725a5634
Submitter: Zuul
Branch: master

commit 3534471c578eda6236e79f43153788c4725a5634
Author: Maciej Jozefczyk <email address hidden>
Date: Tue Oct 30 09:58:30 2018 +0000

    Add fill_virtual_interface_list online_data_migration script

    In change [1] we modified _heal_instance_info_cache periodic task
    to use Neutron point of view while rebuilding InstanceInfoCache
    objects.
    The crucial point was how we know the previous order of ports, if
    the cache was broken. We decided to use VirtualInterfaceList objects
    as source of port order.
    For instances older than Newton VirtualInterface objects doesn't
    exist, so we need to introduce a way of creating it.
    This script should be executed while upgrading to Stein release.

    [1] https://review.openstack.org/#/c/591607

    Change-Id: Ic26d4ce3d071691a621d3c925dc5cd436b2005f1
    Related-Bug: 1751923

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/591607
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ba44c155ce1dcefede9741722a0525820d6da2b8
Submitter: Zuul
Branch: master

commit ba44c155ce1dcefede9741722a0525820d6da2b8
Author: Matt Riedemann <email address hidden>
Date: Tue Aug 14 17:57:53 2018 +0800

    Force refresh instance info_cache during heal

    If the instance info_cache is corrupted somehow, like during
    a host reboot and the ports aren't wired up properly or
    a mistaken policy change in neutron results in nova resetting
    the info_cache to an empty list, the _heal_instance_info_cache
    is meant to fix it (once the current state of the ports for
    the instance in neutron is corrected). However, the task is
    currently only refreshing the cache *based* on the current contents
    of the cache, which defeats the purpose of neutron being the source
    of truth for the ports attached to the instance.

    This change makes the _heal_instance_info_cache periodic task
    pass a "force_refresh" kwarg, which defaults to False for backward
    compatibility with other methods that refresh the cache after
    operations like attach/detach interface, and if True will make
    nova get the current state of the ports for the instance from neutron
    and fully rebuild the info_cache.

    To not lose port order in info_cache this change takes original order
    from nova historical data that are stored as VirtualInterfaceList
    objects. For ports that are not registered as VirtualInterfaces
    objects it will add them at the end of port_order list. Due to this
    for instances older than Newton another patch was introduced to fill
    missing VirtualInterface objects in the DB [1].

    Long-term we should be able to refactor some of the older refresh
    code which leverages the cache to instead use the refresh_vif_id
    kwarg so that we do targeted cache updates when we do things like
    attach and detach ports, but that's a change for another day.

    [1] https://review.openstack.org/#/c/614167

    Co-Authored-By: Maciej Jozefczyk <email address hidden>
    Change-Id: I629415236b2447128ae9a980d4ebe730a082c461
    Closes-Bug: #1751923
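
The ordering rule described in the commit message above (keep the order recorded for the virtual interfaces, append ports without such a record at the end) can be sketched as follows. This is an illustration under stated assumptions, not the merged code: `order_ports` and the `port-*` IDs are hypothetical, and both inputs are plain lists of port IDs.

```python
def order_ports(neutron_port_ids, vif_port_ids):
    """Order neutron's ports by recorded VIF order; unrecorded ports go last."""
    current = set(neutron_port_ids)
    # Ports with a VIF record keep their recorded order; records for ports
    # that neutron no longer reports are skipped.
    ordered = [pid for pid in vif_port_ids if pid in current]
    seen = set(ordered)
    # Ports without a VIF record are appended in the order neutron gave them.
    ordered.extend(pid for pid in neutron_port_ids if pid not in seen)
    return ordered


# Made-up IDs: neutron reports four ports, only three have a VIF record.
ordered = order_ports(
    neutron_port_ids=["port-c", "port-a", "port-b", "port-d"],
    vif_port_ids=["port-a", "port-b", "port-c"],
)
print(ordered)
```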

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/640516

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/653040

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/rocky)

Change abandoned by Mohammed Naser (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/653040

Revision history for this message
Edward Hope-Morley (hopem) wrote : Re: _heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server

I also hit this issue on our Queens cloud, where 232 VMs were affected. I created local backports for Q and R, tested the Q backport in our cloud, and can confirm that it automatically resolved the problem, i.e. the periodic task eventually healed all the caches correctly. Therefore I would like to propose this for backport to Q & R.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/679271

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/679274

Changed in nova (Ubuntu Disco):
status: New → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/640516

Revision history for this message
Launchpad Janitor (janitor) wrote : Re: _heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu Bionic):
status: New → Confirmed
Changed in nova (Ubuntu):
status: New → Confirmed
Revision history for this message
Jason Davidson (djason2018) wrote :

We just had ~220 instances lose their ip addresses on one of our Queens clouds. Is anyone still working on a fix for this issue?

Revision history for this message
sean mooney (sean-k-mooney) wrote :

For what it's worth, this has been partially backported downstream in Red Hat OSP.

We backported only the self-healing and not the online data migration, which had a bug in it.

So https://review.opendev.org/#/c/591607/ can be safely backported, but https://review.opendev.org/#/c/614167/20 has a bug and should not be backported.

Changed in nova (Ubuntu):
status: Confirmed → Fix Released
Changed in nova (Ubuntu Bionic):
assignee: nobody → Jorge Niedbalski (niedbalski)
summary: - _heal_instance_info_cache periodic task bases on port list from nova db,
- not from neutron server
+ [SRU]_heal_instance_info_cache periodic task bases on port list from
+ nova db, not from neutron server
description: updated
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Hello,

I've prepared a PPA for testing the proposed patch on B/Queens
https://launchpad.net/~niedbalski/+archive/ubuntu/lp1751923/+packages

Attached is the debdiff for bionic.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :
description: updated
description: updated
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Since Queens is populating the virtual_interfaces table as standard I think we should proceed with this SRU - https://pastebin.ubuntu.com/p/BdCPsVKGk5/ - since it will provide a clean fix for Queens clouds.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

@corey is there anything specific you need from my end to get this SRU reviewed?

Revision history for this message
Edward Hope-Morley (hopem) wrote :

@coreycb I think we have everything we need to proceed with this SRU now. Since Queens is the oldest release currently supported on Ubuntu, and support for populating the vif attach ordering required to rebuild the cache has been available since Newton, I think the risk of anyone being impacted is very small. VMs created prior to Newton would need the patch [1] and eventually [2] backported from Stein, but I don't see them as essential. Given the impact of not having this fix asap, I think it takes priority over those, which we can handle separately.

[1] https://github.com/openstack/nova/commit/3534471c578eda6236e79f43153788c4725a5634
[2] https://bugs.launchpad.net/nova/+bug/1825034

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Restored the bug description to its original format and updated SRU info.

description: updated
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Thanks Jorge. Let's patch rocky as well for upgrade purposes.

Revision history for this message
Corey Bryant (corey.bryant) wrote :
Changed in nova (Ubuntu Bionic):
status: Confirmed → Triaged
status: Triaged → In Progress
description: updated
Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Maciej, or anyone else affected,

Accepted nova into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-rocky-needed
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

I'm a bit worried about the aforementioned regression potential here, but I'll accept it seeing that this was accepted by the OpenStack team. I'd prefer the SRUs to be as safe as possible, offering fallback functionality in case the system is old. I assume this would require the additional commits to be cherry-picked?

Anyway, let's proceed for now.

Changed in nova (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello Maciej, or anyone else affected,

Accepted nova into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:17.0.13-0ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Maciej, or anyone else affected,

Accepted nova into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Mathew Hodson (mhodson)
Changed in nova (Ubuntu):
importance: Undecided → Medium
Changed in nova (Ubuntu Bionic):
importance: Undecided → Medium
Changed in nova (Ubuntu Disco):
importance: Undecided → Medium
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

I am in the process of verifying the bionic/rocky/queens releases.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Hello,

I've verified that this problem doesn't reproduce with the package contained in proposed.

1) Deployed this bundle of bionic-queens

Upgraded to the following version:

root@juju-51d6ad-1751923-6:/home/ubuntu# dpkg -l | grep nova
ii nova-api-os-compute 2:17.0.13-0ubuntu3 all OpenStack Compute - OpenStack Compute API frontend
ii nova-common 2:17.0.13-0ubuntu3 all OpenStack Compute - common files
ii nova-conductor 2:17.0.13-0ubuntu3 all OpenStack Compute - conductor service
ii nova-placement-api 2:17.0.13-0ubuntu3 all OpenStack Compute - placement API frontend
ii nova-scheduler 2:17.0.13-0ubuntu3 all OpenStack Compute - virtual machine scheduler
ii python-nova 2:17.0.13-0ubuntu3 all OpenStack Compute Python libraries

root@juju-51d6ad-1751923-7:/home/ubuntu# dpkg -l | grep nova
ii nova-api-metadata 2:17.0.13-0ubuntu3 all OpenStack Compute - metadata API frontend
ii nova-common 2:17.0.13-0ubuntu3 all OpenStack Compute - common files
ii nova-compute 2:17.0.13-0ubuntu3 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:17.0.13-0ubuntu3 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:17.0.13-0ubuntu3 all OpenStack Compute - compute node libvirt support
ii python-nova 2:17.0.13-0ubuntu3 all OpenStack Compute Python libraries
ii python-novaclient 2:9.1.1-0ubuntu1 all client library for OpenStack Compute API - Python 2.7
ii python3-novaclient 2:9.1.1-0ubuntu1 all client library for OpenStack Compute API - 3.x

root@juju-51d6ad-1751923-6:/home/ubuntu# systemctl status nova*|grep -i active
   Active: active (running) since Fri 2021-08-27 22:02:25 UTC; 1h 7min ago
   Active: active (running) since Fri 2021-08-27 22:02:12 UTC; 1h 8min ago
   Active: active (running) since Fri 2021-08-27 22:02:25 UTC; 1h 7min ago

3) Created a server with 4 private ports, 1 public one.

ubuntu@niedbalski-bastion:~/stsstack-bundles/openstack$ openstack server list
+--------------------------------------+---------------+--------+-------------------------------------------------------------------------------+--------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+---------------+--------+-------------------------------------------------------------------------------+--------+-----------+
| 5843e6b5-e1a7-4208-9f19-1d051c032afb | cirros-23...


tags: added: verification-done-bionic verification-queens-done
removed: verification-needed-bionic verification-queens-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:17.0.13-0ubuntu3

---------------
nova (2:17.0.13-0ubuntu3) bionic; urgency=medium

  * Force refresh instance info_cache during heal (LP: #1751923):
    - d/p/0001-Force-refresh-instance-info_cache-during-heal.patch
    - d/p/0002-remove-deprecated-test_list_vifs_neutron_notimplemented.patch

 -- Jorge Niedbalski <email address hidden> Mon, 17 May 2021 14:25:43 -0400

Changed in nova (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for nova has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Verified rocky-proposed using [Test Plan] with output as follows:

# apt-cache policy nova-common
nova-common:
  Installed: 2:18.3.0-0ubuntu1~cloud3
  Candidate: 2:18.3.0-0ubuntu1~cloud3
  Version table:
 *** 2:18.3.0-0ubuntu1~cloud3 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-proposed/rocky/main amd64 Packages
        100 /var/lib/dpkg/status
     2:17.0.13-0ubuntu3 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
     2:17.0.10-0ubuntu2.1 500
        500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     2:17.0.1-0ubuntu1 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu bionic/main amd64 Packages

I also tested by manually deleting the network_info for a vm then waiting for the periodic task to run - https://pastebin.ubuntu.com/p/7gmZQsvC8H/

tags: added: verification-rocky-done
removed: verification-rocky-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package nova - 2:18.3.0-0ubuntu1~cloud3
---------------

 nova (2:18.3.0-0ubuntu1~cloud3) bionic-rocky; urgency=medium
 .
   * Force refresh instance info_cache during heal (LP: #1751923):
     - d/p/0001-Force-refresh-instance-info_cache-during-heal.patch

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package nova - 2:17.0.13-0ubuntu3~cloud0
---------------

 nova (2:17.0.13-0ubuntu3~cloud0) xenial-queens; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 nova (2:17.0.13-0ubuntu3) bionic; urgency=medium
 .
   * Force refresh instance info_cache during heal (LP: #1751923):
     - d/p/0001-Force-refresh-instance-info_cache-during-heal.patch
     - d/p/0002-remove-deprecated-test_list_vifs_neutron_notimplemented.patch
