2018-02-26 22:16:50 |
Maciej Jozefczyk |
bug |
|
|
added bug |
2018-02-26 22:17:11 |
Maciej Jozefczyk |
nova: assignee |
|
Maciej Jozefczyk (maciej.jozefczyk) |
|
2018-02-26 22:18:39 |
Maciej Jozefczyk |
summary |
_heal_instance_info_cache base on cache not on ports from neutron side |
_heal_instance_info_cache periodic task bases on port list from memory, not from neutron server |
|
2018-02-27 14:10:38 |
Maciej Jozefczyk |
summary |
_heal_instance_info_cache periodic task bases on port list from memory, not from neutron server |
_heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server |
|
2018-02-27 19:42:21 |
Maciej Jozefczyk |
description |
Description
===========
During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from compute host memory.
Sometimes, perhaps because of some race-condition, its possible to lose some ports from instance_info_caches. Periodic task _heal_instance_info_cache should clean this up (add missing records), but in fact it's not working this way.
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
try:
# Call to network API to get instance info.. this will
# force an update to the instance's info_cache
self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance networks and port_ids:
networks, port_ids = self._gather_port_ids_and_networks(
context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on compute host (for example after compute reboot) - interface is not added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance - its not added as tapinterface in xml file = we don't have the network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresses
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tap6c230305-43'/>
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM. We also noticed some race-condition that cause the same problem, but we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the _heal_instance_info_cache has been running for a while but without success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no `tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
Description
===========
During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
Sometimes, perhaps because of some race-condition, its possible to lose some ports from instance_info_caches. Periodic task _heal_instance_info_cache should clean this up (add missing records), but in fact it's not working this way.
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
try:
# Call to network API to get instance info.. this will
# force an update to the instance's info_cache
self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance networks and port_ids:
networks, port_ids = self._gather_port_ids_and_networks(
context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on compute host (for example after compute reboot) - interface is not added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance - its not added as tapinterface in xml file = we don't have the network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresses
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tap6c230305-43'/>
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM. We also noticed some race-condition that cause the same problem, but we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the _heal_instance_info_cache has been running for a while but without success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no `tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
|
2018-04-05 16:22:26 |
melanie witt |
nova: importance |
Undecided |
Medium |
|
2018-04-05 16:22:26 |
melanie witt |
nova: status |
New |
Confirmed |
|
2018-04-05 16:22:46 |
melanie witt |
tags |
|
compute neutron |
|
2018-06-08 13:56:32 |
Boris Bobrov |
bug |
|
|
added subscriber Boris Bobrov |
2018-08-14 10:02:28 |
OpenStack Infra |
nova: status |
Confirmed |
In Progress |
|
2018-08-14 10:02:28 |
OpenStack Infra |
nova: assignee |
Maciej Jozefczyk (maciej.jozefczyk) |
Matt Riedemann (mriedem) |
|
2018-08-24 07:23:13 |
Fan Zhang |
bug |
|
|
added subscriber Fan Zhang |
2018-10-25 14:23:21 |
OpenStack Infra |
nova: assignee |
Matt Riedemann (mriedem) |
Maciej Jozefczyk (maciej.jozefczyk) |
|
2018-11-21 13:37:03 |
s10 |
bug |
|
|
added subscriber s10 |
2019-01-31 14:02:00 |
OpenStack Infra |
nova: status |
In Progress |
Fix Released |
|
2019-08-29 14:37:22 |
Edward Hope-Morley |
bug task added |
|
nova (Ubuntu) |
|
2019-08-29 14:37:35 |
Edward Hope-Morley |
bug task added |
|
cloud-archive |
|
2019-08-29 14:37:49 |
Edward Hope-Morley |
nominated for series |
|
cloud-archive/rocky |
|
2019-08-29 14:37:49 |
Edward Hope-Morley |
bug task added |
|
cloud-archive/rocky |
|
2019-08-29 14:37:49 |
Edward Hope-Morley |
nominated for series |
|
cloud-archive/queens |
|
2019-08-29 14:37:49 |
Edward Hope-Morley |
bug task added |
|
cloud-archive/queens |
|
2019-08-29 14:37:49 |
Edward Hope-Morley |
nominated for series |
|
cloud-archive/stein |
|
2019-08-29 14:37:49 |
Edward Hope-Morley |
bug task added |
|
cloud-archive/stein |
|
2019-08-29 14:38:18 |
Edward Hope-Morley |
nominated for series |
|
Ubuntu Disco |
|
2019-08-29 14:38:18 |
Edward Hope-Morley |
bug task added |
|
nova (Ubuntu Disco) |
|
2019-08-29 14:38:18 |
Edward Hope-Morley |
nominated for series |
|
Ubuntu Bionic |
|
2019-08-29 14:38:18 |
Edward Hope-Morley |
bug task added |
|
nova (Ubuntu Bionic) |
|
2019-08-29 14:41:40 |
Edward Hope-Morley |
cloud-archive/stein: status |
New |
Fix Released |
|
2019-08-29 14:52:07 |
Edward Hope-Morley |
nova (Ubuntu Disco): status |
New |
Fix Released |
|
2019-08-30 08:15:49 |
Rahul Sachan |
bug |
|
|
added subscriber Rahul Sachan |
2019-11-05 13:44:51 |
Launchpad Janitor |
nova (Ubuntu): status |
New |
Confirmed |
|
2019-11-05 13:44:51 |
Launchpad Janitor |
nova (Ubuntu Bionic): status |
New |
Confirmed |
|
2020-03-25 15:03:12 |
Jason Davidson |
bug |
|
|
added subscriber Jason Davidson |
2021-05-15 21:29:16 |
Jorge Niedbalski |
nova (Ubuntu): status |
Confirmed |
Fix Released |
|
2021-05-15 21:29:36 |
Jorge Niedbalski |
cloud-archive/queens: status |
New |
In Progress |
|
2021-05-15 21:29:44 |
Jorge Niedbalski |
cloud-archive/queens: assignee |
|
Jorge Niedbalski (niedbalski) |
|
2021-05-15 21:29:51 |
Jorge Niedbalski |
nova (Ubuntu Bionic): assignee |
|
Jorge Niedbalski (niedbalski) |
|
2021-05-15 21:30:13 |
Jorge Niedbalski |
summary |
_heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server |
[SRU]_heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server |
|
2021-05-15 21:43:35 |
Jorge Niedbalski |
description |
Description
===========
During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
Sometimes, perhaps because of some race-condition, its possible to lose some ports from instance_info_caches. Periodic task _heal_instance_info_cache should clean this up (add missing records), but in fact it's not working this way.
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
try:
# Call to network API to get instance info.. this will
# force an update to the instance's info_cache
self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance networks and port_ids:
networks, port_ids = self._gather_port_ids_and_networks(
context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on compute host (for example after compute reboot) - interface is not added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance - its not added as tapinterface in xml file = we don't have the network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresses
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tap6c230305-43'/>
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM. We also noticed some race-condition that cause the same problem, but we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the _heal_instance_info_cache has been running for a while but without success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no `tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
[Impact]
* During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
* This causes that existing VMs to loose their network interfaces after reboot.
[Test Plan]
* This bug is reproducible on Bionic/Queens clouds.
1) Deploy the following Juju bundle: https://paste.ubuntu.com/p/HgsqZfsDGh/
2) Run the following script: https://paste.ubuntu.com/p/DrFcDXZGSt/
3) If the script finishes with "Port not found" , the bug is still present.
[Where problems could occur]
** Check the other info section ***
[Other Info]
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0try:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Call to network API to get instance info.. this will
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# force an update to the instance's info_cache
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance networks and port_ids:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0networks, port_ids = self._gather_port_ids_and_networks(
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on compute host (for example after compute reboot) - interface is not added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance - its not added as tapinterface in xml file = we don't have the network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresses
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap6c230305-43'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM. We also noticed some race-condition that cause the same problem, but we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the _heal_instance_info_cache has been running for a while but without success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no `tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
|
2021-05-17 08:16:20 |
Dominique Poulain |
bug |
|
|
added subscriber Dominique Poulain |
2021-05-17 18:53:27 |
Jorge Niedbalski |
cloud-archive/rocky: status |
In Progress |
Won't Fix |
|
2021-05-17 20:24:51 |
Jorge Niedbalski |
attachment added |
|
lp1751923_bionic.debdiff https://bugs.launchpad.net/nova/+bug/1751923/+attachment/5498309/+files/lp1751923_bionic.debdiff |
|
2021-05-17 20:26:22 |
Jorge Niedbalski |
description |
[Impact]
* During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
* This causes that existing VMs to loose their network interfaces after reboot.
[Test Plan]
* This bug is reproducible on Bionic/Queens clouds.
1) Deploy the following Juju bundle: https://paste.ubuntu.com/p/HgsqZfsDGh/
2) Run the following script: https://paste.ubuntu.com/p/DrFcDXZGSt/
3) If the script finishes with "Port not found" , the bug is still present.
[Where problems could occur]
** Check the other info section ***
[Other Info]
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0try:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Call to network API to get instance info.. this will
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# force an update to the instance's info_cache
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance networks and port_ids:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0networks, port_ids = self._gather_port_ids_and_networks(
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on compute host (for example after compute reboot) - interface is not added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance - its not added as tapinterface in xml file = we don't have the network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresses
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap6c230305-43'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM. We also noticed some race-condition that cause the same problem, but we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the _heal_instance_info_cache has been running for a while but without success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no `tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
[Impact]
* During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
* This causes that existing VMs to loose their network interfaces after reboot.
[Test Plan]
* This bug is reproducible on Bionic/Queens clouds.
1) Deploy the following Juju bundle: https://paste.ubuntu.com/p/HgsqZfsDGh/
2) Run the following script: https://paste.ubuntu.com/p/DrFcDXZGSt/
3) If the script finishes with "Port not found" , the bug is still present.
[Where problems could occur]
** No specific regression potential has been identified.
** Check the other info section ***
[Other Info]
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0try:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Call to network API to get instance info.. this will
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# force an update to the instance's info_cache
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance networks and port_ids:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0networks, port_ids = self._gather_port_ids_and_networks(
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on compute host (for example after compute reboot) - interface is not added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance - its not added as tapinterface in xml file = we don't have the network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresseshttps://launchpad.net/~niedbalski/+archive/ubuntu/lp1751923/+packages
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap6c230305-43'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM. We also noticed some race-condition that cause the same problem, but we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the _heal_instance_info_cache has been running for a while but without success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no `tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
|
2021-05-17 21:36:23 |
Jorge Niedbalski |
description |
[Impact]
* During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
* This causes that existing VMs to loose their network interfaces after reboot.
[Test Plan]
* This bug is reproducible on Bionic/Queens clouds.
1) Deploy the following Juju bundle: https://paste.ubuntu.com/p/HgsqZfsDGh/
2) Run the following script: https://paste.ubuntu.com/p/DrFcDXZGSt/
3) If the script finishes with "Port not found" , the bug is still present.
[Where problems could occur]
** No specific regression potential has been identified.
** Check the other info section ***
[Other Info]
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0try:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Call to network API to get instance info.. this will
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# force an update to the instance's info_cache
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance networks and port_ids:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0networks, port_ids = self._gather_port_ids_and_networks(
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on compute host (for example after compute reboot) - interface is not added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance - its not added as tapinterface in xml file = we don't have the network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresseshttps://launchpad.net/~niedbalski/+archive/ubuntu/lp1751923/+packages
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap6c230305-43'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM. We also noticed some race-condition that cause the same problem, but we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the _heal_instance_info_cache has been running for a while but without success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no `tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
[Impact]
* During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
* This causes that existing VMs to loose their network interfaces after reboot.
[Test Plan]
* This bug is reproducible on Bionic/Queens clouds.
1) Deploy the following Juju bundle: https://paste.ubuntu.com/p/HgsqZfsDGh/
2) Run the following script: https://paste.ubuntu.com/p/c4VDkqyR2z/
3) If the script finishes with "Port not found" , the bug is still present.
[Where problems could occur]
** No specific regression potential has been identified.
** Check the other info section ***
[Other Info]
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0try:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Call to network API to get instance info.. this will
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# force an update to the instance's info_cache
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance networks and port_ids:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0networks, port_ids = self._gather_port_ids_and_networks(
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on compute host (for example after compute reboot) - interface is not added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance - its not added as tapinterface in xml file = we don't have the network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresseshttps://launchpad.net/~niedbalski/+archive/ubuntu/lp1751923/+packages
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap6c230305-43'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM. We also noticed some race-condition that cause the same problem, but we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the _heal_instance_info_cache has been running for a while but without success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no `tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
|
2021-05-18 08:28:32 |
Arif Ali |
bug |
|
|
added subscriber Arif Ali |
2021-06-28 16:00:51 |
Edward Hope-Morley |
description |
[Impact]
* During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
* This causes that existing VMs to loose their network interfaces after reboot.
[Test Plan]
* This bug is reproducible on Bionic/Queens clouds.
1) Deploy the following Juju bundle: https://paste.ubuntu.com/p/HgsqZfsDGh/
2) Run the following script: https://paste.ubuntu.com/p/c4VDkqyR2z/
3) If the script finishes with "Port not found" , the bug is still present.
[Where problems could occur]
** No specific regression potential has been identified.
** Check the other info section ***
[Other Info]
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0try:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Call to network API to get instance info.. this will
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# force an update to the instance's info_cache
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance networks and port_ids:
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0networks, port_ids = self._gather_port_ids_and_networks(
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on compute host (for example after compute reboot) - interface is not added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance - its not added as tapinterface in xml file = we don't have the network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresseshttps://launchpad.net/~niedbalski/+archive/ubuntu/lp1751923/+packages
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap6c230305-43'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM. We also noticed some race-condition that cause the same problem, but we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
\u00a0\u00a0\u00a0created_at: 2018-02-26 21:25:31
\u00a0\u00a0\u00a0updated_at: 2018-02-26 21:29:17
\u00a0\u00a0\u00a0deleted_at: NULL
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0id: 2
\u00a0network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the _heal_instance_info_cache has been running for a while but without success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no `tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapb89d6863-fb'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tapa74c9ee8-c4'/>
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
[Impact]
* During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
* This causes that existing VMs to loose their network interfaces after reboot.
[Test Plan]
* This bug is reproducible on Bionic/Queens clouds.
1) Deploy the following Juju bundle: https://paste.ubuntu.com/p/HgsqZfsDGh/
2) Run the following script: https://paste.ubuntu.com/p/c4VDkqyR2z/
3) If the script finishes with "Port not found" , the bug is still present.
[Where problems could occur]
Instances created prior to the Openstack Newton release that have more than one interface will not have associated information in the virtual_interfaces table that is required to repopulate the cache with interfaces in the same order they were attached prior. In the unlikely event that this occurs and you are using Openstack release Queen or Rocky, it will be necessary to either manually populate this table. Openstack Stein has a patch that adds support for generating this data. Since as things stand the guest will be unable to identify it's network information at all in the event the cache gets purged and given the hopefully low risk that a vm was created prior to Newton we hope the potential for this regression is very low.
------------------------------------------------------------------------------
Description
===========
During periodic task _heal_instance_info_cache the
instance_info_caches are not updated using instance port_ids taken
from neutron, but from nova db.
Sometimes, perhaps because of some race-condition, its possible to
lose some ports from instance_info_caches. Periodic task
_heal_instance_info_cache should clean this up (add missing records),
but in fact it's not working this way.
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
try:
# Call to network API to get instance info.. this will
# force an update to the instance's info_cache
self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids
parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance
networks and port_ids:
networks, port_ids = self._gather_port_ids_and_networks(
context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list
from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on
compute host (for example after compute reboot) - interface is not
added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance
- its not added as tapinterface in xml file = we don't have the
network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresses
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tap6c230305-43'/>
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM.
We also noticed some race-condition that cause the same problem, but
we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova
itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the
_heal_instance_info_cache has been running for a while but without
success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no
`tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
|
2021-06-28 17:34:25 |
Corey Bryant |
cloud-archive/rocky: status |
Won't Fix |
In Progress |
|
2021-06-28 17:41:12 |
Corey Bryant |
bug |
|
|
added subscriber Ubuntu Stable Release Updates Team |
2021-06-28 17:41:29 |
Corey Bryant |
nova (Ubuntu Bionic): status |
Confirmed |
Triaged |
|
2021-06-28 17:41:32 |
Corey Bryant |
nova (Ubuntu Bionic): status |
Triaged |
In Progress |
|
2021-06-28 17:50:49 |
Corey Bryant |
description |
[Impact]
* During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
* This causes that existing VMs to loose their network interfaces after reboot.
[Test Plan]
* This bug is reproducible on Bionic/Queens clouds.
1) Deploy the following Juju bundle: https://paste.ubuntu.com/p/HgsqZfsDGh/
2) Run the following script: https://paste.ubuntu.com/p/c4VDkqyR2z/
3) If the script finishes with "Port not found" , the bug is still present.
[Where problems could occur]
Instances created prior to the Openstack Newton release that have more than one interface will not have associated information in the virtual_interfaces table that is required to repopulate the cache with interfaces in the same order they were attached prior. In the unlikely event that this occurs and you are using Openstack release Queen or Rocky, it will be necessary to either manually populate this table. Openstack Stein has a patch that adds support for generating this data. Since as things stand the guest will be unable to identify it's network information at all in the event the cache gets purged and given the hopefully low risk that a vm was created prior to Newton we hope the potential for this regression is very low.
------------------------------------------------------------------------------
Description
===========
During periodic task _heal_instance_info_cache the
instance_info_caches are not updated using instance port_ids taken
from neutron, but from nova db.
Sometimes, perhaps because of some race-condition, its possible to
lose some ports from instance_info_caches. Periodic task
_heal_instance_info_cache should clean this up (add missing records),
but in fact it's not working this way.
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
try:
# Call to network API to get instance info.. this will
# force an update to the instance's info_cache
self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids
parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance
networks and port_ids:
networks, port_ids = self._gather_port_ids_and_networks(
context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list
from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on
compute host (for example after compute reboot) - interface is not
added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance
- its not added as tapinterface in xml file = we don't have the
network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresses
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tap6c230305-43'/>
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM.
We also noticed some race-condition that cause the same problem, but
we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova
itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the
_heal_instance_info_cache has been running for a while but without
success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no
`tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
[Impact]
* During periodic task _heal_instance_info_cache the instance_info_caches are not updated using instance port_ids taken from neutron, but from nova db.
* This causes that existing VMs to loose their network interfaces after reboot.
[Test Plan]
* This bug is reproducible on Bionic/Queens clouds.
1) Deploy the following Juju bundle: https://paste.ubuntu.com/p/HgsqZfsDGh/
2) Run the following script: https://paste.ubuntu.com/p/c4VDkqyR2z/
3) If the script finishes with "Port not found" , the bug is still present.
[Where problems could occur]
Instances created prior to the Openstack Newton release that have more than one interface will not have associated information in the virtual_interfaces table that is required to repopulate the cache with interfaces in the same order they were attached prior. In the unlikely event that this occurs and you are using Openstack release Queen or Rocky, it will be necessary to either manually populate this table. Openstack Stein has a patch that adds support for generating this data. Since as things stand the guest will be unable to identify it's network information at all in the event the cache gets purged and given the hopefully low risk that a vm was created prior to Newton we hope the potential for this regression is very low.
[Discussion]
SRU team, please review the most recent version of nova 2:17.0.13-0ubuntu3 in the unapproved queue. The older version can be rejected.
------------------------------------------------------------------------------
Description
===========
During periodic task _heal_instance_info_cache the
instance_info_caches are not updated using instance port_ids taken
from neutron, but from nova db.
Sometimes, perhaps because of some race-condition, its possible to
lose some ports from instance_info_caches. Periodic task
_heal_instance_info_cache should clean this up (add missing records),
but in fact it's not working this way.
How it looks now?
=================
_heal_instance_info_cache during crontask:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525
is using network_api to get instance_nw_info (instance_info_caches):
try:
# Call to network API to get instance info.. this will
# force an update to the instance's info_cache
self.network_api.get_instance_nw_info(context, instance)
self.network_api.get_instance_nw_info() is listed below:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377
and it uses _build_network_info_model() without networks and port_ids
parameters (because we're not adding any new interface to instance):
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356
Next: _gather_port_ids_and_networks() generates the list of instance
networks and port_ids:
networks, port_ids = self._gather_port_ids_and_networks(
context, instance, networks, port_ids, client)
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393
As we see that _gather_port_ids_and_networks() takes the port list
from DB:
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176
And thats it. When we lose a port its not possible to add it again with this periodic task.
The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.
When the interface is missing and there is no port configured on
compute host (for example after compute reboot) - interface is not
added to instance and from neutron point of view port state is DOWN.
When the interface is missing in cache and we reboot hard the instance
- its not added as tapinterface in xml file = we don't have the
network on host.
Steps to reproduce
==================
1. Spawn devstack
2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
3. Update the DB row, drop one interface from interfaces_list
4. Hard-Reboot the instance
5. See that nova list shows instance without one address, but nova interface-list shows all addresses
6. See that one port is missing in instance xml files
7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.
Reproduced Example
==================
1. Spawn VM with 1 private network port
nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
2. Attach ports to have 2 private and 2 public interfaces
nova list:
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
So we see 4 ports:
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
We can also see 4 tap interfaces in xml file:
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tap6c230305-43'/>
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
stack@mjozefcz-devstack-ptg:~$
3. Now lets 'corrupt' the instance_info_caches for this specific VM.
We also noticed some race-condition that cause the same problem, but
we're unable to reproduce it in devel environment.
Original one:
---
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
1 row in set (0.00 sec)
----
Modified one (I removed first port from list):
tap6c230305-43
----
mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
*************************** 1. row ***************************
created_at: 2018-02-26 21:25:31
updated_at: 2018-02-26 21:29:17
deleted_at: NULL
id: 2
network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
deleted: 0
----
4. Now lets take a look on `nova list`:
stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
| a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
stack@mjozefcz-devstack-ptg:~$
So as you see we missed one interface (private).
Nova interface-list shows it (because it calls neutron instead nova
itself):
stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
| ACTIVE | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
| ACTIVE | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c | fa:16:3e:6d:dc:85 |
| ACTIVE | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e | fa:16:3e:cf:0c:e0 |
| ACTIVE | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
stack@mjozefcz-devstack-ptg:~$
5. During this time check the logs - yes, the
_heal_instance_info_cache has been running for a while but without
success - stil missing port in instance_info_caches table:
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}
5. Ok, so lets pretend that customer restart the VM.
stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
Request to reboot server <Server: test-2> has been accepted.
6. And now check connected interfaces - WOOPS there is no
`tap6c230305-43` on the list ;(
stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
<target dev='tapb89d6863-fb'/>
<target dev='tapa74c9ee8-c4'/>
<target dev='tap71e6c6ad-80'/>
Environment
===========
Nova master branch, devstack |
|
2021-06-29 13:36:02 |
Corey Bryant |
cloud-archive/rocky: status |
In Progress |
Fix Committed |
|
2021-06-29 13:36:05 |
Corey Bryant |
tags |
compute neutron |
compute neutron verification-rocky-needed |
|
2021-07-05 13:15:12 |
Łukasz Zemczak |
nova (Ubuntu Bionic): status |
In Progress |
Fix Committed |
|
2021-07-05 13:15:17 |
Łukasz Zemczak |
bug |
|
|
added subscriber SRU Verification |
2021-07-05 13:15:24 |
Łukasz Zemczak |
tags |
compute neutron verification-rocky-needed |
compute neutron verification-needed verification-needed-bionic verification-rocky-needed |
|
2021-07-06 21:17:30 |
Corey Bryant |
cloud-archive/queens: status |
In Progress |
Fix Committed |
|
2021-07-06 21:17:32 |
Corey Bryant |
tags |
compute neutron verification-needed verification-needed-bionic verification-rocky-needed |
compute neutron verification-needed verification-needed-bionic verification-queens-needed verification-rocky-needed |
|
2021-07-20 11:10:01 |
Christian Rohmann |
bug |
|
|
added subscriber Christian Rohmann |
2021-07-23 00:41:20 |
Mathew Hodson |
nova (Ubuntu): importance |
Undecided |
Medium |
|
2021-07-23 00:41:23 |
Mathew Hodson |
nova (Ubuntu Bionic): importance |
Undecided |
Medium |
|
2021-07-23 00:41:28 |
Mathew Hodson |
nova (Ubuntu Disco): importance |
Undecided |
Medium |
|
2021-08-27 23:49:16 |
Jorge Niedbalski |
tags |
compute neutron verification-needed verification-needed-bionic verification-queens-needed verification-rocky-needed |
compute neutron verification-done-bionic verification-needed verification-queens-done verification-rocky-needed |
|
2021-08-30 08:16:05 |
Łukasz Zemczak |
removed subscriber Ubuntu Stable Release Updates Team |
|
|
|
2021-08-30 08:16:03 |
Launchpad Janitor |
nova (Ubuntu Bionic): status |
Fix Committed |
Fix Released |
|
2021-09-01 12:26:08 |
Edward Hope-Morley |
tags |
compute neutron verification-done-bionic verification-needed verification-queens-done verification-rocky-needed |
compute neutron verification-done-bionic verification-needed verification-queens-done verification-rocky-done |
|
2021-09-01 13:27:42 |
Corey Bryant |
cloud-archive/rocky: status |
Fix Committed |
Fix Released |
|
2021-09-01 13:46:00 |
Corey Bryant |
cloud-archive/queens: status |
Fix Committed |
Fix Released |
|