Neutron dhcp not coming up after lock unlock compute host

Bug #1836252 reported by Peng Peng on 2019-07-11
Affects: StarlingX
Importance: High
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
tenant-mgmt-net is not reachable via ping from an external network, such as the NatBox.

Severity
--------
Major

Steps to Reproduce
------------------
When the system is fully up, ping a VM on tenant-mgmt-net from an external host.

TC-name:

Expected Behavior
------------------
VMs on tenant-mgmt-net are reachable via ping from the external network.

Actual Behavior
----------------
Pings to the VM's tenant-mgmt-net address fail with "Destination Host Unreachable".

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Lab-name: WCP_113-121

Branch/Pull Time/Commit
-----------------------
stx master as of Lab Load: 20190711T013000Z

Last Pass
---------
Lab: WCP_113_121
Load: 20190705T013000Z

Timestamp/Logs
--------------
[2019-07-11 17:19:26,682] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-07-11 17:19:28,699] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+---------------------------+--------+----------------------------------------------------------------------------------------+-------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+---------------------------+--------+----------------------------------------------------------------------------------------+-------+-----------+
| da0d6db3-b043-4f06-8495-7614fcbe5d4f | tenant1-multiports_base-3 | ACTIVE | internal0-net0-1=10.1.1.136; tenant1-mgmt-net=192.168.87.81; tenant1-net2=172.16.2.227 | | dedicated |
+--------------------------------------+---------------------------+--------+----------------------------------------------------------------------------------------+-------+-----------+
[sysadmin@controller-0 ~(keystone_admin)]$
[2019-07-11 17:19:28,699] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-07-11 17:19:28,802] 423 DEBUG MainThread ssh.expect :: Output:
0
[sysadmin@controller-0 ~(keystone_admin)]$
[2019-07-11 17:19:28,802] 1654 DEBUG MainThread network_helper._get_net_ips_for_vms:: targeted ips for vm: ['192.168.87.81']
[2019-07-11 17:19:28,802] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'da0d6db3-b043-4f06-8495-7614fcbe5d4f': ['192.168.87.81']}
[2019-07-11 17:19:28,802] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.87.81 from host 128.224.186.181
[2019-07-11 17:19:28,803] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-07-11 17:19:28,803] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.87.81'
[2019-07-11 17:19:31,912] 423 DEBUG MainThread ssh.expect :: Output:
PING 192.168.87.81 (192.168.87.81) 56(84) bytes of data.
From 10.10.87.2 icmp_seq=1 Destination Host Unreachable
From 10.10.87.2 icmp_seq=2 Destination Host Unreachable
From 10.10.87.2 icmp_seq=3 Destination Host Unreachable

--- 192.168.87.81 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2016ms
pipe 3
svc-cgcsauto@tis-lab-nat-box:~$

Test Activity
-------------
Sanity

Matt Peters (mpeters-wrs) wrote :

Do you have the additional router reachability information normally collected for this type of issue? Was the VM just not reachable externally, i.e. did each VM have connectivity to the gateway and other VMs? Were the virtual router gateways reachable from an external endpoint? If you have this information, can you please add it to this LP.

Ghada Khalil (gkhalil) on 2019-07-15
tags: added: stx.networking
Changed in starlingx:
status: New → Incomplete
Ghada Khalil (gkhalil) on 2019-07-16
Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
Ghada Khalil (gkhalil) on 2019-07-16
tags: added: stx.2.0
tags: added: stx.sanity
Ghada Khalil (gkhalil) wrote :

As per Peng, issue was reproduced on another system (wcp63-66) using the 2019-07-16 cengn load.
Marking as stx.2.0 gating as there are now multiple occurrences; further investigation is required.

Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → High
Joseph Richard (josephrichard) wrote :

The issue was with DHCP not coming up on e195eaf6-9ccd-485d-a03c-bddc05fa9a96 (tenant1-mgmt-net). That network was scheduled to the DHCP agent on compute-1, but the agent was never updated with this information, so it did not launch a dnsmasq process for this port. A namespace for e195eaf6-9ccd-485d-a03c-bddc05fa9a96 did exist on compute-1, indicating that the network had previously been hosted on that compute node and was probably rescheduled away from it and then back to it.
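
The scheduling mismatch described above can be checked from the CLI. A minimal sketch, assuming admin credentials are sourced and `NET` is set to the network UUID from this bug (the agent-list filter and the qdhcp namespace naming are standard neutron/OSC behavior, not taken from the attached logs):

```shell
# Network UUID from the report (tenant1-mgmt-net)
NET=e195eaf6-9ccd-485d-a03c-bddc05fa9a96

# Which DHCP agents does neutron-server believe host this network?
openstack network agent list --network "$NET"

# On the compute node that should host it: is the qdhcp namespace present?
ip netns | grep "qdhcp-$NET"

# ...and is a dnsmasq process actually serving that network?
ps -ef | grep "[d]nsmasq" | grep "$NET"
```

In the failure described here, the first command would show the network scheduled to compute-1's agent and the namespace would exist, while the dnsmasq check would come back empty.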

How frequently is this occurring?

Peng Peng (ppeng) wrote :

So far we have observed it twice.

Joseph Richard (josephrichard) wrote :

Do you have the compute logs from the first time you saw this?

Wendy Mitchell (wmitchellwr) wrote :

Issue apparent in weekly nova regression testcase test_kpi_live_migrate[virtio]

FAIL 20190714 12:25:41 test_kpi_live_migrate[virtio]
Lab: WCP_63_66
Load: 20190713T013000Z

For the following instance
name='instance-00000171' uuid=dafd6ed4-584b-42dc-9a3c-392a041a7fd0

[2019-07-14 12:27:47,762] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------------------+--------+------------------------------------------------------------------------------------------+-------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+----------------------+--------+------------------------------------------------------------------------------------------+-------+----------+
| dafd6ed4-584b-42dc-9a3c-392a041a7fd0 | tenant2-virtio-0-130 | ACTIVE | internal0-net0-1=10.1.1.153; tenant2-mgmt-net=192.168.200.244; tenant2-net0=172.18.0.244 | | virtio-2 |
| 97a93dbc-e59b-4b16-8c9d-3cfb26c3084c | tenant1-virtio-0-129 | ACTIVE | internal0-net0-1=10.1.1.212; tenant1-mgmt-net=192.168.100.150; tenant1-net1=172.16.1.180 | | virtio |

[2019-07-14 12:27:54,084] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.200.244 failed.

{"log":"2019-07-14 12:27:38.893 90 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-3a8081b8-6428-4b00-aba8-af53f5380eeb - - - - -] Port 4243fe83-9615-4416-a4dd-af0cb5ebe8bd updated. Details: {'profile': {}, 'network_qos_policy_id': None, 'qos_policy_id': None, 'allowed_address_pairs': [], 'admin_state_up': True, 'network_id': 'e0a1f447-786d-41fc-815a-839d1d9644f5', 'segmentation_id': 808, 'fixed_ips': [{'subnet_id': '59652421-1905-4e5b-acbb-e272bbcb3549', 'ip_address': '10.1.1.153'}], 'device_owner': u'compute:nova', 'physical_network': u'group0-data0', 'mac_address': 'fa:16:3e:0e:2a:96', 'device': '4243fe83-9615-4416-a4dd-af0cb5ebe8bd', 'port_security_enabled': True, 'port_id': '4243fe83-9615-4416-a4dd-af0cb5ebe8bd', 'network_type': u'vlan', 'security_groups': [u'1e77918c-b860-4fe5-89db-9b7c31b9d526']}\n","stream":"stdout","time":"2019-07-14T12:27:38.8942472Z"}

tags: added: stx.regression
Ghada Khalil (gkhalil) wrote :

Peng, please provide the compute logs as requested above.
Wendy, please also provide the logs. This is the only way to determine if what you saw is the same as the issue originally reported in this bug.

Wendy Mitchell (wmitchellwr) wrote :

The console output logs for the following instance have been attached.
name='instance-00000171' uuid=dafd6ed4-584b-42dc-9a3c-392a041a7fd0

Attaching additional logs as requested
FAIL 20190714 12:25:41 test_kpi_live_migrate[virtio]
Lab: WCP_63_66
Load: 20190713T013000Z

Yang Liu (yliu12) wrote :

Seen this issue multiple times in the past week.
In today's sanity, it started to happen on a 2+2 system after a compute lock/unlock test, and went away after another test that involved lock/unlock of both computes.

# lock/unlock compute-1
[2019-08-06 09:28:39,579] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-1'

[2019-08-06 09:29:12,349] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-1'

# launch a vm and ping failed.
[2019-08-06 09:38:42,027] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --size=2 --image=fdfe03a5-6e99-4f9d-80a2-fb84aa3e0505 --bootable vol-tis-centos-guest-3'

[2019-08-06 09:42:30,383] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.100.174'

# VM console log shows no IP is assigned on eth0 and metadata server is unreachable from VM
[ 69.114638] cloud-init[1146]: Cloud-init v. 0.7.9 running 'init' at Tue, 06 Aug 2019 09:40:12 +0000. Up 69.08 seconds.
[ 69.143033] cloud-init[1146]: ci-info: +++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++
[ 69.146783] cloud-init[1146]: ci-info: +--------+------+-----------+-----------+-------+-------------------+
[ 69.151250] cloud-init[1146]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[ 69.156456] cloud-init[1146]: ci-info: +--------+------+-----------+-----------+-------+-------------------+
[ 69.164276] cloud-init[1146]: ci-info: | lo: | True | 127.0.0.1 | 255.0.0.0 | . | . |
[ 69.173020] cloud-init[1146]: ci-info: | lo: | True | . | . | d | . |
[ 69.178813] cloud-init[1146]: ci-info: | eth0: | True | . | . | . | fa:16:3e:06:6d:c0 |
[ 69.185948] cloud-init[1146]: ci-info: | eth0: | True | . | . | d | fa:16:3e:06:6d:c0 |
[ 69.190294] cloud-init[1146]: ci-info: +--------+------+-----------+-----------+-------+-------------------+

[ 70.231039] cloud-init[1146]: 2019-08-06 05:40:13,541 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [1/30s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x372abd0>: Failed to establish a new connection: [Errno 101] Network is unreachable',))...


Yang Liu (yliu12) wrote :

Updated the title to better reflect the issue.

summary: - tenant-mgmt-net not reachable from external network
+ Neutron dhcp not coming up after lock unlock compute host
Joseph Richard (josephrichard) wrote :

How frequently is this occurring? How many labs is this occurring on? I haven't been able to reproduce yet (~10 compute lock/unlock attempts).

Around the time this occurs in the latest logs, I see a platform CPU alarm on compute-1 and report_state RPCs timing out, so neutron-server thinks some agents are dead. I expect this would not be seen if you increase the platform CPUs on the worker node to more than just cores 0 and 44.
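
A quick sketch of the checks implied by the comment above, assuming shell access to the active controller with credentials sourced (`fm alarm-list` is the StarlingX fault-management CLI; the grep patterns are illustrative, not taken from the logs):

```shell
# Look for platform CPU alarms raised against the worker node:
fm alarm-list | grep -i "cpu"

# Check whether neutron-server considers any agents dead;
# report_state RPC timeouts show up as Alive = XXX in this listing:
openstack network agent list
```

If the DHCP agent on the affected compute shows as not alive around the unlock, that would line up with the report_state timeouts described here.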

Peng Peng (ppeng) wrote :

We are seeing one of the two issues in every sanity run on multiple labs (wcp63-66, wcp7-12, etc.).

Wendy Mitchell (wmitchellwr) wrote :

Seeing this issue also on ip 20-27 when attempting to run this testcase:
nova/test_migrate_vms.py::test_live_migrate_vm_positive[remote-0-0-None-2-volume-False]
The instance appeared to launch successfully but did not get an IP (the ping test failed).

The flavor migration_test has:

Flavor ID: 0406eb20-c8d1-417b-b0e3-fe5a3ac00874
RAM: 1 GB
VCPUs: 2
Disk: 2 GB
aggregate_instance_extra_specs:stx_storage=remote
hw:mem_page_size=2048

Instance:

Name: tenant1-migration_test-1
ID: 24649d91-ff05-4b95-bb05-98a44ae22aa5
Project ID: d836a7dacf224bad9e7c5b007174e712
Status: Active (Running)
Locked: False
Availability Zone: nova
Created: Aug. 19, 2019 (~19:30:18)
Host: compute-2
Networks: tenant1-net8 172.16.8.170; tenant1-mgmt-net 192.168.104.11

Wendy Mitchell (wmitchellwr) wrote :

see neutron-dhcp-agent-compute-2
{"log":"2019-08-19 17:30:44.420 15 ERROR neutron.agent.linux.dhcp [req-4f537417-f064-445d-b36a-89a2e0416261 - - - - -] Failed to start DHCP process for network 162591ee-f650-4ef9-8113-0cdb065721dd: WaitTimeout: Timed out after 60 seconds\n","stream":"stdout","time":"2019-08-19T17:30:44.421121664Z"}
{"log":"2019-08-19 17:30:46.061 15 ERROR neutron.agent.linux.dhcp [req-a2370ffb-a5f9-4faf-8071-e58e59af6778 - - - - -] Failed to start DHCP process for network 09eeee69-47d9-44ee-bb45-c8b90fa17a8b: WaitTimeout: Timed out after 60 seconds\n","stream":"stdout","time":"2019-08-19T17:30:46.062501564Z"}
{"log":"2019-08-19 17:30:49.214 15 ERROR neutron.agent.linux.dhcp [req-29346966-669d-407c-b8ad-bbfff93c03ef - - - - -] Failed to start DHCP process for network 1a9f7e19-5e30-47dc-8dfb-a870e2967ac7: WaitTimeout: Timed out after 60 seconds\n","stream":"stdout","time":"2019-08-19T17:30:49.215251205Z"}


Nova regression testcase failures due to ping failures
h/w Lab: IP_33_36
Load: 2019-08-16_20-59-00

eg, FAIL test_force_lock_with_mig_vms

Yang Liu (yliu12) on 2019-09-09
tags: added: stx.retestneeded
Joseph Richard (josephrichard) wrote :

This was seen on systems that were experiencing critical CPU alarms after an unlock.
An operation in neutron (enabling DHCP for a network) that should take at most a few seconds was exceeding the wait_until_true timeout of 60 seconds, so the agent gives up and DHCP is left non-functional for that network. This can be recovered by removing and re-adding the network from/to the DHCP agent.
I expect this can be mitigated by addressing the root cause of the excessive platform CPU usage during an unlock.
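
The remove/re-add recovery described above maps to standard OpenStackClient commands. A hedged sketch, where `NET` and `AGENT` are placeholders you would fill in from the affected system (the command names are regular OSC syntax; the specific IDs are not from this bug's logs):

```shell
NET=<network-id>        # e.g. the tenant1-mgmt-net UUID
AGENT=<dhcp-agent-id>   # from: openstack network agent list --network "$NET"

# Unschedule the network from the DHCP agent, then schedule it back;
# the agent should respawn a dnsmasq process for the network.
openstack network agent remove network --dhcp "$AGENT" "$NET"
openstack network agent add network --dhcp "$AGENT" "$NET"

# Confirm a dnsmasq process is now serving the network on the host:
ps -ef | grep "[d]nsmasq" | grep "$NET"
```

This only restores service for the affected network; per the comment above, the underlying fix is eliminating the platform CPU saturation during unlock.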
