Neutron dhcp not coming up after lock unlock compute host

Bug #1836252 reported by Peng Peng on 2019-07-11
Affects: StarlingX
Importance: High
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
tenant-mgmt-net is not reachable via ping from an external network, such as the NatBox.

Severity
--------
Major

Steps to Reproduce
------------------
When the system is fully up, ping a VM on tenant-mgmt-net from an external host.

TC-name:

Expected Behavior
------------------
VMs on tenant-mgmt-net are reachable via ping from the external network.

Actual Behavior
----------------
Pings to the VM's tenant-mgmt-net address fail with "Destination Host Unreachable".

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Lab-name: WCP_113-121

Branch/Pull Time/Commit
-----------------------
stx master as of Lab Load: 20190711T013000Z

Last Pass
---------
Lab: WCP_113_121
Load: 20190705T013000Z

Timestamp/Logs
--------------
[2019-07-11 17:19:26,682] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-07-11 17:19:28,699] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+---------------------------+--------+----------------------------------------------------------------------------------------+-------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+---------------------------+--------+----------------------------------------------------------------------------------------+-------+-----------+
| da0d6db3-b043-4f06-8495-7614fcbe5d4f | tenant1-multiports_base-3 | ACTIVE | internal0-net0-1=10.1.1.136; tenant1-mgmt-net=192.168.87.81; tenant1-net2=172.16.2.227 | | dedicated |
+--------------------------------------+---------------------------+--------+----------------------------------------------------------------------------------------+-------+-----------+
[sysadmin@controller-0 ~(keystone_admin)]$
[2019-07-11 17:19:28,699] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-07-11 17:19:28,802] 423 DEBUG MainThread ssh.expect :: Output:
0
[sysadmin@controller-0 ~(keystone_admin)]$
[2019-07-11 17:19:28,802] 1654 DEBUG MainThread network_helper._get_net_ips_for_vms:: targeted ips for vm: ['192.168.87.81']
[2019-07-11 17:19:28,802] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'da0d6db3-b043-4f06-8495-7614fcbe5d4f': ['192.168.87.81']}
[2019-07-11 17:19:28,802] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.87.81 from host 128.224.186.181
[2019-07-11 17:19:28,803] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-07-11 17:19:28,803] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.87.81'
[2019-07-11 17:19:31,912] 423 DEBUG MainThread ssh.expect :: Output:
PING 192.168.87.81 (192.168.87.81) 56(84) bytes of data.
From 10.10.87.2 icmp_seq=1 Destination Host Unreachable
From 10.10.87.2 icmp_seq=2 Destination Host Unreachable
From 10.10.87.2 icmp_seq=3 Destination Host Unreachable

--- 192.168.87.81 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2016ms
pipe 3
svc-cgcsauto@tis-lab-nat-box:~$

Test Activity
-------------
Sanity

Matt Peters (mpeters-wrs) wrote :

Do you have the additional router reachability information normally collected for this type of issue? Was the VM just not reachable externally, i.e. did each VM have connectivity to the gateway and other VMs? Were the virtual router gateways reachable from an external endpoint? If you have this information, can you please add it to this LP.

Ghada Khalil (gkhalil) on 2019-07-15
tags: added: stx.networking
Changed in starlingx:
status: New → Incomplete
Ghada Khalil (gkhalil) on 2019-07-16
Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
Ghada Khalil (gkhalil) on 2019-07-16
tags: added: stx.2.0
tags: added: stx.sanity
Ghada Khalil (gkhalil) wrote :

As per Peng, issue was reproduced on another system (wcp63-66) using the 2019-07-16 cengn load.
Marking as stx.2.0 gating as there are now multiple occurrences; further investigation is required.

Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → High
Joseph Richard (josephrichard) wrote :

The issue was with DHCP not coming up on e195eaf6-9ccd-485d-a03c-bddc05fa9a96 (tenant1-mgmt-net). That network was scheduled to the DHCP agent on compute-1, but the agent was never updated with this information, so it did not launch a dnsmasq process for this port. A namespace for e195eaf6-9ccd-485d-a03c-bddc05fa9a96 did exist on compute-1, indicating that the network had previously been hosted on that compute node and was probably rescheduled away from it and then back to it.
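
The scheduling mismatch described above can be checked from the CLI. A minimal sketch, assuming admin credentials are sourced and `NET` is set to the network UUID from this bug (the agent-list filter and the qdhcp namespace naming are standard neutron/OSC behavior, not taken from the attached logs):

```shell
# Network UUID from the report (tenant1-mgmt-net)
NET=e195eaf6-9ccd-485d-a03c-bddc05fa9a96

# Which DHCP agents does neutron-server believe host this network?
openstack network agent list --network "$NET"

# On the compute node that should host it: is the qdhcp namespace present?
ip netns | grep "qdhcp-$NET"

# ...and is a dnsmasq process actually serving that network?
ps -ef | grep "[d]nsmasq" | grep "$NET"
```

In the failure described here, the first command would show the network scheduled to compute-1's agent and the namespace would exist, while the dnsmasq check would come back empty.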

How frequently is this occurring?

Peng Peng (ppeng) wrote :

So far we have observed it twice.

Joseph Richard (josephrichard) wrote :

Do you have the compute logs from the first time you saw this?

Wendy Mitchell (wmitchellwr) wrote :

Issue apparent in weekly nova regression testcase test_kpi_live_migrate[virtio]

FAIL 20190714 12:25:41 test_kpi_live_migrate[virtio]
Lab: WCP_63_66
Load: 20190713T013000Z

For the following instance
name='instance-00000171' uuid=dafd6ed4-584b-42dc-9a3c-392a041a7fd0

[2019-07-14 12:27:47,762] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------------------+--------+------------------------------------------------------------------------------------------+-------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+----------------------+--------+------------------------------------------------------------------------------------------+-------+----------+
| dafd6ed4-584b-42dc-9a3c-392a041a7fd0 | tenant2-virtio-0-130 | ACTIVE | internal0-net0-1=10.1.1.153; tenant2-mgmt-net=192.168.200.244; tenant2-net0=172.18.0.244 | | virtio-2 |
| 97a93dbc-e59b-4b16-8c9d-3cfb26c3084c | tenant1-virtio-0-129 | ACTIVE | internal0-net0-1=10.1.1.212; tenant1-mgmt-net=192.168.100.150; tenant1-net1=172.16.1.180 | | virtio |

[2019-07-14 12:27:54,084] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.200.244 failed.

{"log":"2019-07-14 12:27:38.893 90 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-3a8081b8-6428-4b00-aba8-af53f5380eeb - - - - -] Port 4243fe83-9615-4416-a4dd-af0cb5ebe8bd updated. Details: {'profile': {}, 'network_qos_policy_id': None, 'qos_policy_id': None, 'allowed_address_pairs': [], 'admin_state_up': True, 'network_id': 'e0a1f447-786d-41fc-815a-839d1d9644f5', 'segmentation_id': 808, 'fixed_ips': [{'subnet_id': '59652421-1905-4e5b-acbb-e272bbcb3549', 'ip_address': '10.1.1.153'}], 'device_owner': u'compute:nova', 'physical_network': u'group0-data0', 'mac_address': 'fa:16:3e:0e:2a:96', 'device': '4243fe83-9615-4416-a4dd-af0cb5ebe8bd', 'port_security_enabled': True, 'port_id': '4243fe83-9615-4416-a4dd-af0cb5ebe8bd', 'network_type': u'vlan', 'security_groups': [u'1e77918c-b860-4fe5-89db-9b7c31b9d526']}\n","stream":"stdout","time":"2019-07-14T12:27:38.8942472Z"}

tags: added: stx.regression
Ghada Khalil (gkhalil) wrote :

Peng, please provide the compute logs as requested above.
Wendy, please also provide the logs. This is the only way to determine if what you saw is the same as the issue originally reported in this bug.

Wendy Mitchell (wmitchellwr) wrote :

The console output logs for the following instance have been attached.
name='instance-00000171' uuid=dafd6ed4-584b-42dc-9a3c-392a041a7fd0

Attaching additional logs as requested
FAIL 20190714 12:25:41 test_kpi_live_migrate[virtio]
Lab: WCP_63_66
Load: 20190713T013000Z

Yang Liu (yliu12) wrote :

Seen this issue multiple times in the past week.
In today's sanity, it started to happen on a 2+2 system after a compute lock/unlock test, and went away after another test that involved lock/unlock of both computes.

# lock/unlock compute-1
[2019-08-06 09:28:39,579] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-1'

[2019-08-06 09:29:12,349] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-1'

# launch a vm and ping failed.
[2019-08-06 09:38:42,027] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --size=2 --image=fdfe03a5-6e99-4f9d-80a2-fb84aa3e0505 --bootable vol-tis-centos-guest-3'

[2019-08-06 09:42:30,383] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.100.174'

# VM console log shows no IP is assigned on eth0 and metadata server is unreachable from VM
[ 69.114638] cloud-init[1146]: Cloud-init v. 0.7.9 running 'init' at Tue, 06 Aug 2019 09:40:12 +0000. Up 69.08 seconds.
[ 69.143033] cloud-init[1146]: ci-info: +++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++
[ 69.146783] cloud-init[1146]: ci-info: +--------+------+-----------+-----------+-------+-------------------+
[ 69.151250] cloud-init[1146]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[ 69.156456] cloud-init[1146]: ci-info: +--------+------+-----------+-----------+-------+-------------------+
[ 69.164276] cloud-init[1146]: ci-info: | lo: | True | 127.0.0.1 | 255.0.0.0 | . | . |
[ 69.173020] cloud-init[1146]: ci-info: | lo: | True | . | . | d | . |
[ 69.178813] cloud-init[1146]: ci-info: | eth0: | True | . | . | . | fa:16:3e:06:6d:c0 |
[ 69.185948] cloud-init[1146]: ci-info: | eth0: | True | . | . | d | fa:16:3e:06:6d:c0 |
[ 69.190294] cloud-init[1146]: ci-info: +--------+------+-----------+-----------+-------+-------------------+

[ 70.231039] cloud-init[1146]: 2019-08-06 05:40:13,541 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [1/30s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x372abd0>: Failed to establish a new connection: [Errno 101] Network is unreachable',))...


Yang Liu (yliu12) wrote :

Updated the title to better reflect the issue.

summary: - tenant-mgmt-net not reachable from external network
+ Neutron dhcp not coming up after lock unlock compute host
Joseph Richard (josephrichard) wrote :

How frequently is this occurring? How many labs is this occurring on? I haven't been able to reproduce yet (~10 compute lock/unlock attempts).

Around the time this occurs in the latest logs, I see a platform CPU alarm on compute-1 and report_state RPCs timing out, so neutron-server thinks some agents are dead. I expect this would not be seen if you increase the platform CPUs on the worker node to more than just cores 0 and 44.
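
A quick sketch of the checks implied by the comment above, assuming shell access to the active controller with credentials sourced (`fm alarm-list` is the StarlingX fault-management CLI; the grep patterns are illustrative, not taken from the logs):

```shell
# Look for platform CPU alarms raised against the worker node:
fm alarm-list | grep -i "cpu"

# Check whether neutron-server considers any agents dead;
# report_state RPC timeouts show up as Alive = XXX in this listing:
openstack network agent list
```

If the DHCP agent on the affected compute shows as not alive around the unlock, that would line up with the report_state timeouts described here.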

Peng Peng (ppeng) wrote :

We are seeing one of the two issues in every sanity run on multiple labs (wcp63-66, wcp7-12, etc.).

Wendy Mitchell (wmitchellwr) wrote :

Seeing this issue also on ip 20-27 when attempting to run this testcase:
nova/test_migrate_vms.py::test_live_migrate_vm_positive[remote-0-0-None-2-volume-False]
The instance appeared to launch successfully but did not get an IP (the ping test failed).

The flavor migration_test has:

Flavor ID: 0406eb20-c8d1-417b-b0e3-fe5a3ac00874
RAM: 1 GB
VCPUs: 2
Disk: 2 GB
aggregate_instance_extra_specs:stx_storage=remote
hw:mem_page_size=2048

Instance:

Name: tenant1-migration_test-1
ID: 24649d91-ff05-4b95-bb05-98a44ae22aa5
Project ID: d836a7dacf224bad9e7c5b007174e712
Status: Active (Running)
Locked: False
Availability Zone: nova
Created: Aug. 19, 2019 (~19:30:18)
Host: compute-2
Networks: tenant1-net8 172.16.8.170; tenant1-mgmt-net 192.168.104.11

Wendy Mitchell (wmitchellwr) wrote :

see neutron-dhcp-agent-compute-2
{"log":"2019-08-19 17:30:44.420 15 ERROR neutron.agent.linux.dhcp [req-4f537417-f064-445d-b36a-89a2e0416261 - - - - -] Failed to start DHCP process for network 162591ee-f650-4ef9-8113-0cdb065721dd: WaitTimeout: Timed out after 60 seconds\n","stream":"stdout","time":"2019-08-19T17:30:44.421121664Z"}
{"log":"2019-08-19 17:30:46.061 15 ERROR neutron.agent.linux.dhcp [req-a2370ffb-a5f9-4faf-8071-e58e59af6778 - - - - -] Failed to start DHCP process for network 09eeee69-47d9-44ee-bb45-c8b90fa17a8b: WaitTimeout: Timed out after 60 seconds\n","stream":"stdout","time":"2019-08-19T17:30:46.062501564Z"}
{"log":"2019-08-19 17:30:49.214 15 ERROR neutron.agent.linux.dhcp [req-29346966-669d-407c-b8ad-bbfff93c03ef - - - - -] Failed to start DHCP process for network 1a9f7e19-5e30-47dc-8dfb-a870e2967ac7: WaitTimeout: Timed out after 60 seconds\n","stream":"stdout","time":"2019-08-19T17:30:49.215251205Z"}


Nova regression testcase failures due to ping failures
h/w Lab: IP_33_36
Load: 2019-08-16_20-59-00

eg, FAIL test_force_lock_with_mig_vms

Yang Liu (yliu12) on 2019-09-09
tags: added: stx.retestneeded
Joseph Richard (josephrichard) wrote :

This was seen on systems that were experiencing critical CPU alarms after an unlock.
An operation in neutron (enabling DHCP for a network) that should take at most a few seconds was exceeding the wait_until_true timeout of 60 seconds, so the agent gives up and DHCP is left non-functional for that network. This can be recovered by removing and re-adding the network from/to the DHCP agent.
I expect this can be mitigated by addressing the root cause of the excessive platform CPU usage during an unlock.
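
The remove/re-add recovery described above maps to standard OpenStackClient commands. A hedged sketch, where `NET` and `AGENT` are placeholders you would fill in from the affected system (the command names are regular OSC syntax; the specific IDs are not from this bug's logs):

```shell
NET=<network-id>        # e.g. the tenant1-mgmt-net UUID
AGENT=<dhcp-agent-id>   # from: openstack network agent list --network "$NET"

# Unschedule the network from the DHCP agent, then schedule it back;
# the agent should respawn a dnsmasq process for the network.
openstack network agent remove network --dhcp "$AGENT" "$NET"
openstack network agent add network --dhcp "$AGENT" "$NET"

# Confirm a dnsmasq process is now serving the network on the host:
ps -ef | grep "[d]nsmasq" | grep "$NET"
```

This only restores service for the affected network; per the comment above, the underlying fix is eliminating the platform CPU saturation during unlock.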
