neutron router external gateways unreachable

Bug #1841660 reported by Peng Peng on 2019-08-27
This bug affects 2 people
Affects: StarlingX
Importance: Critical
Assigned to: zhipeng liu

Bug Description

Brief Description
-----------------
During the sanity run, many test cases failed while pinging the tenant management network. Investigation showed that the routers' external gateways are not reachable from the natbox.

Severity
--------
Critical

Steps to Reproduce
------------------
Ping the router external gateway from the natbox.

TC-name: mtc/test_services_persists_over_reboot.py::test_system_persist_over_host_reboot[controller]

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Seen once

System Configuration
--------------------
Multi-node system

Lab-name: WCP_63-66

Branch/Pull Time/Commit
-----------------------
stx 2.0 as of 2019-08-26_20-59-00

Last Pass
---------
on master 2019-08-20_20-59-00

Timestamp/Logs
--------------
[2019-08-27 09:00:16,712] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-08-27 09:00:19,163] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+--------------+--------+-------------------------------------------------------------+-------+----------------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+--------------+--------+-------------------------------------------------------------+-------+----------------------+
| 9dd1ee5e-271d-4fec-85e8-fd3d6fea4938 | tenant1-vm-1 | ACTIVE | tenant1-mgmt-net=192.168.100.219; tenant1-net1=172.16.1.141 | | flavor-default-size2 |
+--------------------------------------+--------------+--------+-------------------------------------------------------------+-------+----------------------+
[sysadmin@controller-1 ~(keystone_admin)]$
[2019-08-27 09:00:19,163] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-27 09:00:19,266] 423 DEBUG MainThread ssh.expect :: Output:
0
[sysadmin@controller-1 ~(keystone_admin)]$
[2019-08-27 09:00:19,267] 1654 DEBUG MainThread network_helper._get_net_ips_for_vms:: targeted ips for vm: ['192.168.100.219']
[2019-08-27 09:00:19,267] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'9dd1ee5e-271d-4fec-85e8-fd3d6fea4938': ['192.168.100.219']}
[2019-08-27 09:00:19,267] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.100.219 from host 128.224.186.181
[2019-08-27 09:00:19,267] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-27 09:00:19,267] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.100.219'
[2019-08-27 09:00:22,369] 423 DEBUG MainThread ssh.expect :: Output:
PING 192.168.100.219 (192.168.100.219) 56(84) bytes of data.
From 10.10.100.1 icmp_seq=1 Destination Host Unreachable
From 10.10.100.1 icmp_seq=2 Destination Host Unreachable
From 10.10.100.1 icmp_seq=3 Destination Host Unreachable

[sysadmin@controller-1 ~(keystone_admin)]$ source ./openrc.admin
[sysadmin@controller-1 ~(keystone_admin)]$ neutron router-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+
| id | name | tenant_id | external_gateway_info | distributed | ha |
+--------------------------------------+----------------+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+
| 7992dd8e-6243-4cce-a3c7-1d6a3aeb9145 | tenant1-router | a56940f0c987491bb3c04369ad8e6c44 | {"network_id": "fe0f0fc0-309f-4b9e-8906-71be1d2cbe65", "enable_snat": false, "external_fixed_ips": [{"subnet_id": "b7450285-2b4c-4cb6-aa3e-67fbb6a39efb", "ip_address": "10.10.100.2"}]} | True | False |
| a44ac5e4-e2af-47db-80f7-e7cc7e8623a1 | tenant2-router | 4b452ac33f5b47e1a8a422851671fd44 | {"network_id": "fe0f0fc0-309f-4b9e-8906-71be1d2cbe65", "enable_snat": false, "external_fixed_ips": [{"subnet_id": "b7450285-2b4c-4cb6-aa3e-67fbb6a39efb", "ip_address": "10.10.100.3"}]} | False | False |
+--------------------------------------+----------------+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+

ppeng@tis-lab-nat-box:~$ ping 10.10.100.2
PING 10.10.100.2 (10.10.100.2) 56(84) bytes of data.
From 10.10.100.1 icmp_seq=1 Destination Host Unreachable
From 10.10.100.1 icmp_seq=2 Destination Host Unreachable
From 10.10.100.1 icmp_seq=3 Destination Host Unreachable

--- 10.10.100.2 ping statistics ---
6 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4999ms
pipe 3
ppeng@tis-lab-nat-box:~$ ping 10.10.100.3
PING 10.10.100.3 (10.10.100.3) 56(84) bytes of data.

--- 10.10.100.3 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1008ms
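
The gateway addresses pinged above were read by hand from the external_gateway_info column. As a rough sketch only (the helper name gateway_ips is hypothetical and not part of the test framework), the same addresses could be pulled out of the `neutron router-list` table text like this, assuming the column order shown in the output above:

```python
import json


def gateway_ips(router_list_output):
    """Map router name -> list of external gateway IPs.

    Assumes the table layout shown in this report:
    id | name | tenant_id | external_gateway_info | distributed | ha
    """
    result = {}
    for line in router_list_output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, prompts and warnings
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 4 or not cells[3].startswith("{"):
            continue  # header row or router without a gateway
        try:
            info = json.loads(cells[3])
        except json.JSONDecodeError:
            continue  # truncated or malformed row
        result[cells[1]] = [
            ip["ip_address"] for ip in info.get("external_fixed_ips", [])
        ]
    return result
```

Each returned address can then be pinged from the natbox in turn.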

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :
Yang Liu (yliu12) on 2019-08-27
summary: - router external gateways unreachable
+ neutron router external gateways unreachable
Numan Waheed (nwaheed) on 2019-08-28
tags: added: stx.retestneeded
Dariush Eslimi (deslimi) on 2019-08-28
Changed in starlingx:
assignee: nobody → Forrest Zhao (forrest.zhao)
importance: Undecided → High
status: New → Triaged
tags: added: stx.3.0 stx.networking
Changed in starlingx:
assignee: Forrest Zhao (forrest.zhao) → YaoLe (yaole)
Yosief Gebremariam (ygebrema) wrote :

A similar issue was observed in BUILD_ID="r/stx.2.0":
Lab: R720-1_2

ygebrema@tis-lab-nat-box:~$ ping 192.168.10.3
PING 192.168.10.3 (192.168.10.3) 56(84) bytes of data.
From 192.168.10.1 icmp_seq=1 Destination Host Unreachable
From 192.168.10.1 icmp_seq=2 Destination Host Unreachable
From 192.168.10.1 icmp_seq=3 Destination Host Unreachable

Logs are attached.

YaoLe (yaole) on 2019-11-13
Changed in starlingx:
status: Triaged → In Progress
Peng Peng (ppeng) wrote :

The issue was reproduced on
Lab: WCP_76_77
Load: 2019-11-18_20-00-00

[sysadmin@controller-1 ~(keystone_admin)]$ date
Tue Nov 19 19:11:39 UTC 2019
[sysadmin@controller-1 ~(keystone_admin)]$ neutron router-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------+----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+
| id | name | tenant_id | external_gateway_info | distributed | ha |
+--------------------------------------+----------------+----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+
| 89f1364a-f7c6-4413-a679-575f7d0952b8 | tenant2-router | 2c33168aeaa84ceda674ddaed7f83305 | {"network_id": "f106ec7d-b786-49af-ac9f-105f4b54867c", "enable_snat": false, "external_fixed_ips": [{"subnet_id": "20cb6bf4-1201-4f54-a301-745565e6eb7a", "ip_address": "192.168.41.3"}]} | True | False |
| f7794288-c825-4cbf-82e5-8d044e080aff | tenant1-router | 70a67b834735446cb67ab72638786f9b | {"network_id": "f106ec7d-b786-49af-ac9f-105f4b54867c", "enable_snat": false, "external_fixed_ips": [{"subnet_id": "20cb6bf4-1201-4f54-a301-745565e6eb7a", "ip_address": "192.168.41.2"}]} | False | False |
+--------------------------------------+----------------+----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+

ppeng@tis-lab-nat-box:~$ ping 192.168.41.2
PING 192.168.41.2 (192.168.41.2) 56(84) bytes of data.

--- 192.168.41.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1008ms

ppeng@tis-lab-nat-box:~$ ping 192.168.41.3
PING 192.168.41.3 (192.168.41.3) 56(84) bytes of data.

--- 192.168.41.3 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

ppeng@tis-lab-nat-box:~$
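
The ping summaries in this report come in two shapes: with an "+N errors" field (Destination Host Unreachable replies) and without it (silent loss). A minimal sketch for classifying either shape follows; the function names are hypothetical and the regular expression only covers the summary format shown in these transcripts:

```python
import re

# Matches e.g. "6 packets transmitted, 0 received, +3 errors, 100% packet loss"
_STATS = re.compile(
    r"(\d+) packets transmitted, (\d+) received"
    r"(?:, \+(\d+) errors)?, (\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output):
    """Return the ping statistics as a dict, or None if no summary line is found."""
    m = _STATS.search(output)
    if m is None:
        return None
    sent, received, errors, loss = m.groups()
    return {
        "sent": int(sent),
        "received": int(received),
        "errors": int(errors or 0),  # field is absent when no ICMP errors came back
        "loss_percent": float(loss),
    }


def is_reachable(output):
    """True only if a summary line exists and at least one reply was received."""
    stats = parse_ping_summary(output)
    return bool(stats) and stats["received"] > 0
```

Treating a missing summary as unreachable is a deliberate choice here, since a ping stopped before it prints statistics gives no evidence of connectivity.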

Peng Peng (ppeng) wrote :

The issue is also reproduced on
Lab:wcp_3-6
load: 2019-11-17_20-00-00

Yang Liu (yliu12) wrote :

Just to clarify: in comments #3 and #4, the neutron routers' external gateways had been unreachable since the fresh install. No host reboot was involved.

YaoLe (yaole) wrote :

Hi, Peng

Which guide did you use to set up the natbox and StarlingX?

The ping replies come from 10.10.100.1. What is that IP?

Could you please show the output of 'route -n'?

Thanks

YaoLe (yaole) wrote :

Also, please run 'openstack subnet show $SUBNET_ID' to print the info for the subnet listed in external_gateway_info.

Peng Peng (ppeng) wrote :

Issue was reproduced on
R720-3-7
2019-11-19_20-00-00

[sysadmin@controller-0 ~(keystone_admin)]$ neutron router-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------+----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+
| id | name | tenant_id | external_gateway_info | distributed | ha |
+--------------------------------------+----------------+----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+
| 3f39af1d-e691-48b4-a3a8-49f60da3db4d | tenant1-router | 85403e6228ab40b4b5b30f18ae60c599 | {"network_id": "eb6237df-20e6-42b1-91eb-2c07d61ebb4a", "enable_snat": false, "external_fixed_ips": [{"subnet_id": "de81659f-28bd-46ef-ba8a-3eef4bdfeff9", "ip_address": "192.168.13.2"}]} | False | False |
| cee41cb9-af1f-4f21-8f37-cd0de52a2621 | tenant2-router | d1a7c6f6aae6407993c2632133e97918 | {"network_id": "eb6237df-20e6-42b1-91eb-2c07d61ebb4a", "enable_snat": false, "external_fixed_ips": [{"subnet_id": "de81659f-28bd-46ef-ba8a-3eef4bdfeff9", "ip_address": "192.168.13.3"}]} | False | False |
+--------------------------------------+----------------+--

[sysadmin@controller-0 ~(keystone_admin)]$ openstack subnet list | grep external
| de81659f-28bd-46ef-ba8a-3eef4bdfeff9 | external-subnet0 | eb6237df-20e6-42b1-91eb-2c07d61ebb4a | 192.168.13.0/24 |
[sysadmin@controller-0 ~(keystone_admin)]$ openstack subnet show de81659f-28bd-46ef-ba8a-3eef4bdfeff9
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| allocation_pools | 192.168.13.2-192.168.13.254 |
| cidr | 192.168.13.0/24 |
| created_at | 2019-11-20T06:40:55Z ...


Peng Peng (ppeng) wrote :
Ghada Khalil (gkhalil) wrote :

Raising the priority to critical.
It appears that neutron ports are completely down with OpenStack Train and there is no connectivity at all.

From Yang Liu:
[sysadmin@controller-0 ~(keystone_admin)]$ openstack port list --router=tenant1-router
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+
| ID | Name | MAC Address | Fixed IP Addresses | Status |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+
| 103533f0-ce1b-4b99-afc9-86340a2b960a | | fa:16:3e:30:9a:73 | ip_address='192.168.113.65', subnet_id='67598367-e38f-4025-ae9c-f297ab339bfb' | DOWN |
| 95d298c5-a22f-43f7-a853-d3144099235e | | fa:16:3e:4e:10:9d | ip_address='192.168.113.33', subnet_id='a90bf1c0-2685-49b8-b5ae-1547c412a1b4' | DOWN |
| 9d2d676a-e788-48ea-b910-95c85f5e4037 | | fa:16:3e:af:52:f1 | ip_address='192.168.113.1', subnet_id='c782fd8c-41eb-48c3-9ac7-aa70f665351a' | DOWN |
| e934d6e5-bd11-4af7-b531-70f4ef500771 | | fa:16:3e:ca:60:9d | ip_address='192.168.13.2', subnet_id='de81659f-28bd-46ef-ba8a-3eef4bdfeff9' | DOWN |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+

@Ada, As previously discussed, we really need the daily sanity to cover some basic VM connectivity testing. This should have been caught as part of the pre-submission sanity.

Changed in starlingx:
importance: High → Critical
Joseph Richard (josephrichard) wrote :

It looks like openstack-helm is setting conf.neutron.oslo_concurrency.lock_path=/var/lib/neutron/tmp, which overrides our default of conf.neutron.DEFAULT.lock_path=/var/run/neutron/lock (set through the deprecated lock path option).

The stx-openstack armada manifest should be updated to set 'lock_path: /var/run/neutron/lock' under oslo_concurrency rather than under DEFAULT.

https://opendev.org/starlingx/openstack-armada-app/src/branch/master/stx-openstack-helm/stx-openstack-helm/manifests/manifest.yaml#L1403
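
To make the proposed change concrete, here is a small sketch that treats the override values as a nested dict mirroring the dotted names above (conf.neutron.DEFAULT.lock_path and conf.neutron.oslo_concurrency.lock_path). The dict layout and the helper name move_lock_path are assumptions for illustration; the real manifest structure should be checked at the link above.

```python
import copy


def move_lock_path(values, new_path="/var/run/neutron/lock"):
    """Return a copy of the values with lock_path set under oslo_concurrency.

    Any lock_path left under DEFAULT is dropped, since that is the
    deprecated location described in the comment above.
    """
    out = copy.deepcopy(values)  # leave the caller's dict untouched
    neutron = out.setdefault("conf", {}).setdefault("neutron", {})
    neutron.get("DEFAULT", {}).pop("lock_path", None)
    neutron.setdefault("oslo_concurrency", {})["lock_path"] = new_path
    return out


# Hypothetical "before" values, shaped after the dotted option names.
before = {"conf": {"neutron": {"DEFAULT": {"lock_path": "/var/run/neutron/lock"}}}}
after = move_lock_path(before)
```

After the call, `after` carries the path under oslo_concurrency only, so the chart's own oslo_concurrency value would no longer take precedence.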

Ghada Khalil (gkhalil) wrote :

Just to clarify, this LP will be used to address the recent issue introduced in openstack train. The original occurrence on stein will not be investigated.

tags: added: stx.distro.openstack
Changed in starlingx:
assignee: YaoLe (yaole) → yong hu (yhu6)
Ghada Khalil (gkhalil) wrote :

A simple way to reproduce the issue is as follows:
- Bring up a system
- Apply the stx-openstack application
- Check the status of the neutron ports. They will be down.
- Check the logs in the neutron & nova pods. There are lots of errors related to oslo:
       kubectl -n openstack logs neutron-l3-agent-controller-0-937646f6-rkwmk
       kubectl -n openstack logs nova-compute-controller-0-937646f6-gql98 nova-compute
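
The port status check in the steps above can be scripted. The sketch below parses the plain table printed by `openstack port list --router=<name>` as shown earlier in this report and returns the rows whose Status is DOWN. The helper name down_ports is hypothetical, and the parser assumes no cell contains a "|" character:

```python
def down_ports(port_list_output):
    """Return one dict per port row whose Status column is DOWN."""
    header = None
    rows = []
    for line in port_list_output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip table borders and shell prompts
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table row holds the column names
            continue
        row = dict(zip(header, cells))
        if row.get("Status") == "DOWN":
            rows.append(row)
    return rows
```

On an affected system every router port is expected to appear in the result; an empty list suggests the ports came up.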

yong hu (yhu6) wrote :

@Ghada, agreed this is a critical issue to fix.

Since this issue was reported before the Train upgrade, what specifically is "the recent issue introduced in openstack train"?

zhipeng liu (zhipengs) wrote :

The patch below has been submitted.
https://review.opendev.org/#/c/695342/

With the patch applied, the errors below are no longer seen.
2019-11-20 03:09:00.172 24 ERROR neutron.agent.linux.iptables_manager [-] Failure applying iptables rules: OSError: [Errno 30] Read-only file system: '/var/lib/neutron/tmp'
2019-11-20 03:09:30.147 24 ERROR neutron.agent.l3.agent raise l3_exc.IpTablesApplyException(msg)
2019-11-20 03:09:30.147 24 ERROR neutron.agent.l3.agent IpTablesApplyException: Failure applying iptables rules

I can see that the router update finished in my test log.
2019-11-21 02:17:56.351 18 INFO neutron.agent.l3.agent [-] Finished a router update for 59e7e8e4-46e3-4a69-911e-0f0032457c29, update_id 3df534be-dc07-4829-ab28-1eef65aabb56.
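
For checking other pods or builds for the same symptom, a rough sketch of a log filter is given below. It assumes the line layout seen in the excerpts above (timestamp, process id, level, logger name, message); the function name lock_path_errors is hypothetical.

```python
import re

# Layout assumed from the excerpts: "<date> <time.ms> <pid> <LEVEL> <logger> <message>"
_LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \d+ (\w+) (\S+) (.*)$"
)


def lock_path_errors(log_text):
    """Return (timestamp, logger, message) for ERROR lines that mention
    a read-only file system, the signature of the lock_path problem."""
    hits = []
    for line in log_text.splitlines():
        m = _LOG_LINE.match(line.strip())
        if m is None:
            continue  # not a log record in the expected layout
        timestamp, level, logger, message = m.groups()
        if level == "ERROR" and "Read-only file system" in message:
            hits.append((timestamp, logger, message))
    return hits
```

The text passed in would typically be the output of a `kubectl -n openstack logs <pod>` call on the l3-agent pod.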

Changed in starlingx:
assignee: yong hu (yhu6) → zhipeng liu (zhipengs)
zhipeng liu (zhipengs) wrote :

I also checked all routers and ports; they are active!

Ghada Khalil (gkhalil) wrote :

@Yong, Sorry for the confusion. At first, we thought the issue in stein (originally reported in this LP) was the same as train, but they are not. As per my comments above, we will only focus on the train issue.

Reviewed: https://review.opendev.org/695342
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=8a98e4888d48d0523c670f48bd8bad12ab732a9b
Submitter: Zuul
Branch: master

commit 8a98e4888d48d0523c670f48bd8bad12ab732a9b
Author: zhipengl <email address hidden>
Date: Thu Nov 21 19:18:57 2019 +0800

    Fix the issue of neutron router external gateways unreachable

    The configuration item "conf.neutron.DEFAULT.lock_path" is not
    used anymore, we need to override
    "conf.neutron.oslo_concurrency.lock_path" to
    /var/run/neutron/lock

    Verified that in neutron-l3-agent-controller-0
    and nova-compute-controller-0, not see lots of errors anymore.
    Router update finished in neutron.agent.l3.agent

    closes-Bug: #1841660

    Change-Id: I9c62872d86ba8f92cb8380181bf91389767cba09
    Signed-off-by: zhipengl <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released

Tested on Build_ID="20191122T023000Z"

controller-0:~$ neutron router-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+-----------------+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+
| id | name | tenant_id | external_gateway_info | distributed | ha |
+--------------------------------------+-----------------+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+
| 4d9acfbe-3ede-4a1f-bee1-90e50725ed0c | private-router0 | 99c4e2586c71446caace508f18b1c9e1 | {"network_id": "1be29e40-e183-4a5e-be4e-e5dad9e29f01", "enable_snat": false, "external_fixed_ips": [{"subnet_id": "22bcb85a-2506-402f-9bd7-b13e3d07adae", "ip_address": "192.168.1.35"}]} | False | False |
| ba3f9cc7-5d3d-44df-b27d-076caaa3f914 | public-router0 | 99c4e2586c71446caace508f18b1c9e1 | {"network_id": "1be29e40-e183-4a5e-be4e-e5dad9e29f01", "enable_snat": false, "external_fixed_ips": [{"subnet_id": "22bcb85a-2506-402f-9bd7-b13e3d07adae", "ip_address": "192.168.1.206"}]} | False | False |
+--------------------------------------+-----------------+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------+
controller-0:~$ ping 192.168.1.35
PING 192.168.1.35 (192.168.1.35) 56(84) bytes of data.
From 192.168.200.1 icmp_seq=1 Destination Net Unreachable
From 192.168.200.1 icmp_seq=2 Destination Net Unreachable
From 192.168.200.1 icmp_seq=3 Destination Net Unreachable
From 192.168.200.1 icmp_seq=4 Destination Net Unreachable
^Z
[6]+ Stopped(SIGTSTP) ping 192.168.1.35
controller-0:~$ ping 192.168.1.206
PING 192.168.1.206 (192.168.1.206) 56(84) bytes of data.
From 192.168.200.1 icmp_seq=1 Destination Net Unreachable
From 192.168.200.1 icmp_seq=2 Destination Net Unreachable
From 192.168.200.1 icmp_seq=3 Destination Net Unreachable
From 192.168.200.1 icmp_seq=4 Destination Net Unreachable

zhipeng liu (zhipengs) wrote :

Hi Maria,

The neutron port down issue in Train has already been fixed by my patch.
However, the original ping issue is still there.
As per Ghada's email, could you raise a new LP to track that issue?
I think Yao Le will follow up on it.

Thanks!
Zhipeng

Peng Peng (ppeng) wrote :

Not seeing this issue recently

Yang Liu (yliu12) on 2019-12-11
tags: removed: stx.retestneeded