Intermittently all VMs' floating IP connections are disconnected, and can be reconnected after 5-6 minutes

Bug #1907175 reported by hojin kim
This bug affects 1 person
Affects: neutron
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

Current configuration: 49-node CentOS 7.8 (kernel 3.10.0-1127.el7.x86_64)
                       kolla-ansible 9.2.1 (openvswitch 2.12.0 / neutron-server 15.1.0)

Phenomenon: The floating IP connections are lost, and connectivity becomes possible again after 5-6 minutes. This happens to all VMs on the nodes.
Internal IP connectivity is not affected, and restarting openvswitch_vswitchd while the failure is occurring resolves the problem.
The public network, physnet1 (172.29.75.0~172.29.84.0), is carried over an LACP bond (bond mode 4) as a VLAN, and the tenant network uses VXLAN (DVR is enabled).
A ping test with tcpdump shows that the ICMP request reaches the node hosting the VM, but the VM does not respond.
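
For reference, the kind of capture used for the ping test can be reproduced roughly as follows (the bond name and the VM's qvo/tap device are assumptions and will differ per node; 172.29.75.11 is the affected floating IP mentioned later in this report):

  # On the compute node hosting the VM: watch ICMP arriving on the physnet1 bond
  tcpdump -nei bond0 'icmp and host 172.29.75.11'
  # On the same node: watch the VM's port towards br-int (qvoXXXXXXXX / tapXXXXXXXX)
  tcpdump -nei qvoXXXXXXXX icmp

If the request is seen on the bond but never on the VM's port, the drop is happening inside the host's OVS/bridge path.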

=== ml2_conf.ini ============================================================
[root@2020c5lut006 neutron-server]# cat ml2_conf.ini
[ml2]
type_drivers = flat,vlan,vxlan
tenant_network_types = vxlan
mechanism_drivers = openvswitch,baremetal,l2population
extension_drivers = qos,port_security
path_mtu = 9000

[ml2_type_vlan]
network_vlan_ranges = physnet1

[ml2_type_flat]
flat_networks = *

[ml2_type_vxlan]
vni_ranges = 1:1000

[securitygroup]
firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver

[agent]
tunnel_types = vxlan
l2_population = true
arp_responder = true
enable_distributed_routing = True
extensions = qos

[ovs]
bridge_mappings = physnet1:br-ex,physnet2:br-cephfs,physnet3:br-api
datapath_type = system
ovsdb_connection = tcp:127.0.0.1:6640
local_ip = 20.21.2.101

==neutron.conf ====================================================================
[root@2020c5lut006 neutron-server]# cat neutron.conf
[DEFAULT]
debug = False
log_dir = /var/log/kolla/neutron
use_stderr = False
bind_host = 20.21.1.101
bind_port = 9696
api_paste_config = /usr/share/neutron/api-paste.ini
endpoint_type = internalURL
api_workers = 5
metadata_workers = 5
rpc_workers = 3
rpc_state_report_workers = 3
metadata_proxy_socket = /var/lib/neutron/kolla/metadata_proxy
interface_driver = openvswitch
allow_overlapping_ips = true
core_plugin = ml2
service_plugins = firewall_v2,qos,router
dhcp_agents_per_network = 2
l3_ha = true
max_l3_agents_per_router = 3
transport_url = rabbit://openstack:mMZl0hvZ5KSGQgfqtAbbBRkpMfEbzIKjDUHu8NSd@20.21.1.101:5672,openstack:mMZl0hvZ5KSGQgfqtAbbBRkpMfEbzIKjDUHu8NSd@20.21.1.102:5672,openstack:mMZl0hvZ5KSGQgfqtAbbBRkpMfEbzIKjDUHu8NSd@20.21.1.103:5672//
router_distributed = True
ipam_driver = internal
global_physnet_mtu = 9000

[nova]
auth_url = http://20.21.1.100:35357
auth_type = password
project_domain_id = default
user_domain_id = default
region_name = RegionOne
project_name = service
username = nova
password = rChwHtVHMqLK3AHRkKfZ7rxiQ74Am8EJHWvbEyQt
endpoint_type = internal

[oslo_middleware]
enable_proxy_headers_parsing = True

[oslo_concurrency]
lock_path = /var/lib/neutron/tmp

[agent]
root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf

[database]
connection = mysql+pymysql://neutron:PZl2BQm7LesapA6Ks9lqOuUc6DU4kRHeSWwPNvH1@20.21.1.100:3306/neutron
max_retries = -1

[keystone_authtoken]
www_authenticate_uri = http://20.21.1.100:5000
auth_url = http://20.21.1.100:35357
auth_type = password
project_domain_id = default
user_domain_id = default
project_name = service
username = neutron
password = XjxBaFwek0aaKj0rLaqeUXqfp7lrNk5sdkIFGAeE
memcache_security_strategy = ENCRYPT
memcache_secret_key = w6eOcER3TlZzidSL7wjea2rnbMWGUlV7BiO3ls3J
memcached_servers = 20.21.1.101:11211,20.21.1.102:11211,20.21.1.103:11211

[oslo_messaging_notifications]
transport_url = rabbit://openstack:mMZl0hvZ5KSGQgfqtAbbBRkpMfEbzIKjDUHu8NSd@20.21.1.101:5672,openstack:mMZl0hvZ5KSGQgfqtAbbBRkpMfEbzIKjDUHu8NSd@20.21.1.102:5672,openstack:mMZl0hvZ5KSGQgfqtAbbBRkpMfEbzIKjDUHu8NSd@20.21.1.103:5672//
driver = noop

[octavia]
base_url = http://20.21.1.100:9876

[placement]
auth_type = password
auth_url = http://20.21.1.100:35357
username = placement
password = s1VxNvJeh8CDOjeqa6hi8eF0QhQdDBp12SJdyfll
user_domain_name = Default
project_name = service
project_domain_name = Default
os_region_name = RegionOne
os_interface = internal

[privsep]
helper_command = sudo neutron-rootwrap /etc/neutron/rootwrap.conf privsep-helper

====================================================================

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Please give us some more information about this bug:
1. Where exactly are packets lost? Is the ICMP request getting to the VM and the VM's reply being dropped somewhere (where exactly?), or does the request not even reach the VM (and where exactly is it dropped)?

2. You said that a restart of openvswitch_vswitchd helps - can you check whether a restart of neutron-ovs-agent helps in the same way?

3. Can you compare the OpenFlow rules on br-int and the physical bridge br-ex while the issue is happening and after it is fixed? Are there any changes in those flows? (Example commands are sketched below.)

4. Do you see any errors in neutron-ovs-agent or neutron-l3-agent on the host at the time the issue occurs?
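
For reference, on a kolla-ansible deployment the checks above can be gathered roughly like this (container names assume the default kolla naming, and the log file names follow the log_dir configured above):

  # 2. restart only the OVS agent (instead of openvswitch_vswitchd)
  docker restart neutron_openvswitch_agent
  # 3. dump the OpenFlow rules on br-int and br-ex, both during and after the outage, then diff them
  docker exec openvswitch_vswitchd ovs-ofctl dump-flows br-int > br-int.flows
  docker exec openvswitch_vswitchd ovs-ofctl dump-flows br-ex  > br-ex.flows
  # 4. look for errors in the agent logs around the time of the outage
  grep -i error /var/log/kolla/neutron/neutron-openvswitch-agent.log
  grep -i error /var/log/kolla/neutron/neutron-l3-agent.log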

tags: added: l3-dvr-backlog
Revision history for this message
hojin kim (khoj) wrote :

1. I ran tcpdump between all the VMs and all the hosts, and found that the ICMP requests and replies disappear on the host. The request does not even reach the VM.

2. Only restarting openvswitch_vswitchd solved the problem; restarting neutron-openvswitch-agent did not solve it.

3. I checked the OpenFlow rules, but I could not do so while the issue was happening. If it occurs again, I will check them; right now there is no problem.

affected server : kdash-portal01
floating IP : 172.29.75.11
internal IP : 20.21.21.7

I checked the status while everything is normal. If the problem occurs again, we will gather this information again.

(virtenv) [root@2020c5lut005 ~]# openstack server show 1efccd39-68bd-4ec6-9f27-5a3604956cb8
+-------------------------------------+----------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | AUTO |
| OS-EXT-AZ:availability_zone | dash_zone |
| OS-EXT-SRV-ATTR:host | 2020c5lkt070 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | 2020c5lkt070 |
| OS-EXT-SRV-ATTR:instance_name | instance-0000047f |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2020-11-25T07:45:28.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | dash-network=20.21.21.75, 172.29.75.11 |
| config_drive | True |
| created | 2020-11-02T07:40:12Z |
| flavor | c04r16os50 (33bf3602-ae10-4d2b-aeff-ba0500fb0ec3) |
| hostId | 4aad83584416e1459b112b0ce665895c9eef3e8a541b2a244e924c60 |
| id | 1efccd39-68bd-4ec6-9f27-5a3604956cb8 |
| image | |
| key_name | None |
| name | kdash-portal01 |
| progress | 0 |
| project_id | e347a41cea154277867246e...

Revision history for this message
hojin kim (khoj) wrote :

We found that there are some duplicated IPs on the L3 HA ports, and we brought the duplicated ports down.

(virtenv) [root@2020c5lut005 comadm]# openstack port list |grep 169.254.192.241
| 20e6b396-36c9-447a-8d6b-07d37e14eb1f | HA port tenant 39443b1d1a07487fbfcff2c950133640 | fa:16:3e:12:3f:bc | ip_address='169.254.192.241', subnet_id='8ece962c-f76d-4e50-880c-31b8d5753f8a' | DOWN |
| 4b529a1b-8113-40a7-9217-802a35c4b393 | HA port tenant ed8813a5a39845a687390d30e45087f9 | fa:16:3e:42:03:f0 | ip_address='169.254.192.241', subnet_id='4de327d6-7826-48b4-bfbf-30abd173a455' | ACTIVE |
(virtenv) [root@2020c5lut005 comadm]# openstack port list |grep 169.254.193.61
| 70379827-7662-4fe8-a65b-93bd4f2dce09 | HA port tenant ecde2fec3ff14564b5d2dc5e8cd182ea | fa:16:3e:ff:28:f9 | ip_address='169.254.193.61', subnet_id='fd248407-3e9d-49bf-99f6-fc754e2ef8fa' | DOWN |
| 73a5f186-ae78-4119-acb7-a641e4effa14 | HA port tenant f0eaa8a17b914b8b93826a3e24be81eb | fa:16:3e:44:49:c7 | ip_address='169.254.193.61', subnet_id='3687174b-96d5-4129-9026-07429bf38795' | ACTIVE |

I can't understand why the duplicated IPs were allocated.

We will check again.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi,

Thanks for the info.

Regarding comment #2:

Ad. 1. I don't understand exactly - you said that "all icmp request and reply disappeared in host" - how can there be a reply if the request didn't even get to the VM?

Ad. 2. This leads me to think that this may be an Open vSwitch issue - if it were a problem with Neutron and how Neutron configures things on the host, a restart of the Neutron agents should fix the problem as well.

Regarding comment #3:
That is normal - HA networks are created per tenant and they are isolated tenant networks, so you can have the same IPs in different networks. That is not a problem at all.
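
For reference, this can be confirmed by checking that the two ports sharing a 169.254.x address belong to different networks, e.g. with the port IDs from the listing above:

  openstack port show 20e6b396-36c9-447a-8d6b-07d37e14eb1f -c network_id -c device_owner
  openstack port show 4b529a1b-8113-40a7-9217-802a35c4b393 -c network_id -c device_owner

The network_id values should differ, since each tenant gets its own HA network.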

Revision history for this message
hojin kim (khoj) wrote :

1. The VM sends a ping to another VM, and the other VM replies. I found that the reply arrives on the host, but the VM never receives it.

Even after resolving the duplicated IP issue, the problem was not solved.

I read an article saying that using L3 HA and DVR together can be a problem, so I disabled L3 HA on the routers (a sketch of how this can be done is below).

We will keep investigating and update this report.
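
For reference, a rough sketch of disabling L3 HA (the flags assume a recent python-openstackclient; the router has to be administratively down while the flag is changed, and l3_ha in neutron.conf only affects newly created routers):

  # stop creating new HA routers: in neutron.conf [DEFAULT] set l3_ha = false
  # convert an existing router from HA to non-HA
  openstack router set --disable <router-id>
  openstack router set --no-ha <router-id>
  openstack router set --enable <router-id>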

Revision history for this message
hojin kim (khoj) wrote :

After removing L3 HA, the problem was not resolved.
In fact, after removing L3 HA the disruption suddenly got worse: after about 3 minutes of service, connectivity was cut off for about 5 minutes, and this pattern continued.

Even when I restarted ovs-vswitchd, the ping died again after 1-2 seconds.
At present this problem occurs in only one of the four zones, and we are going to proceed in the direction of removing DVR.

Before removing DVR, we will run tests on the network switches.

The fabric is an L2 leaf-spine structure with 10 leaves and 2 spines, all Dell switches.
It uses LACP and VLT, and all L2 links are in trunk mode.

We suspect the switch MAC aging time, so we will change that setting and check the impact, and we will operate the L3 switch as a single unit for a while.

Revision history for this message
hojin kim (khoj) wrote :

We found that the hardware network had no problem.
We checked the bug "https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652"
and will apply the same fix (Open vSwitch 2.12.1).
There is no RPM of that version for CentOS 7, so we will build the RPM ourselves and keep monitoring. Thanks.
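
For reference, a rough sketch of building the RPMs on CentOS 7, based on the upstream Open vSwitch packaging docs (the URL, spec path and make target are taken from those docs, not from this report):

  yum install -y rpm-build yum-utils gcc make
  curl -O https://www.openvswitch.org/releases/openvswitch-2.12.1.tar.gz
  tar xzf openvswitch-2.12.1.tar.gz && cd openvswitch-2.12.1
  ./configure
  yum-builddep -y rhel/openvswitch-fedora.spec   # install the build dependencies from the spec
  make rpm-fedora                                # resulting RPMs land under rpm/rpmbuild/RPMS/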

Revision history for this message
hojin kim (khoj) wrote :

We applied it (Open vSwitch 2.12.1), and the problem was solved. Thanks.

LIU Yulong (dragon889)
Changed in neutron:
status: New → Won't Fix