losing connectivity to instance with FloatingIP randomly

Bug #1864963 reported by Sergey Yezhkov
This bug affects 4 people
Affects: neutron | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

I have a problem with randomly losing connectivity to instances via their floating IPs.
The instances are fully functional and can ping each other by private IP.
This broken network state appears and disappears randomly, but it often changes after creating or deleting an instance in the private network, or after restarting neutron-openvswitch-agent on the host where the instances live.
Usually I lose connectivity to all instances in the same private network on the same host. Instances on the same host but in a different network, or in the same network but on another host, work properly.
Live migration of an instance from one host to another usually restores connectivity to it.

No errors in neutron/nova/ovs logs.

In the broken network state:
tcpdump of ICMP packets on the instance tap port shows that requests reach the instance and that it answers them.
tcpdump on the router qr port shows only the requests, no answers.

In the working network state I see both requests and answers on the router qr port.
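
A minimal sketch of those checks (the tap/qr port names and router ID below are placeholders, not the real ones):
--
# on the compute host: ICMP on the instance tap port (requests and replies both visible)
tcpdump -ni tap1234abcd-56 icmp
# ICMP on the router qr port inside the qrouter namespace (only requests visible in the broken state)
ip netns exec qrouter-<router-id> tcpdump -ni qr-789def01-23 icmp
--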

I dumped the OVS flows in the broken and working network states, but I did not find any differences between them.
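
A sketch of one way to compare the dumps (the file names are examples; the packet/byte counters have to be stripped first or every line differs):
--
ovs-ofctl -O OpenFlow14 dump-flows br-int \
  | sed 's/duration=[^,]*, //; s/n_packets=[^,]*, //; s/n_bytes=[^,]*, //' > flows-working.txt
# reproduce the broken state, dump again to flows-broken.txt, then:
diff flows-working.txt flows-broken.txt
--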

ovs-appctl ofproto/trace for the ICMP answers looks the same in the working and broken states and shows the correct output router port.
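
A sketch of such a trace for the return ICMP packet (the in_port, MAC and IP values are placeholders):
--
ovs-appctl ofproto/trace br-int \
  in_port="tap1234abcd-56",icmp,icmp_type=0,nw_src=<instance-ip>,nw_dst=<client-ip>,dl_src=<instance-mac>,dl_dst=<qr-mac>
--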

Environment:
I use HA, DVR routers with OVS
OpenStack Train release installed by Kolla
Ubuntu Xenial
Neutron 15.0.2 (the problem started appearing after upgrading from Rocky to Train: first Neutron 15.0.0, then 15.0.1, and now 15.0.2)
OVS version 2.12.0
Changing the firewall driver from OVS hybrid to OVS native did not help

Please help me localize and troubleshoot this bug!

Sergey Yezhkov (yezhkov)
description: updated
tags: added: l3-dvr-backlog
Revision history for this message
LIU Yulong (dragon889) wrote :

I have some questions about the issue:
1. Does your deployment use centralized floating IPs? Are the L3 agents on your compute nodes all in "dvr" mode?
2. What is your tenant network type? VLAN, VXLAN, or something else?
3. What is your external network type?
4. What bond mode do the physical NICs carrying floating IP traffic on your compute hosts use? Mode 6?

And if you could paste the configs here, that would help the team a lot in tracking down the problem.

Revision history for this message
Sergey Yezhkov (yezhkov) wrote :

Thanks for your interest!

1.
L3 agent mode on controllers:
agent_mode = dvr_snat
On compute hosts:
agent_mode = dvr

2. tenant network type = vxlan

3. external network type = vlan

4. I'm not sure I understand the question, but I have a physical NIC 'itrunk' which is connected to br-ex in OVS.
This NIC is configured to pass any VLAN traffic, like a trunk port:
--
    Bridge br-ex
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        datapath_type: system
        Port phy-br-ex
            Interface phy-br-ex
                type: patch
                options: {peer=int-br-ex}
        Port itrunk
            Interface itrunk
--

My configs:

l3_agent.ini (for a compute node; on a controller the only difference is agent_mode = dvr_snat) --
[DEFAULT]
agent_mode = dvr
ha_vrrp_health_check_interval = 5

[agent]

[ovs]
ovsdb_connection = tcp:127.0.0.1:6640
--

ml2_conf.ini (same on compute and control nodes) --
[ml2]
type_drivers = vxlan,vlan,flat
tenant_network_types = vxlan,vlan,flat
mechanism_drivers = openvswitch,l2population
extension_drivers = qos,port_security,dns

[ml2_type_vlan]
network_vlan_ranges = vlans1:156:158,vlans1:163:165

[ml2_type_flat]
flat_networks = public1

[ml2_type_vxlan]
vni_ranges = 1:1000

[securitygroup]
firewall_driver = openvswitch

[agent]
tunnel_types = vxlan
l2_population = true
arp_responder = true
enable_distributed_routing = True
extensions = qos

[ovs]
bridge_mappings = public1:br-pub,vlans1:br-ex
datapath_type = system
ovsdb_connection = tcp:127.0.0.1:6640
local_ip = [...]
of_connect_timeout = 300
of_request_timeout = 300
of_inactivity_probe = 60
--

neutron.conf --
[DEFAULT]
debug = False
log_dir = /var/log/kolla/neutron
use_stderr = False
bind_host = [...]
bind_port = 9696
api_paste_config = /usr/share/neutron/api-paste.ini
endpoint_type = internalURL
api_workers = 5
metadata_workers = 5
rpc_workers = 3
rpc_state_report_workers = 3
metadata_proxy_socket = /var/lib/neutron/kolla/metadata_proxy
interface_driver = openvswitch
allow_overlapping_ips = true
core_plugin = ml2
service_plugins = qos,router
dhcp_agents_per_network = 2
l3_ha = true
max_l3_agents_per_router = 3
transport_url = rabbit://[...]
router_distributed = True
dns_domain = os.loc.
external_dns_driver = designate
ipam_driver = internal
rpc_response_timeout = 180

[nova]
auth_url = http://[...]:35357
auth_type = password
project_domain_id = default
user_domain_id = default
region_name = RegionOne
project_name = service
username = nova
password = [...]
endpoint_type = internal

[oslo_middleware]
enable_proxy_headers_parsing = True

[oslo_concurrency]
lock_path = /var/lib/neutron/tmp

[agent]
root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf

[database]
connection = mysql+pymysql://[...]/neutron
max_retries = -1

[keystone_authtoken]
www_authenticate_uri = http://[...]:5000
auth_url = http://[...]:35357
auth_type = password
project_domain_id = default
user_domain_id = default
project_name = service
username = neutron
password = [...]
memcache_security_strategy = ENCRYPT
memcache_secret_key = [...]
memcached_servers = [...]

[oslo_messaging_notifications]
transport_url = rabbit://[...]
driver = messagingv2
topics = notifications,...


Revision history for this message
Sergey Yezhkov (yezhkov) wrote :

Some additional information about the physical NIC bonding: I use mode 1 (active-backup).

--
auto ens1f1
iface ens1f1 inet manual
    bond-master itrunk
    bond-primary ens1f1
auto ens0f1
iface ens0f1 inet manual
    bond-master itrunk
    bond-primary ens1f1
auto itrunk
iface itrunk inet manual
    bond-mode active-backup
    bond-slaves none
    bond-primary ens1f1
    bond-primary-reselect always
    bond-downdelay 200
    bond-miimon 100
    bond-updelay 200
    hwaddress [...]
--

Revision history for this message
Pavel Szalbot (pavel-szalbot) wrote :

I have the same problem. My physical interface is an LACP bond, OpenStack is deployed manually, and the compute node is running CentOS 8.

I found out that the VM responds to pings of size 5000 (probably because multiple packets are sent) even when it does not respond to the standard 56-byte payload. However, this is the only traffic from the external network that the VM responds to. Private networking is fine.
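
For example, from a machine on the external network (the floating IP here is a placeholder):

$ ping -c 3 <floating-ip>          # no reply in the broken state
$ ping -c 3 -s 5000 <floating-ip>  # replies arrive, presumably because the 5000-byte payload is fragmented into multiple packets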

Traffic seems to be dropped in Open vSwitch. The ICMP reply is visible on the tap interface but does not show up on qr. conntrack does not show the connection unless the larger payload is used.

It seems pretty random. I have spent many hours on it; I have another compute node that works with the same setup, just with a slightly older kernel, but both are stock 4.18.

Revision history for this message
Pavel Szalbot (pavel-szalbot) wrote :

More observations: I have another self-service network from a different tenant running on the same compute node without problems. Instances from the problematic network run on other nodes without issues, although those use iptables_hybrid, except for the one node I mentioned, which works with the openvswitch firewall driver.

Kernel 4.18.0-193.28.1.el8_2.x86_64, neutron packages are 15.3.0-1.el8, openvswitch 2.12.0.

I restarted the whole compute node; it did not help.

Revision history for this message
Pavel Szalbot (pavel-szalbot) wrote :

The issue persists after upgrading to kernel 5.9.6.

Running only one instance on the compute node allowed me to identify the following:

$ ovs-ofctl -O OpenFlow14 dump-flows br-int --color --names --rsort=priority table=72

This line is the only one incrementing (watch -n1 -d):
cookie=0x46c2e1bc88f3da56, duration=273.104s, table=72, n_packets=3723, n_bytes=359585, priority=50,ct_state=+inv+trk actions=resubmit(,93)

It seems conntrack marks the reply as invalid. This can be checked with:

$ ip netns exec qrouter-9984c9f6-cf31-4fb4-9463-82db3c51f0ae conntrack -E

[NEW] tcp 6 120 SYN_SENT src=10.153.0.245 dst=X.X.X.X sport=51266 dport=22 [UNREPLIED] src=192.168.9.78 dst=10.153.0.245 sport=22 dport=51266

Same problem with ICMP. The security group allows IPv4 traffic from and to anywhere. The conntrack table is nowhere near full. iptables_hybrid works...
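
To see what happens to the packets matched by the ct_state=+inv+trk flow above, table 93 can be dumped the same way (a sketch; if I read the openvswitch firewall driver right, this is its dropped-traffic table):

$ ovs-ofctl -O OpenFlow14 dump-flows br-int --color --names --rsort=priority table=93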

Revision history for this message
hojin kim (khoj) wrote :

I have the same problem. My physical interface is an LACP bond, as in the case above.

Problem: intermittently the VM's floating IP connection is lost, and it becomes reachable again after 5-6 minutes.

Current configuration: 49 nodes, CentOS 7.8, kolla-ansible 9.2.1 (openvswitch 2.12.0)

# docker exec -it openvswitch_vswitchd /bin/bash
(openvswitch-vswitchd)]# ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.12.0
(neutron-server)$ neutron-server --version
neutron-server 15.1.0

Phenomenon: The floating IP connection is lost, and connections become possible again after 5-6 minutes. This happens on multiple nodes in rotation.
The internal IP connection is not lost, and if openvswitch_vswitchd is restarted during a failure, the problem is resolved.
The public network, physnet1 (172.29.75.0~172.29.84.0), is a VLAN carried over the LACP bond, and the tenant network is VXLAN (with DVR).
In a ping tcpdump test, the ping reaches the node hosting the VM, but the VM does not respond.

Bond mode = 4

Kernel version = 3.10.0-1127.el7.x86_64

=== ml2_conf.ini ============================================================
[root@2020c5lut006 neutron-server]# cat ml2_conf.ini
[ml2]
type_drivers = flat,vlan,vxlan
tenant_network_types = vxlan
mechanism_drivers = openvswitch,baremetal,l2population
extension_drivers = qos,port_security
path_mtu = 9000

[ml2_type_vlan]
network_vlan_ranges = physnet1

[ml2_type_flat]
flat_networks = *

[ml2_type_vxlan]
vni_ranges = 1:1000

[securitygroup]
firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver

[agent]
tunnel_types = vxlan
l2_population = true
arp_responder = true
enable_distributed_routing = True
extensions = qos

[ovs]
bridge_mappings = physnet1:br-ex,physnet2:br-cephfs,physnet3:br-api
datapath_type = system
ovsdb_connection = tcp:127.0.0.1:6640
local_ip = 20.21.2.101

==neutron.conf ====================================================================
[root@2020c5lut006 neutron-server]# cat neutron.conf
[DEFAULT]
debug = False
log_dir = /var/log/kolla/neutron
use_stderr = False
bind_host = 20.21.1.101
bind_port = 9696
api_paste_config = /usr/share/neutron/api-paste.ini
endpoint_type = internalURL
api_workers = 5
metadata_workers = 5
rpc_workers = 3
rpc_state_report_workers = 3
metadata_proxy_socket = /var/lib/neutron/kolla/metadata_proxy
interface_driver = openvswitch
allow_overlapping_ips = true
core_plugin = ml2
service_plugins = firewall_v2,qos,router
dhcp_agents_per_network = 2
l3_ha = true
max_l3_agents_per_router = 3
transport_url = rabbit://openstack:mMZl0hvZ5KSGQgfqtAbbBRkpMfEbzIKjDUHu8NSd@20.21.1.101:5672,openstack:mMZl0hvZ5KSGQgfqtAbbBRkpMfEbzIKjDUHu8NSd@20.21.1.102:5672,openstack:mMZl0hvZ5KSGQgfqtAbbBRkpMfEbzIKjDUHu8NSd@20.21.1.103:5672//
router_distributed = True
ipam_driver = internal
global_physnet_mtu = 9000

[nova]
auth_url = http://20.21.1.100:35357
auth_type = password
project_domain_id = default
user_domain_id = default
region_name = RegionOne
project_name = service
username = nova
password = rChwHtVHMqLK3AHRkKfZ7rxiQ74Am8EJHWvbEyQt
endpoint_type = internal

[oslo_middleware]
enable_proxy_headers_parsing = True

...

