Floating IPs broken after kernel upgrade to Centos/RHEL 7.5 - DNAT not working

Bug #1776778 reported by Arjun Baindur
This bug affects 8 people
Affects   Status    Importance   Assigned to   Milestone
neutron   Invalid   Undecided    Unassigned

Bug Description

Since upgrading to CentOS 7.5 (with kernel 3.10.0-862), floating IP functionality has been completely busted. Packets arrive inbound at the qrouter namespace from the fip namespace via the rfp interface, but they are not DNAT'd or routed: we see nothing going out the qr- interface. Outbound packets leaving the VM are fine, but all responses are again dropped inbound at the qrouter after arriving on rfp. It appears the DNAT rules in the nat table ("-t nat") within the qrouter are not being hit (packet counters are zero).

SNAT functionality works when we remove the floating IP from the VM (the VM can then ping outbound). So the problem seems isolated to DNAT, i.e. the qrouter receiving packets from fip?

We are able to reproduce this 100% consistently whenever we update our working CentOS 7.2 / CentOS 7.4 hosts to 7.5. Nothing changes except a "yum update". All routes, rules, and iptables are identical on a working older host vs. a broken CentOS 7.5 host.

I added some basic rules to log packets at the top of the PREROUTING chain in the raw, mangle, and nat tables, filtering either by my source IP or on all packets arriving on the rfp ingress interface (-i). While packet counters increment for raw and mangle, they remain at 0 for nat, indicating the nat table is not invoked for PREROUTING.
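
A minimal sketch of what such logging rules could look like, for anyone wanting to repeat the check. The exact rules used are not shown in this report; the helper name print_log_rules and the log prefixes are hypothetical. The function only prints the commands, so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): print one LOG rule per table,
# inserted at the top of PREROUTING and matching only the ingress interface.
print_log_rules() {
    ns=$1      # router namespace, e.g. qrouter-<router-uuid>
    iface=$2   # ingress interface inside the namespace, e.g. rfp-...
    for table in raw mangle nat; do
        printf 'ip netns exec %s iptables -t %s -I PREROUTING 1 -i %s -j LOG --log-prefix "%s-pre: "\n' \
            "$ns" "$table" "$iface" "$table"
    done
}

# Namespace and interface names taken from the output in this report.
print_log_rules qrouter-f48d5536-eefa-4410-b17b-1b3d14426323 rfp-f48d5536-e
```

After sending traffic, comparing the per-rule counters with "iptables -t <table> -L PREROUTING -v -n" in the namespace would show which tables the packets actually traverse.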

Floating IP = 10.8.17.52, Fixed IP = 192.168.94.9.

[root@centos7-neutron-template ~]# ip netns exec qrouter-f48d5536-eefa-4410-b17b-1b3d14426323 tcpdump -l -evvvnn -i rfp-f48d5536-e
tcpdump: listening on rfp-f48d5536-e, link-type EN10MB (Ethernet), capture size 262144 bytes
13:42:00.345440 7a:3b:f1:c7:5d:4e > aa:24:89:9e:c8:f0, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 1832, offset 0, flags [DF], proto ICMP (1), length 84)
    10.4.165.22 > 10.8.17.52: ICMP echo request, id 5771, seq 1, length 64
13:42:01.344047 7a:3b:f1:c7:5d:4e > aa:24:89:9e:c8:f0, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 1833, offset 0, flags [DF], proto ICMP (1), length 84)
    10.4.165.22 > 10.8.17.52: ICMP echo request, id 5771, seq 2, length 64
13:42:02.398300 7a:3b:f1:c7:5d:4e > aa:24:89:9e:c8:f0, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 1834, offset 0, flags [DF], proto ICMP (1), length 84)
    10.4.165.22 > 10.8.17.52: ICMP echo request, id 5771, seq 3, length 64
13:42:03.344345 7a:3b:f1:c7:5d:4e > aa:24:89:9e:c8:f0, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 1835, offset 0, flags [DF], proto ICMP (1), length 84)
    10.4.165.22 > 10.8.17.52: ICMP echo request, id 5771, seq 4, length 64
^C
4 packets captured
4 packets received by filter
0 packets dropped by kernel
[root@centos7-neutron-template ~]# ip netns exec qrouter-f48d5536-eefa-4410-b17b-1b3d14426323 tcpdump -l -evvvnn -i qr-295f9857-21
tcpdump: listening on qr-295f9857-21, link-type EN10MB (Ethernet), capture size 262144 bytes

***CRICKETS***

[root@centos7-neutron-template ~]# ip netns exec qrouter-f48d5536-eefa-4410-b17b-1b3d14426323 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: rfp-f48d5536-e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether aa:24:89:9e:c8:f0 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 169.254.106.114/31 scope global rfp-f48d5536-e
       valid_lft forever preferred_lft forever
    inet6 fe80::a824:89ff:fe9e:c8f0/64 scope link
       valid_lft forever preferred_lft forever
59: qr-295f9857-21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:3d:f1:12 brd ff:ff:ff:ff:ff:ff
    inet 192.168.94.1/24 brd 192.168.94.255 scope global qr-295f9857-21
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe3d:f112/64 scope link
       valid_lft forever preferred_lft forever

[root@centos7-neutron-template ~]# ip netns exec qrouter-f48d5536-eefa-4410-b17b-1b3d14426323 ip route
169.254.106.114/31 dev rfp-f48d5536-e proto kernel scope link src 169.254.106.114
192.168.94.0/24 dev qr-295f9857-21 proto kernel scope link src 192.168.94.1
[root@centos7-neutron-template ~]# ip netns exec qrouter-f48d5536-eefa-4410-b17b-1b3d14426323 ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
57481: from 192.168.94.9 lookup 16
3232259585: from 192.168.94.1/24 lookup 3232259585
[root@centos7-neutron-template ~]# ip netns exec qrouter-f48d5536-eefa-4410-b17b-1b3d14426323 ip route show table 16
default via 169.254.106.115 dev rfp-f48d5536-e
[root@centos7-neutron-template ~]# ip netns exec qrouter-f48d5536-eefa-4410-b17b-1b3d14426323 ip neighbor
169.254.106.115 dev rfp-f48d5536-e lladdr 7a:3b:f1:c7:5d:4e STALE
192.168.94.4 dev qr-295f9857-21 lladdr fa:16:3e:cf:a1:08 PERMANENT
192.168.94.13 dev qr-295f9857-21 lladdr fa:16:3e:91:37:54 PERMANENT
192.168.94.2 dev qr-295f9857-21 lladdr fa:16:3e:b2:18:5e PERMANENT
192.168.94.9 dev qr-295f9857-21 lladdr fa:16:3e:6c:4a:3b PERMANENT

[root@centos7-neutron-template ~]# ip netns exec qrouter-f48d5536-eefa-4410-b17b-1b3d14426323 iptables-save
# Generated by iptables-save v1.4.21 on Wed Jun 13 15:20:58 2018
*raw
:PREROUTING ACCEPT [5384:453413]
:OUTPUT ACCEPT [65:5637]
:neutron-l3d-OUTPUT - [0:0]
:neutron-l3d-PREROUTING - [0:0]
-A PREROUTING -j neutron-l3d-PREROUTING
-A OUTPUT -j neutron-l3d-OUTPUT
COMMIT
# Completed on Wed Jun 13 15:20:58 2018
# Generated by iptables-save v1.4.21 on Wed Jun 13 15:20:58 2018
*mangle
:PREROUTING ACCEPT [5281:443604]
:INPUT ACCEPT [4:336]
:FORWARD ACCEPT [20:1680]
:OUTPUT ACCEPT [4:336]
:POSTROUTING ACCEPT [24:2016]
:neutron-l3d-FORWARD - [0:0]
:neutron-l3d-INPUT - [0:0]
:neutron-l3d-OUTPUT - [0:0]
:neutron-l3d-POSTROUTING - [0:0]
:neutron-l3d-PREROUTING - [0:0]
:neutron-l3d-float-snat - [0:0]
:neutron-l3d-floatingip - [0:0]
:neutron-l3d-mark - [0:0]
:neutron-l3d-scope - [0:0]
-A PREROUTING -j neutron-l3d-PREROUTING
-A INPUT -j neutron-l3d-INPUT
-A FORWARD -j neutron-l3d-FORWARD
-A OUTPUT -j neutron-l3d-OUTPUT
-A POSTROUTING -j neutron-l3d-POSTROUTING
-A neutron-l3d-PREROUTING -j neutron-l3d-mark
-A neutron-l3d-PREROUTING -j neutron-l3d-scope
-A neutron-l3d-PREROUTING -m connmark ! --mark 0x0/0xffff0000 -j CONNMARK --restore-mark --nfmask 0xffff0000 --ctmask 0xffff0000
-A neutron-l3d-PREROUTING -j neutron-l3d-floatingip
-A neutron-l3d-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j MARK --set-xmark 0x1/0xffff
-A neutron-l3d-float-snat -m connmark --mark 0x0/0xffff0000 -j CONNMARK --save-mark --nfmask 0xffff0000 --ctmask 0xffff0000
-A neutron-l3d-scope -i qr-295f9857-21 -j MARK --set-xmark 0x4000000/0xffff0000
-A neutron-l3d-scope -i rfp-f48d5536-e -j MARK --set-xmark 0x4000000/0xffff0000
COMMIT
# Completed on Wed Jun 13 15:20:59 2018
# Generated by iptables-save v1.4.21 on Wed Jun 13 15:20:59 2018
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [1:84]
:POSTROUTING ACCEPT [3:252]
:neutron-l3d-OUTPUT - [0:0]
:neutron-l3d-POSTROUTING - [0:0]
:neutron-l3d-PREROUTING - [0:0]
:neutron-l3d-float-snat - [0:0]
:neutron-l3d-snat - [0:0]
:neutron-postrouting-bottom - [0:0]
-A PREROUTING -j neutron-l3d-PREROUTING
-A OUTPUT -j neutron-l3d-OUTPUT
-A POSTROUTING -j neutron-l3d-POSTROUTING
-A POSTROUTING -j neutron-postrouting-bottom
-A neutron-l3d-POSTROUTING ! -i rfp-f48d5536-e ! -o rfp-f48d5536-e -m conntrack ! --ctstate DNAT -j ACCEPT
-A neutron-l3d-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
-A neutron-l3d-PREROUTING -d 10.8.17.52/32 -i rfp-f48d5536-e -j DNAT --to-destination 192.168.94.9
-A neutron-l3d-float-snat -s 192.168.94.9/32 -j SNAT --to-source 10.8.17.52
-A neutron-l3d-snat -j neutron-l3d-float-snat
-A neutron-postrouting-bottom -m comment --comment "Perform source NAT on outgoing traffic." -j neutron-l3d-snat
COMMIT
# Completed on Wed Jun 13 15:20:59 2018
# Generated by iptables-save v1.4.21 on Wed Jun 13 15:20:59 2018
*filter
:INPUT ACCEPT [4:336]
:FORWARD ACCEPT [20:1680]
:OUTPUT ACCEPT [4:336]
:neutron-filter-top - [0:0]
:neutron-l3d-FORWARD - [0:0]
:neutron-l3d-INPUT - [0:0]
:neutron-l3d-OUTPUT - [0:0]
:neutron-l3d-local - [0:0]
:neutron-l3d-scope - [0:0]
-A INPUT -j neutron-l3d-INPUT
-A FORWARD -j neutron-filter-top
-A FORWARD -j neutron-l3d-FORWARD
-A OUTPUT -j neutron-filter-top
-A OUTPUT -j neutron-l3d-OUTPUT
-A neutron-filter-top -j neutron-l3d-local
-A neutron-l3d-FORWARD -j neutron-l3d-scope
-A neutron-l3d-INPUT -m mark --mark 0x1/0xffff -j ACCEPT
-A neutron-l3d-INPUT -p tcp -m tcp --dport 9697 -j DROP
-A neutron-l3d-scope -o qr-295f9857-21 -m mark ! --mark 0x4000000/0xffff0000 -j DROP
-A neutron-l3d-scope -o rfp-f48d5536-e -m mark ! --mark 0x4000000/0xffff0000 -j DROP
COMMIT
# Completed on Wed Jun 13 15:20:59 2018
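
The zero PREROUTING counters in the nat table above can be pulled out mechanically. A small sketch, assuming iptables-save output on stdin; chain_counters is a hypothetical helper name, and the sample lines are copied from the dump above.

```shell
#!/bin/sh
# Hypothetical helper: read iptables-save output on stdin and print
# "table chain packets bytes" for the named built-in chain in every table.
chain_counters() {
    awk -v chain="$1" '
        /^\*/ { table = substr($0, 2) }          # "*nat" -> current table is "nat"
        $1 == (":" chain) {                      # e.g. ":PREROUTING ACCEPT [0:0]"
            split(substr($3, 2, length($3) - 2), n, ":")   # strip [ ] and split
            print table, chain, n[1], n[2]
        }'
}

# Sample input: the chain policy lines from the dump in this report.
chain_counters PREROUTING <<'EOF'
*raw
:PREROUTING ACCEPT [5384:453413]
*mangle
:PREROUTING ACCEPT [5281:443604]
*nat
:PREROUTING ACCEPT [0:0]
EOF
```

On a live host the input would come from running iptables-save inside the qrouter namespace instead of the here-document.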

Also as you can see, the qrouter itself can ping the VM's fixed IP. It just does not DNAT/route packets arriving from the fip namespace:

[root@centos7-neutron-template ~]# ip netns exec qrouter-f48d5536-eefa-4410-b17b-1b3d14426323 ping 192.168.94.9
PING 192.168.94.9 (192.168.94.9) 56(84) bytes of data.
64 bytes from 192.168.94.9: icmp_seq=1 ttl=64 time=6.37 ms
64 bytes from 192.168.94.9: icmp_seq=2 ttl=64 time=1.02 ms
64 bytes from 192.168.94.9: icmp_seq=3 ttl=64 time=1.11 ms
64 bytes from 192.168.94.9: icmp_seq=4 ttl=64 time=0.599 ms

This is on the Newton release, BTW.

Arjun Baindur (abaindur) wrote :

We also observed that the fip namespace can ping the local floating IP: we see the qrouter NAT'ing and then routing the packet to the VM. But this does not occur when we ping from outside the host:

[root@centos7-neutron-template ~]# ip netns exec fip-81cc5d09-5ce0-4048-941c-fc9b9bf64139 ping 10.8.17.52 -I 10.8.17.51
PING 10.8.17.52 (10.8.17.52) from 10.8.17.51 : 56(84) bytes of data.
64 bytes from 10.8.17.52: icmp_seq=1 ttl=63 time=4.55 ms
64 bytes from 10.8.17.52: icmp_seq=2 ttl=63 time=1.62 ms

Arjun Baindur (abaindur) wrote :

Here are contents of sysctl.conf:

[root@centos7-neutron-template ~]# cat /etc/sysctl.conf
# sysctl settings are defined through files in
# /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.
#
# Vendors settings live in /usr/lib/sysctl.d/.
# To override a whole file, create a new file with the same in
# /etc/sysctl.d/ and put new settings there. To override
# only specific settings, add a file with a lexically later
# name in /etc/sysctl.d/ and put new settings there.
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.ens32.rp_filter = 0
#net.ipv4.conf.ens34.rp_filter = 0
#net.ipv4.conf.ens35.rp_filter = 0
#net.bridge.bridge-nf-call-iptables = 1
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv4.conf.all.rp_filter=0
net.ipv4.conf.default.rp_filter=0
net.bridge.bridge-nf-call-iptables=1
net.ipv4.ip_forward=1
net.ipv4.tcp_mtu_probing=1

The iptables and conntrack versions report the same as on a working CentOS 7.4 host.

description: updated
description: updated
Arjun Baindur (abaindur) wrote :

The issue seems to be specific to the newer kernel that came with the update to CentOS 7.5. We also hit the issue on a CentOS 7.4 image that has the same newer kernel:

3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Brian Haley (brian-haley) wrote :

This is a known issue with that particular kernel; there is a problem with the "notrack" netfilter code. I would recommend reverting to the older kernel or waiting for an update to be published.

Arjun Baindur (abaindur)
summary: - Floating IPs broken after upgrade to Centos 7.5 - DNAT not working
+ Floating IPs broken after kernel upgrade to Centos/RHEL 7.5 - DNAT not
+ working
Arjun Baindur (abaindur) wrote :

Thanks, please let me know if you have any more info on this issue - such as an upstream RHEL bug, mailing list discussion, IRC chat notes, technical details of exactly what is going wrong, etc...

It would be helpful to know what to keep an eye out for and when this gets fixed, or what info to pass on to customers

Since this is netfilter related, does it also affect security groups, or SNAT functionality in any way?

description: updated
Arjun Baindur (abaindur) wrote :

Does this affect security groups?

Is this problem specific to the 3.10.0-862 kernel introduced in RHEL 7.5 / Centos 7.5, or does it also affect the 4.X kernels available in the latest Ubuntu releases?

Arjun Baindur (abaindur) wrote :

FYI this seems to be a pretty serious bug. It kills a core neutron functionality and leaves VMs inaccessible. Is there no advisory/kernel bug/other Openstack bug/release notes about this? I could not find anything. How are people to know when it is safe to do a simple yum update?

Blake Covarrubias (blakegc) wrote :

Here's the bug report which details the exact issue: https://bugzilla.redhat.com/show_bug.cgi?id=1572983.

Brian Haley (brian-haley) wrote :

I hadn't linked the BZ since I thought it was private, but guess that isn't the case any more. Assuming people have access of course...

ByungYeol Woo (wby1089) wrote :

I found the same problem on 4.15.0-23-generic, Ubuntu 18.04.

LIU Yulong (dragon889) wrote :

FYI:
This kernel bug can also break L3 routers:
https://bugs.centos.org/view.php?id=11238

men (keyi) wrote :

Is there a message to fix this bug now?

Brian Haley (brian-haley) wrote :

I'm not sure of the last question, but I do not see a kernel >862 in the centos repo yet. Hopefully that gets published soon since there is no workaround in neutron for this.

Dongcan Ye (hellochosen) wrote :

I hit the same problem on CentOS 7.5 with kernel 3.10.0-862.el7.x86_64.

Dongcan Ye (hellochosen) wrote :

It seems that the bug is fixed in kernel-3.10.0-898.el7.
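
A rough sketch for flagging hosts by kernel release, based only on what this thread reports: build 862 of the 3.10.0 series is reported broken, and build 898 is reported (not confirmed here) as fixed. The function name and labels are hypothetical; builds between 862 and 898 and other kernel series are deliberately left as "unknown".

```shell
#!/bin/sh
# Hypothetical helper: classify a kernel release string (as printed by
# "uname -r") using only the build numbers mentioned in this thread.
classify_kernel() {
    release=$1
    case $release in
        3.10.0-*) ;;                       # RHEL/CentOS 7 kernel series
        *) echo unknown; return ;;
    esac
    build=${release#3.10.0-}               # "862.2.3.el7.x86_64"
    build=${build%%.*}                     # "862"
    case $build in
        ''|*[!0-9]*) echo unknown; return ;;
    esac
    if [ "$build" -ge 898 ]; then
        echo reported-fixed                # per the last comment; unverified
    elif [ "$build" -eq 862 ]; then
        echo reported-broken
    elif [ "$build" -lt 862 ]; then
        echo older-than-862                # 7.2 / 7.4 era kernels worked here
    else
        echo unknown
    fi
}

classify_kernel "$(uname -r)"
```

The Red Hat Bugzilla entry linked earlier in the thread remains the authoritative source for which builds carry the fix.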

Changed in neutron:
status: New → Invalid