Broken distributed virtual router w/ lbaas v1

Bug #1629539 reported by Turbo Fredriksson on 2016-10-01
Affects: neutron
Importance: Undecided
Assigned to: Unassigned

Bug Description

I wish I could come up with a smarter, more descriptive title for this, but if someone can after reading my report, feel free to update it.

I installed my second controller the other day. (Because of resource constraints, I run ALL my OpenStack control services, meaning APIs, engines, servers and so on, _everything_ but 'nova-compute' and 'nova-console', on one physical host.) After that, one of my LBaaSv1 load balancers stopped working. I haven't gotten around to trying v2 again; last time I ran into some issues, which I reported elsewhere in the tracker.

After almost a day of trying to figure out why only one of them broke and how to fix it, I realized it must be the _router_, not the load balancer, that's at fault (see below).

Broken LBaaSv1 VIP: 10.100.0.16/24
Broken LBaaSv1 Floating IP: 10.0.5.90/24
Working LBaaSv1 Floating IP: 10.0.4.190/24
Router VIF namespace: 10.0.5.100 (not sure exactly what this is, but for some reason it has 'stolen' the incoming "GW functionality" on the router from the .253 interfaces)
Router qrouter namespace: 10.0.4.253 + 10.0.5.253 (these are on the router's 'External Gateway' and are supposed to be the router's GW)
Primary GW/FW/NAT: eth1:192.168.69.1/24, eth2:10.0.4.254/24, eth2:10.0.5.254/24

=> ==========================================
=> From a physical host outside the OS network(s) (i.e. from the 192.168.69.0/24 network):

traceroute to 10.100.0.16 (10.100.0.16), 30 hops max, 60 byte packets <= CORRECT
 1 192.168.69.1 0.088 ms 0.077 ms 0.064 ms
 2 10.0.4.253 0.262 ms 0.246 ms 0.258 ms
 3 10.100.0.16 2.365 ms 2.348 ms 2.310 ms

traceroute to 10.0.5.90 (10.0.5.90), 30 hops max, 60 byte packets <= WRONG, LBaaSv1 doesn't work
 1 192.168.69.1 0.156 ms 0.138 ms 0.123 ms
 2 10.0.5.100 0.834 ms 0.863 ms 0.851 ms
 3 * * *
 4 10.0.5.90 1.487 ms 1.564 ms 1.561 ms

traceroute to 10.0.4.190 (10.0.4.190), 30 hops max, 60 byte packets <= WRONG, but LBaaSv1 works
 1 192.168.69.1 0.130 ms 0.112 ms 0.097 ms
 2 10.0.5.100 1.595 ms 1.581 ms 1.568 ms
 3 * * *
 4 10.0.4.190 2.265 ms 2.262 ms 2.251 ms

=> ==========================================
=> From an instance (inside the 10.100.0.0/24 subnet - all ICMP open)

traceroute to 10.100.0.16 (10.100.0.16), 30 hops max, 60 byte packets
 1 * * *
 2 * * *
 3 *^C

PING 10.100.0.16 (10.100.0.16) 56(84) bytes of data.
64 bytes from 10.100.0.16: icmp_seq=1 ttl=64 time=1.32 ms
64 bytes from 10.100.0.16: icmp_seq=2 ttl=64 time=0.548 ms
64 bytes from 10.100.0.16: icmp_seq=3 ttl=64 time=0.589 ms
^C

PING 10.0.5.90 (10.0.5.90) 56(84) bytes of data.
64 bytes from 10.100.0.16: icmp_seq=1 ttl=64 time=1.02 ms
64 bytes from 10.0.5.90: icmp_seq=1 ttl=60 time=1.68 ms (DUP!)
^C

PING 10.0.4.190 (10.0.4.190) 56(84) bytes of data.
64 bytes from 10.100.0.4: icmp_seq=1 ttl=64 time=0.925 ms
64 bytes from 10.0.4.190: icmp_seq=1 ttl=60 time=467 ms (DUP!)
^C
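
The two floating-IP pings above each show a first reply coming from the fixed address (10.100.0.16, 10.100.0.4) rather than from the address that was pinged. A minimal sketch for picking such replies out of saved ping output follows; `unexpected_reply_sources` is a hypothetical helper name, not part of any OpenStack tooling, and it assumes the Linux iputils `ping` reply format quoted in this report.

```shell
#!/bin/sh
# unexpected_reply_sources TARGET
# Hypothetical helper: reads ping output on stdin and prints the source
# address of every echo reply that did NOT come from TARGET.
unexpected_reply_sources() {
    awk -v target="$1" '
        / bytes from / {
            src = $4               # fourth field looks like "10.100.0.16:"
            sub(/:$/, "", src)     # drop the trailing colon
            if (src != target) print src
        }'
}

# Sample copied from the report: pinging the floating IP 10.0.5.90.
sample='PING 10.0.5.90 (10.0.5.90) 56(84) bytes of data.
64 bytes from 10.100.0.16: icmp_seq=1 ttl=64 time=1.02 ms
64 bytes from 10.0.5.90: icmp_seq=1 ttl=60 time=1.68 ms (DUP!)'

printf '%s\n' "$sample" | unexpected_reply_sources 10.0.5.90
```

With the sample above this prints `10.100.0.16`, the reply that arrived without its source being translated back to the floating IP.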

=> ==========================================
=> The 'actual' problem

=> From a host on the 192.168.69.0/24 network
$ curl --insecure https://10.100.0.16:8140/
curl: (35) Unknown SSL protocol error in connection to 10.100.0.16:8140 <= FAIL, never reaches backend server
$ curl --insecure https://10.0.5.90:8140/
The environment must be purely alphanumeric, not '' <= Actually working

=> From an instance
$ curl --insecure https://10.100.0.16:8140/
The environment must be purely alphanumeric, not '' <= Actually working
$ curl --insecure https://10.0.5.90:8140/
curl: (35) Unknown SSL protocol error in connection to 10.0.5.90:8140 <= FAIL, never reaches backend server

Testing a connection to 10.0.4.190 with curl won't work, because it's "ldaps" on port 636. An ldapsearch from 192.168.69.0/24 to that address works, but one from an instance does not. So that one is broken as well, even though I labeled it 'working' above :(. It's just "broken" in a different way.

=> ==========================================
=> Relevant namespaces on the controllers:

=>
=> Primary Controller
=>

=> ip netns | sort
fip-cd30c1bb-3db6-488c-b448-6cb4454783be
qrouter-4b3639a1-880f-4b55-989f-c6f654e562a7

=> fip-cd30c1bb-3db6-488c-b448-6cb4454783be
66: fg-38e452be-d4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    inet 10.0.5.100/24 brd 10.0.5.255 scope global fg-38e452be-d4

Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.0.5.254 0.0.0.0 UG 0 0 0 fg-38e452be-d4
10.0.4.189 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
10.0.4.190 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
10.0.4.195 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
10.0.5.0 0.0.0.0 255.255.255.0 U 0 0 0 fg-38e452be-d4
10.0.5.90 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
10.0.5.92 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
10.0.5.99 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
169.254.106.114 0.0.0.0 255.255.255.254 U 0 0 0 fpr-4b3639a1-8

=> qrouter-4b3639a1-880f-4b55-989f-c6f654e562a7
2: rfp-4b3639a1-8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    inet 10.0.5.90/32 brd 10.0.5.90 scope global rfp-4b3639a1-8
    inet 10.0.4.190/32 brd 10.0.4.190 scope global rfp-4b3639a1-8
71: qr-a2293a4c-51: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1458 qdisc noqueue state UNKNOWN group default
    inet 10.100.0.1/24 brd 10.100.0.255 scope global qr-a2293a4c-51

Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
10.100.0.0 0.0.0.0 255.255.255.0 U 0 0 0 qr-a2293a4c-51
169.254.106.114 0.0.0.0 255.255.255.254 U 0 0 0 rfp-4b3639a1-8

=>
=> Secondary Controller
=>

=> ip netns
snat-4b3639a1-880f-4b55-989f-c6f654e562a7
qrouter-4b3639a1-880f-4b55-989f-c6f654e562a7

=> snat-4b3639a1-880f-4b55-989f-c6f654e562a7
62: qg-1d52c5b9-4b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 10.0.4.253/24 brd 10.0.4.255 scope global qg-1d52c5b9-4b
    inet 10.0.5.253/24 brd 10.0.5.255 scope global qg-1d52c5b9-4b

Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.0.4.254 0.0.0.0 UG 0 0 0 qg-1d52c5b9-4b
10.0.4.0 0.0.0.0 255.255.255.0 U 0 0 0 qg-1d52c5b9-4b
10.0.5.0 0.0.0.0 255.255.255.0 U 0 0 0 qg-1d52c5b9-4b
10.100.0.0 0.0.0.0 255.255.255.0 U 0 0 0 sg-ed603ce2-fe

=> qrouter-4b3639a1-880f-4b55-989f-c6f654e562a7
51: qr-a2293a4c-51: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1458 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 10.100.0.1/24 brd 10.100.0.255 scope global qr-a2293a4c-51

Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
10.100.0.0 0.0.0.0 255.255.255.0 U 0 0 0 qr-a2293a4c-51

=> ==========================================
=> The iptables rules in the namespaces

=>
=> Primary Controller
=>

=> fip-cd30c1bb-3db6-488c-b448-6cb4454783be
neutron-fwaas-l3-INPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-filter-top all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-FORWARD all -- 0.0.0.0/0 0.0.0.0/0
neutron-filter-top all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-OUTPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-local all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-scope all -- 0.0.0.0/0 0.0.0.0/0

neutron-fwaas-l3-PREROUTING all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-OUTPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0
neutron-postrouting-bottom all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-float-snat all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-snat all -- 0.0.0.0/0 0.0.0.0/0 /* Perform source NAT on outgoing traffic. */

=> qrouter-4b3639a1-880f-4b55-989f-c6f654e562a7
neutron-fwaas-l3-INPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-filter-top all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-FORWARD all -- 0.0.0.0/0 0.0.0.0/0
neutron-filter-top all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-OUTPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-local all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-scope all -- 0.0.0.0/0 0.0.0.0/0
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 mark match 0x1/0xffff
DROP tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:9697
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000

=>
=> Secondary Controller
=>

neutron-fwaas-l3-PREROUTING all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-OUTPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0
neutron-postrouting-bottom all -- 0.0.0.0/0 0.0.0.0/0
DNAT all -- 0.0.0.0/0 10.0.5.92 to:10.100.0.25
DNAT all -- 0.0.0.0/0 10.0.5.90 to:10.100.0.16
DNAT all -- 0.0.0.0/0 10.0.5.99 to:10.104.0.44
DNAT all -- 0.0.0.0/0 10.0.4.189 to:10.100.0.3
DNAT all -- 0.0.0.0/0 10.0.4.195 to:10.104.0.27
DNAT all -- 0.0.0.0/0 10.0.4.190 to:10.100.0.4
DNAT all -- 0.0.0.0/0 10.0.5.92 to:10.100.0.25
DNAT all -- 0.0.0.0/0 10.0.5.90 to:10.100.0.16
DNAT all -- 0.0.0.0/0 10.0.5.99 to:10.104.0.44
DNAT all -- 0.0.0.0/0 10.0.4.189 to:10.100.0.3
DNAT all -- 0.0.0.0/0 10.0.4.195 to:10.104.0.27
DNAT all -- 0.0.0.0/0 10.0.4.190 to:10.100.0.4
REDIRECT tcp -- 0.0.0.0/0 169.254.169.254 tcp dpt:80 redir ports 9697
SNAT all -- 10.100.0.25 0.0.0.0/0 to:10.0.5.92
SNAT all -- 10.100.0.16 0.0.0.0/0 to:10.0.5.90
SNAT all -- 10.104.0.44 0.0.0.0/0 to:10.0.5.99
SNAT all -- 10.100.0.3 0.0.0.0/0 to:10.0.4.189
SNAT all -- 10.104.0.27 0.0.0.0/0 to:10.0.4.195
SNAT all -- 10.100.0.4 0.0.0.0/0 to:10.0.4.190
neutron-fwaas-l3-float-snat all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-snat all -- 0.0.0.0/0 0.0.0.0/0 /* Perform source NAT on outgoing traffic. */

The LBaaSv1 worked just fine before I distributed the router (and the vif and snat namespaces were created), and from what I can see all interfaces, routes and iptables rules seem fine. Still, I can only deduce that something in this setup is wrong, and I'm guessing it's the iptables rules somehow.

But because I don't know how they (the vif and snat namespaces) are supposed to work, I'm unsure how to proceed from here.
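
One way to back up the "rules seem fine" reading is to check that every floating-IP DNAT rule in the listing has the reverse SNAT rule, and vice versa. The following is a minimal sketch, assuming input in the `target prot opt source destination to:ADDR` column layout quoted above; `check_fip_nat_pairs` is a hypothetical helper name. It does not treat the repeated DNAT lines as an error, since the listing here appears to merge several chains without their headers.

```shell
#!/bin/sh
# check_fip_nat_pairs
# Hypothetical helper: reads NAT rule lines on stdin and reports any
# DNAT (floating IP -> fixed IP) without the reverse SNAT, and any SNAT
# without the matching DNAT. Lines with other targets are ignored.
check_fip_nat_pairs() {
    awk '
        # DNAT all -- 0.0.0.0/0 <floating> to:<fixed>
        $1 == "DNAT" { ip = $6; sub(/^to:/, "", ip); dnat[$5 " " ip] = 1 }
        # SNAT all -- <fixed> 0.0.0.0/0 to:<floating>
        $1 == "SNAT" { ip = $6; sub(/^to:/, "", ip); snat[ip " " $4] = 1 }
        END {
            for (k in dnat) if (!(k in snat)) print "DNAT without SNAT: " k
            for (k in snat) if (!(k in dnat)) print "SNAT without DNAT: " k
        }'
}

# Two of the pairs copied from the listing above.
rules='DNAT all -- 0.0.0.0/0 10.0.5.90 to:10.100.0.16
DNAT all -- 0.0.0.0/0 10.0.4.190 to:10.100.0.4
SNAT all -- 10.100.0.16 0.0.0.0/0 to:10.0.5.90
SNAT all -- 10.100.0.4 0.0.0.0/0 to:10.0.4.190'

printf '%s\n' "$rules" | check_fip_nat_pairs
```

For these sample rules the function prints nothing, meaning each DNAT has its SNAT counterpart. The same holds for all six pairs in the listing above, which points away from a missing NAT rule and toward which namespace or host the traffic passes through.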

description: updated
tags: added: l3-dvr-backlog lbaas
removed: distributed name router snat spaces vif

Can you provide more info on your release? Bear in mind that LBaaS v1 has been deprecated for a few cycles now and was removed in Newton.

Changed in neutron:
status: New → Incomplete
summary: - Broken distributed virtual router
+ Broken distributed virtual router w/ lbaas v1

Mitaka on Debian GNU/Linux Sid. I've tried to get LBaaSv2 working, but after a week of attempts, I had to give up for now. I reported issue(s) here several weeks ago about this, but apparently "I'm seeing things" so I'll try again at a later date.

But as I said in the initial post, I don't really think it has to do with the load balancer. All indications are that it's the router. The _distributed_ router.

The LBs worked just fine before I distributed the router, which I think is another indication that the problem is in the router.

Hi Turbo Fredriksson, you mentioned that it was working with the regular router and that you then moved on to DVR. Did you migrate the legacy router to DVR, or is it a new router that you created for DVR?

We have not seen any issues with LBaaS v1, but there are known issues with DVR and LBaaS (Octavia).

But let me test it and let you know.

I migrated my legacy one to DVR.

Did you remove the service before you migrated, or was the LBaaS service still associated with the router while the migration occurred?

I just updated the configuration, restarted all the services, took down the router, distributed it and then took it up again.

    openstack-configure set /etc/neutron/neutron.conf DEFAULT router_distributed true
    openstack-configure set /etc/neutron/neutron.conf DEFAULT router_scheduler_driver \
        neutron.scheduler.l3_agent_scheduler.LeastRoutersScheduler
    openstack-configure set /etc/neutron/neutron.conf DEFAULT router_auto_schedule true
    openstack-configure set /etc/neutron/neutron.conf DEFAULT \
        allow_automatic_l3agent_failover true
    openstack-configure set /etc/neutron/neutron.conf DEFAULT l3_ha false
    openstack-configure set /etc/neutron/neutron.conf DEFAULT max_l3_agents_per_router 0
    openstack-configure set /etc/neutron/neutron.conf DEFAULT min_l3_agents_per_router 2

    openstack-configure set /etc/neutron/l3_agent.ini DEFAULT agent_mode dvr

    openstack-configure set /etc/neutron/plugins/ml2/openvswitch_agent.ini agent \
        enable_distributed_routing true

    for init in /etc/init.d/neutron-*; do $init restart; done
    neutron router-update --admin-state-up False provider-tenants
    neutron router-update --distributed True provider-tenants
    neutron router-update --admin-state-up True provider-tenants

As far as I remember, I did nothing other than this. I didn't remove anything (because that would have required me to destroy EVERYTHING, since everything "hangs" off the router).

tags: removed: lbaas
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired