Floating IP scheduled to wrong router

Bug #1422476 reported by Kevin Fox
This bug affects 2 people
Affects                       Status        Importance  Assigned to      Milestone
OpenStack Security Advisory   Won't Fix     Undecided   Unassigned
neutron                       Fix Released  Medium      Kevin Fox
neutron (Juno)                Fix Released  Medium      Akihiro Motoki

Bug Description

I have a tenant network, two external networks, two routers (each with its gateway set to one of the external networks and one port on the tenant network), and floating IPs on each external network.
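For reference, here is a minimal sketch of how a topology like this could be built with python-neutronclient (Juno-era API). Every name, CIDR, and credential in it is illustrative rather than taken from the affected cloud:

# Hypothetical reproduction sketch; names, CIDRs, and credentials are made up.
from neutronclient.v2_0 import client

neutron = client.Client(username='admin', password='secret',
                        tenant_name='demo',
                        auth_url='http://controller:5000/v2.0')

# One tenant network and subnet, shared by both routers.
tenant_net = neutron.create_network({'network': {'name': 'tenant-net'}})['network']
tenant_subnet = neutron.create_subnet({'subnet': {
    'network_id': tenant_net['id'],
    'ip_version': 4,
    'cidr': '192.168.127.0/24'}})['subnet']

# Two external networks, each with its own router and a floating IP.
for name, cidr in [('ext-net-1', '203.0.113.0/24'), ('ext-net-2', '198.51.100.0/24')]:
    ext_net = neutron.create_network({'network': {
        'name': name, 'router:external': True}})['network']
    neutron.create_subnet({'subnet': {
        'network_id': ext_net['id'], 'ip_version': 4,
        'cidr': cidr, 'enable_dhcp': False}})
    router = neutron.create_router({'router': {'name': 'router-' + name}})['router']
    # Gateway on one external network, one interface on the shared tenant subnet.
    neutron.add_gateway_router(router['id'], {'network_id': ext_net['id']})
    neutron.add_interface_router(router['id'], {'subnet_id': tenant_subnet['id']})
    # Allocate a floating IP from this external network.
    neutron.create_floatingip({'floatingip': {'floating_network_id': ext_net['id']}})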

In Icehouse this worked fine: the floating IP for each network was attached to the correct router. After upgrading to RDO Juno, I'm seeing both sets of floating IPs assigned to the same router:

[root@cloud ~]# ip netns exec qrouter-209158a6-ee00-405f-b929-7cb386460d94 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
44: qr-ffaaacc1-06: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:5c:e1:58 brd ff:ff:ff:ff:ff:ff
    inet 192.168.127.1/24 brd 192.168.127.255 scope global qr-ffaaacc1-06
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe5c:e158/64 scope link
       valid_lft forever preferred_lft forever
53: qg-1a260edc-41: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:81:dc:f7 brd ff:ff:ff:ff:ff:ff
    inet 192.101.107.185/25 brd 192.101.107.255 scope global qg-1a260edc-41
       valid_lft forever preferred_lft forever
    inet 192.168.122.179/32 brd 192.168.122.179 scope global qg-1a260edc-41
       valid_lft forever preferred_lft forever
    inet 192.168.122.128/32 brd 192.168.122.128 scope global qg-1a260edc-41
       valid_lft forever preferred_lft forever
    inet 192.101.107.171/32 brd 192.101.107.171 scope global qg-1a260edc-41
       valid_lft forever preferred_lft forever
    inet 192.101.107.181/32 brd 192.101.107.181 scope global qg-1a260edc-41
       valid_lft forever preferred_lft forever
    inet 192.101.107.180/32 brd 192.101.107.180 scope global qg-1a260edc-41
       valid_lft forever preferred_lft forever
    inet 192.101.107.179/32 brd 192.101.107.179 scope global qg-1a260edc-41
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe81:dcf7/64 scope link
       valid_lft forever preferred_lft forever

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Since this report concerns a possible security risk, an incomplete security advisory task has been added while the core security reviewers for the affected project or projects confirm the bug and discuss the scope of any vulnerability along with potential solutions.

However, this seems to be related to the RDO upgrade process (which is not covered by the OpenStack Security Advisory project). If that is the case, feel free to report it to the appropriate bug tracker: https://bugzilla.redhat.com/enter_bug.cgi?product=RDO

Changed in ossa:
status: New → Incomplete
Revision history for this message
Kevin Fox (kevpn) wrote :

No, RDO does not use any special upgrade process that is not already part of neutron. I performed the upgrade per the instructions in the official OpenStack neutron upgrade notes, so if it is upgrade related, it is still neutron's bug.

I'm not sure it is even upgrade related, though. The cloud is showing this behavior either way; I pointed out that I had upgraded it only for completeness. I haven't had a non-upgraded Juno deployment to test with.

Another interesting data point.

All the VMs that had floating IPs before the upgrade have their floating IP addresses bound to the proper routers.

All new VMs launched today go to the wrong routers, so it looks like a Juno router/floating-IP scheduling issue.

My production cloud is now broken for new VMs, so I have a strong incentive to track down this issue myself if I can't get any help. At the very least, if someone who knows where floating IPs get scheduled to routers in the code could tell me where that happens, I would have a much better chance of finding the bug.

Thanks,
Kevin

Revision history for this message
Jeremy Stanley (fungi) wrote :

If there's no good reason to believe that an attacker could actually _cause_ this to happen in your environment, then it's not something for which we're going to issue a security advisory and we shouldn't leave it embargoed. While it continues to be marked private, very few developers are likely to see this and help you work out how it happened or how to fix it.

Revision history for this message
Kevin Fox (kevpn) wrote :

More info: I tried the neutron client with --debug and got:
DEBUG: keystoneclient.session REQ: curl -i -X PUT http://192.168.122.36:9696/v2.0/floatingips/6333cab1-b088-4f64-9861-9dcacd865887.json -H "User-Agent: python-neutronclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: TOKEN_REDACTED" -d '{"floatingip": {"port_id": "df6dac26-43ac-4bea-a907-0d51161575fe"}}'
DEBUG: keystoneclient.session RESP: [200] {'date': 'Wed, 18 Feb 2015 17:32:52 GMT', 'content-length': '376', 'content-type': 'application/json; charset=UTF-8', 'x-openstack-request-id': 'req-bf74e70d-bca1-4496-8e30-9b2d4f0b5cd9'}
RESP BODY: {"floatingip": {"floating_network_id": "547b4b20-a2c7-47fa-9628-167a38755e6a", "router_id": "209158a6-ee00-405f-b929-7cb386460d94", "fixed_ip_address": "192.168.127.154", "floating_ip_address": "192.168.122.68", "tenant_id": "536d674127fe44cda5699c9879979f2e", "status": "DOWN", "port_id": "df6dac26-43ac-4bea-a907-0d51161575fe", "id": "6333cab1-b088-4f64-9861-9dcacd865887"}}

The router_id in the response is the wrong router, and nowhere in the request did it state which router to assign the floating IP to, so it is definitely a scheduling issue somewhere in the server.
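One way to see the mismatch from the API side is to compare the floating IP's router_id with the routers that actually have their gateway on the floating IP's external network. A sketch with python-neutronclient (placeholder credentials, floating IP ID taken from the request above):

# Sketch only: compares the router the floating IP was scheduled to with the
# routers that have a gateway port on the floating IP's external network.
from neutronclient.v2_0 import client

neutron = client.Client(username='admin', password='secret',
                        tenant_name='demo',
                        auth_url='http://controller:5000/v2.0')

fip = neutron.show_floatingip('6333cab1-b088-4f64-9861-9dcacd865887')['floatingip']
print('scheduled to router: %s' % fip['router_id'])

for router in neutron.list_routers()['routers']:
    gw = router.get('external_gateway_info') or {}
    if gw.get('network_id') == fip['floating_network_id']:
        print('router with a gateway on the right external network: %s' % router['id'])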

Revision history for this message
Kevin Fox (kevpn) wrote :

OK, let's open it up then.

Thanks,
Kevin

Revision history for this message
Kevin Fox (kevpn) wrote :

OK, I found a file that may be related:

/usr/lib/python2.7/site-packages/neutron/db/l3_db.py
in the function _get_router_for_floatingip

The diff between Icehouse, where things worked, and Juno, where they don't, is:

[kfox@mantis bug]$ diff -u before.txt after.txt
--- before.txt 2015-02-18 10:21:39.195920132 -0800
+++ after.txt 2015-02-18 10:21:48.662695681 -0800
@@ -8,22 +8,19 @@
                      'which has no gateway_ip') % internal_subnet_id)
             raise n_exc.BadRequest(resource='floatingip', msg=msg)

-        # find router interface ports on this network
-        router_intf_qry = context.session.query(models_v2.Port)
-        router_intf_ports = router_intf_qry.filter_by(
-            network_id=internal_port['network_id'],
-            device_owner=DEVICE_OWNER_ROUTER_INTF)
+        router_intf_ports = self._get_interface_ports_for_network(
+            context, internal_port['network_id'])

-        for intf_p in router_intf_ports:
-            if intf_p['fixed_ips'][0]['subnet_id'] == internal_subnet_id:
-                router_id = intf_p['device_id']
-                router_gw_qry = context.session.query(models_v2.Port)
-                has_gw_port = router_gw_qry.filter_by(
-                    network_id=external_network_id,
-                    device_id=router_id,
-                    device_owner=DEVICE_OWNER_ROUTER_GW).count()
-                if has_gw_port:
-                    return router_id
+        # This joins on port_id so is not a cross-join
+        routerport_qry = router_intf_ports.join(models_v2.IPAllocation)
+        routerport_qry = routerport_qry.filter(
+            models_v2.IPAllocation.subnet_id == internal_subnet_id
+        )
+
+        router_port = routerport_qry.first()
+
+        if router_port and router_port.router.gw_port:
+            return router_port.router.id

         raise l3.ExternalGatewayForFloatingIPNotFound(
             subnet_id=internal_subnet_id,

-----------------------------------------

I don't quite understand the difference yet, but at first glance, it looks like it may be the culprit.
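Looking more closely, the key difference seems to be that the old loop only returned a router that had a gateway port on the floating IP's external network, while the new query takes the first router interface port on the subnet and only checks that that router has some gateway, so with two routers on the same tenant subnet the first one always wins. A standalone sketch of the two strategies (this is not Neutron code; every name in it is invented):

# Standalone illustration of the two selection strategies described above.

def pick_router_icehouse(interface_ports, gateway_net_by_router,
                         internal_subnet_id, external_network_id):
    # Only return a router whose gateway sits on the floating IP's external network.
    for port in interface_ports:
        if port['subnet_id'] != internal_subnet_id:
            continue
        if gateway_net_by_router.get(port['router_id']) == external_network_id:
            return port['router_id']
    return None

def pick_router_juno(interface_ports, gateway_net_by_router,
                     internal_subnet_id, external_network_id):
    # Take the first interface port on the subnet; only require that the
    # router has *a* gateway, not one on the right external network.
    for port in interface_ports:
        if port['subnet_id'] != internal_subnet_id:
            continue
        if gateway_net_by_router.get(port['router_id']):
            return port['router_id']
    return None

# Two routers on the same tenant subnet, each gatewayed to a different
# external network -- the setup from the bug description.
ports = [{'router_id': 'router-1', 'subnet_id': 'tenant-subnet'},
         {'router_id': 'router-2', 'subnet_id': 'tenant-subnet'}]
gateways = {'router-1': 'ext-net-1', 'router-2': 'ext-net-2'}

# A floating IP from ext-net-2 should go to router-2.
assert pick_router_icehouse(ports, gateways, 'tenant-subnet', 'ext-net-2') == 'router-2'
assert pick_router_juno(ports, gateways, 'tenant-subnet', 'ext-net-2') == 'router-1'  # wrong router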

information type: Private Security → Public
Revision history for this message
Jeremy Stanley (fungi) wrote :

I've marked the security advisory task for this bug as "won't fix" since it doesn't sound like something we'd issue an advisory over (class D in our vulnerability taxonomy https://wiki.openstack.org/wiki/Vulnerability_Management#Incident_report_taxonomy ). If the context for this bug changes, we can revisit that decision.

Changed in ossa:
status: Incomplete → Won't Fix
Revision history for this message
Kevin Fox (kevpn) wrote :

This code was changed in:
commit 93012915a3445a8ac8a0b30b702df30febbbb728
for bug:
https://bugs.launchpad.net/neutron/+bug/1378866

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/157167

Changed in neutron:
assignee: nobody → Kevin Fox (kevpn)
status: New → In Progress
Changed in neutron:
milestone: none → kilo-rc1
importance: Undecided → Medium
Changed in neutron:
assignee: Kevin Fox (kevpn) → Kevin Benton (kevinbenton)
Changed in neutron:
assignee: Kevin Benton (kevinbenton) → nobody
Changed in neutron:
assignee: nobody → Kevin Fox (kevpn)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/157167
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=84bbcb6f2a5400112751517e41bb50b5056220e0
Submitter: Jenkins
Branch: master

commit 84bbcb6f2a5400112751517e41bb50b5056220e0
Author: Kevin Fox <email address hidden>
Date: Wed Feb 18 14:01:49 2015 -0800

    Fixes floating IP regression with multiple routers

    During the refactor here:
    Change-Id: I09e8a694cdff7f64a642a39b45cbd12422132806
    Too much code was removed and caused floating ips to get miss assigned when
    multiple routers with external networks in the same tenant are present. The
    first router in the tenant was always being chosen. This patch adds back
    some of the original code as well as a unit test.

    Change-Id: I6f663cb1ce3e4a1340c415d13787a9855c4dcac2
    Closes-Bug: 1422476

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-rc1 → 2015.1.0
Akihiro Motoki (amotoki)
tags: added: juno-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/180934

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/juno)

Reviewed: https://review.openstack.org/180934
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0d0675b0845c1b2ec740a45c29c5c0184b1d4e2a
Submitter: Jenkins
Branch: stable/juno

commit 0d0675b0845c1b2ec740a45c29c5c0184b1d4e2a
Author: Kevin Fox <email address hidden>
Date: Wed Feb 18 14:01:49 2015 -0800

    Fixes floating IP regression with multiple routers

    During the refactor here:
    Change-Id: I09e8a694cdff7f64a642a39b45cbd12422132806
    Too much code was removed and caused floating ips to get miss assigned when
    multiple routers with external networks in the same tenant are present. The
    first router in the tenant was always being chosen. This patch adds back
    some of the original code as well as a unit test.

    Change-Id: I6f663cb1ce3e4a1340c415d13787a9855c4dcac2
    Closes-Bug: 1422476
    (cherry picked from commit 84bbcb6f2a5400112751517e41bb50b5056220e0)

tags: added: in-stable-juno
Alan Pevec (apevec)
tags: removed: in-stable-juno juno-backport-potential