cross tenant network pollution post upgrade to Havana RC2

Bug #1240066 reported by James Page
Affects    Status     Importance    Assigned to    Milestone
neutron    Expired    Undecided     Unassigned

Bug Description

We've been running Havana RC1 without problems since last week on the
internal OpenStack deployment that we use for QA'ing OpenStack on
Ubuntu; it was running b3 prior to that. This morning I bumped all of
the packages to RC2 as they became available (including neutron and
nova) and promptly saw a whole raft of tenant network access issues,
which I think might all be related to the same underlying cause.

We run the Neutron Open vSwitch plugin with GRE overlay networks.

We run multiple tenants with the same IP address ranges, accessed via
servers assigned floating IPs. I noticed that I kept getting bumped
from my access server, so I dug in a bit further in the L3 router
namespace on the gateway node: the ARP entry for the server kept
switching to a port assigned to another tenant's instance, indicating
some sort of cross-tenant L2 network pollution.
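
One way to confirm this kind of flapping would be to capture the router namespace's neighbour table a few times and compare the snapshots. The following is a minimal Python sketch, assuming each snapshot is text in the format printed by `ip neigh show` (for example from `sudo ip netns exec qrouter-<router-id> ip neigh show`). The addresses, device names and function names are made up for illustration and are not taken from the affected deployment.

```python
from collections import defaultdict


def parse_neigh(output):
    """Map IP -> MAC for `ip neigh show` style lines that carry an lladdr."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        if "lladdr" in fields:
            idx = fields.index("lladdr")
            if idx + 1 < len(fields):
                table[fields[0]] = fields[idx + 1].lower()
    return table


def find_flapping(snapshots):
    """Return {ip: [macs in order seen]} for IPs that had more than one MAC."""
    seen = defaultdict(list)
    for snap in snapshots:
        for ip, mac in parse_neigh(snap).items():
            if mac not in seen[ip]:
                seen[ip].append(mac)
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}


# Hypothetical snapshots taken a few seconds apart (illustrative values only).
SNAP_1 = (
    "10.5.0.2 dev qr-aaaaaaaa-aa lladdr fa:16:3e:00:00:01 REACHABLE\n"
    "10.5.0.3 dev qr-aaaaaaaa-aa lladdr fa:16:3e:00:00:02 STALE\n"
)
SNAP_2 = (
    "10.5.0.2 dev qr-aaaaaaaa-aa lladdr fa:16:3e:00:00:99 REACHABLE\n"
    "10.5.0.3 dev qr-aaaaaaaa-aa lladdr fa:16:3e:00:00:02 REACHABLE\n"
)

flapping = find_flapping([SNAP_1, SNAP_2])
for ip, macs in flapping.items():
    print(ip, "was seen at", ", ".join(macs))
```

An address that shows up here with two MACs would then need to be matched against the Neutron ports of each tenant to see which instance is answering.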

I appear to have cleaned this up by running:

   sudo neutron-ovs-cleanup

on the compute host that had the other tenant's instance and then hard
rebooting all of the instances running on that host to reconnect
them.

I noticed a lot of cruft on the integration bridge; this is taken from
a host where I have not done the cleanup steps:

ubuntu@ciguapa:~$ sudo ovs-vsctl show
8aa44160-224e-41fe-9b54-92c9d3e779bb
    Bridge br-int
        Port "qvoff030e8d-73"
            tag: 4095
            Interface "qvoff030e8d-73"
        Port "tap15d5f03d-af"
            tag: 1
            Interface "tap15d5f03d-af"
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "qvo15d5f03d-af"
            Interface "qvo15d5f03d-af"
        Port "tapc143c034-e0"
            tag: 3
            Interface "tapc143c034-e0"
        Port "qvo1b3f5a5f-60"
            tag: 4095
            Interface "qvo1b3f5a5f-60"
        Port "qvod43a627c-a0"
            Interface "qvod43a627c-a0"
        Port "tapd43a627c-a0"
            tag: 2
            Interface "tapd43a627c-a0"
        Port "qvo8162d068-ce"
            tag: 4095
            Interface "qvo8162d068-ce"
        Port "qvoc143c034-e0"
            Interface "qvoc143c034-e0"
        Port br-int
            Interface br-int
                type: internal
        Port "qvoc2e6f8a5-56"
            tag: 4095
            Interface "qvoc2e6f8a5-56"

I guess this might be an artifact of upgrading from b3->RC1->RC2 but
it feels pretty nasty to me.

James Page (james-page)
tags: added: havana-rc-potential
Changed in neutron:
assignee: nobody → Mark McClain (markmcclain)
assignee: Mark McClain (markmcclain) → Kyle Mestery (mestery)
Salvatore Orlando (salvatore-orlando) wrote :

For fixing bug 12240001 I've disabled arping by default, because it was crashing the kernel under load.
This can be easily restored in the agent configuration; I don't know if that's the root cause, but the unsolicited ARP will update the ARP cache in the broadcast domain of the logical router.

I see many of the interfaces on br-int that you've posted have been put by the agent on the 'dead vlan' (4095). Could they belong to VMs which did not shut down properly?
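
For anyone wanting to check a host for the same symptom, the ports on the dead VLAN can be pulled out of the `ovs-vsctl show` text mechanically. This is only a sketch under stated assumptions: it parses the indented text layout shown in the bug description, the sample is an excerpt of that output, and the function name `parse_ports` is made up for illustration.

```python
import re

DEAD_VLAN = 4095  # the tag mentioned above as the 'dead vlan'


def parse_ports(show_output):
    """Return {port_name: tag or None} from `ovs-vsctl show` text."""
    ports = {}
    current = None
    for line in show_output.splitlines():
        stripped = line.strip()
        match = re.match(r'Port "?([^"]+)"?$', stripped)
        if match:
            current = match.group(1)
            ports[current] = None  # stays None if no tag line follows
            continue
        match = re.match(r"tag: (\d+)$", stripped)
        if match and current is not None:
            ports[current] = int(match.group(1))
    return ports


# Excerpt of the output posted in the bug description.
SAMPLE = '''
    Bridge br-int
        Port "qvoff030e8d-73"
            tag: 4095
            Interface "qvoff030e8d-73"
        Port "tap15d5f03d-af"
            tag: 1
            Interface "tap15d5f03d-af"
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "qvo15d5f03d-af"
            Interface "qvo15d5f03d-af"
        Port "qvo1b3f5a5f-60"
            tag: 4095
            Interface "qvo1b3f5a5f-60"
'''

ports = parse_ports(SAMPLE)
dead = sorted(name for name, tag in ports.items() if tag == DEAD_VLAN)
print("ports on dead VLAN:", dead)
```

The port names on the dead VLAN could then be compared with the instances known to be running on the host.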

Kyle Mestery (mestery) wrote :

@salvatore-orlando: I'm not sure how removing the GARP is related here, but I think it's easy enough for James to try this out and see if it affects things for him. One thing which concerns me is that for this cross-tenant pollution to happen on an L2 boundary, something must be incorrectly setting up local VLANs on the hosts, or leaving ports behind which are responding to ARP requests. James, is it possible you still had L3 agents running from before your upgrade when this happened?
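
A rough way to look for ports left behind is to group the devices on br-int by the port-ID suffix they share and see which IDs have more than one device attached. The sketch below is an assumption-laden illustration: the port names and tags are copied from the `ovs-vsctl show` output in the bug description, while the grouping rule (strip the three-character `tap`/`qvo` prefix) and the function name are made up for this example.

```python
from collections import defaultdict

# Port name -> local VLAN tag (None = no tag shown), copied from the
# `ovs-vsctl show` output in the bug description.
PORT_TAGS = {
    "qvoff030e8d-73": 4095,
    "tap15d5f03d-af": 1,
    "patch-tun": None,
    "qvo15d5f03d-af": None,
    "tapc143c034-e0": 3,
    "qvo1b3f5a5f-60": 4095,
    "qvod43a627c-a0": None,
    "tapd43a627c-a0": 2,
    "qvo8162d068-ce": 4095,
    "qvoc143c034-e0": None,
    "br-int": None,
    "qvoc2e6f8a5-56": 4095,
}


def group_by_port_id(port_tags):
    """Group tap*/qvo* devices by the port-ID suffix they share."""
    groups = defaultdict(dict)
    for name, tag in port_tags.items():
        prefix, port_id = name[:3], name[3:]
        if prefix in ("tap", "qvo"):
            groups[port_id][prefix] = tag
    return groups


groups = group_by_port_id(PORT_TAGS)

# Port IDs that have both a tap and a qvo device on the same bridge.
doubled = sorted(pid for pid, devs in groups.items() if len(devs) == 2)

# qvo devices that carry no local VLAN tag at all.
untagged = sorted(
    name for name, tag in PORT_TAGS.items()
    if tag is None and name.startswith("qvo")
)

print("IDs with both tap and qvo attached:", doubled)
print("untagged qvo ports:", untagged)
```

On the data from the report, the three port IDs that have a tagged tap device also have an untagged qvo device on the same bridge. In Open vSwitch a port with no tag is a trunk port by default, so that pairing may be worth checking, but nothing in this thread confirms it as the cause.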

James Page (james-page) wrote : Re: [Bug 1240066] Re: cross tenant network pollution post upgrade to Havana RC2

On 15/10/13 19:51, Kyle Mestery wrote:
> @salvatore-orlando: I'm not sure how removing the GARP is related
> here, but I think it's easy enough for James to try this out and
> see if it affects things for him. One thing which concerns me is
> that for this cross-tenant pollution to happen on an L2 boundary,
> something must be incorrectly setting up local VLANs on the hosts,
> or leaving ports behind which are responding to ARP requests.
> James, is it possible you still had L3 agents running from before
> your upgrade when this happened?

I tried switching the GARP requests back on at Mark's suggestion; this
made no difference to the situation. That said, I had already
cleared the problem by running the neutron-ovs-cleanup process I
described on the compute nodes.

The install is based on the Havana cloud archive packages, so the L3
agent would have been shut down prior to the upgrade and then started
afterwards, so I think it's unlikely.

Kyle Mestery (mestery) wrote :

James, a question for you: When you did the upgrade, did you power off all the VMs, or did you simply leave things running, stop the existing services, load the new RC2 code, and then start the new services? It's unclear to me if that would be a supported upgrade model in Neutron, and OpenStack in general. I'll see if I can dig that out or not, but getting clarification on this point would be great.

Thierry Carrez (ttx)
tags: added: havana-backport-potential
removed: havana-rc-potential
James Page (james-page) wrote :

On 17/10/13 15:50, Kyle Mestery wrote:
> James, a question for you: When you did the upgrade, did you power
> off all the VMs, or did you simply leave things running, stop the
> existing services, load the new RC2 code, and then start the new
> services? It's unclear to me if that would be a supported upgrade
> model in Neutron, and OpenStack in general. I'll see if I can dig
> that out or not, but getting clarification on this point would be
> great.

I just upgraded the code: stop services, upgrade, start services.
No instances were stopped during the upgrade.

I *really* hope that it is a supported upgrade model...

Thierry Carrez (ttx) wrote :

Waiting on debunking before deciding if this is a security issue or not.

Changed in ossa:
status: New → Incomplete
Thierry Carrez (ttx) wrote :

@James: have you seen the issue again? I would tend to open this bug at this point so that it gets more publicity... If it's not easily reproducible I wouldn't count it as an exploitable vulnerability.

James Page (james-page) wrote :

I've not managed to repeat this issue, so I suspect it was something transient in upgrading between RCs.

Let's get some more publicity...

Thierry Carrez (ttx) wrote :

Feel free to make it "public security" again if you feel it has clear security impact

information type: Private Security → Public
no longer affects: ossa
Mark McClain (markmcclain) wrote :

A few Neutron team members tried to replicate this bug and have been unable to do so thus far.

tags: added: ovs
Changed in neutron:
status: New → Incomplete
Kyle Mestery (mestery)
Changed in neutron:
assignee: Kyle Mestery (mestery) → nobody
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired