Eavesdropping private traffic

Bug #1734320 reported by Paul Peereboom on 2017-11-24
Affects                        Importance   Assigned to
OpenStack Compute (nova)       Undecided    sean mooney
OpenStack Security Advisory    Undecided    Unassigned
neutron                        High         sean mooney
os-vif                         Undecided    sean mooney

Bug Description

Eavesdropping private traffic
=============================

Abstract
--------

We've discovered a security issue that allows end users within their own private network to receive traffic from, and send traffic to, other private networks on the same compute node.

Description
-----------

During live migration there is a small time window in which the ports of instances are untagged. An untagged port is effectively trunked to the integration bridge, so the instance receives 802.1Q-tagged private traffic from other tenants.

If the port is administratively down during live migration, the port will remain in trunk mode indefinitely.

Traffic is possible between ports that are administratively down, even between tenants' self-service networks.

Conditions
----------

The following conditions are necessary.

* Open vSwitch self-service networks
* An OpenStack administrator or an automated process needs to schedule a live migration

We tested this on Newton.

Issues
------

This outcome is the result of multiple independent issues. We will list the most important first, and follow with bugs that create a fragile situation.

Issue #1 Initially creating a trunk port

When the port is initially created, it is in trunk mode. This creates a fail-open situation.
See: https://github.com/openstack/os-vif/blob/newton-eol/vif_plug_ovs/linux_net.py#L52
Recommendation: create ports in the port_dead state; don't leave them dangling in trunk mode. Add a drop flow initially.

Issue #2 Order of creation.

The instance is actually migrated before the (networking) configuration is completed.

Recommendation: do not finish the live migration until the underlying network configuration has been applied completely.
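A minimal sketch of that recommendation, with all names hypothetical (nova's real mechanism for this is its external-event API, e.g. the network-vif-plugged notification): the migration path blocks until the network backend confirms the port is wired, and fails closed otherwise.

    import threading

    class PortWiringBarrier:
        """Hypothetical barrier: resume the guest on the destination only
        once neutron has confirmed the port is fully wired up."""

        def __init__(self, timeout=300):
            self._wired = threading.Event()
            self._timeout = timeout

        def on_vif_plugged(self, port_id):
            # Would be driven by neutron's network-vif-plugged notification.
            self._wired.set()

        def wait_before_resume(self):
            # Fail closed: abort rather than resume on an unwired port.
            if not self._wired.wait(self._timeout):
                raise RuntimeError("port wiring not confirmed; aborting resume")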

Issue #3 Not closing the port when it is down.

Neutron calls the port_dead function to ensure the port is down. It sets the tag to 4095 and adds a "drop" flow if (and only if) there is already another tag on the port. The port_dead function will keep untagged ports untagged.

https://github.com/openstack/neutron/blob/stable/newton/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L995

Recommendation: Make port_dead also shut the port if no tag is found. Log a warning if this happens.

Issue #4 Putting the port administratively down actually puts the port on a compute node shared vlan

Instances from different projects on different private networks can talk to each other if they put their ports down. The code does install an OpenFlow "drop" rule, but it has a lower priority (2) than the allow rules, and OpenFlow always applies the highest-priority matching rule, so the allow rules win.

Recommendation: Increase the port_dead openflow drop rule priority to MAX

Timeline
--------

2017-09-14  Discovered eavesdropping issue
2017-09-15  Verified workaround
2017-10-04  Discovered port-down-traffic issue
2017-11-24  Vendor disclosure to OpenStack

Steps to reproduce
------------------

1. Attach an instance to two networks:

admin$ openstack server create --nic net-id=<net-uuid1> --nic net-id=<net-uuid2> --image <image_id> --flavor <flavor_id> instance_temp

2. Attach a FIP to the instance to be able to log in to this instance

3. Verify:

admin$ openstack server show -c name -c addresses fe28a2ee-098f-4425-9d3c-8e2cd383547d

+-----------+-----------------------------------------------------------------------------+
| Field | Value |
+-----------+-----------------------------------------------------------------------------+
| addresses | network1=192.168.99.8, <FIP>; network2=192.168.80.14 |
| name | instance_temp |
+-----------+-----------------------------------------------------------------------------+

4. SSH to the instance using network1 and run a tcpdump on the other port, network2

[root@instance_temp]$ tcpdump -eeenni eth1

5. Get port-id of network2

admin$ nova interface-list fe28a2ee-098f-4425-9d3c-8e2cd383547d
+------------+--------------------------------------+--------------------------------------+---------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+---------------+-------------------+
| ACTIVE | a848520b-0814-4030-bb48-49e4b5cf8160 | d69028f7-9558-4f14-8ce6-29cb8f1c19cd | 192.168.80.14 | fa:16:3e:2d:8b:7b |
| ACTIVE | fad148ca-cf7a-4839-aac3-a2cd8d1d2260 | d22c22ae-0a42-4e3b-8144-f28534c3439a | 192.168.99.8 | fa:16:3e:60:2c:fa |
+------------+--------------------------------------+--------------------------------------+---------------+-------------------+

6. Force port down on network 2

admin$ neutron port-update a848520b-0814-4030-bb48-49e4b5cf8160 --admin-state-up False

7. Port gets tagged with vlan 4095, the dead vlan tag, which is normal:

compute1# grep a848520b-0814-4030-bb48-49e4b5cf8160 /var/log/neutron/neutron-openvswitch-agent.log | tail -1
INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-e008feb3-8a35-4c97-adac-b48ff88165b2 - - - - -] VIF port: a848520b-0814-4030-bb48-49e4b5cf8160 admin state up disabled, putting on the dead VLAN

8. Verify the port is tagged with vlan 4095

compute1# ovs-vsctl show | grep -A3 qvoa848520b-08
      Port "qvoa848520b-08"
          tag: 4095
          Interface "qvoa848520b-08"

9. Now live-migrate the instance:

admin# nova live-migration fe28a2ee-098f-4425-9d3c-8e2cd383547d

10. Verify the tag is gone on compute2, and take a deep breath

compute2# ovs-vsctl show | grep -A3 qvoa848520b-08
      Port "qvoa848520b-08"
          Interface "qvoa848520b-08"
      Port...
compute2# echo "Wut!"

11. Now traffic of all other self-service networks present on compute2 can be sniffed from instance_temp

[root@instance_temp] tcpdump -eenni eth1
13:14:31.748266 fa:16:3e:6a:17:38 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 10, p 0, ethertype ARP, Request who-has 10.103.12.160 tell 10.103.12.152, length 28
13:14:31.804573 fa:16:3e:e8:a2:d2 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.70, length 28
13:14:31.810482 fa:16:3e:95:ca:3a > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.154, length 28
13:14:31.977820 fa:16:3e:6f:f4:9b > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.150, length 28
13:14:31.979590 fa:16:3e:0f:3d:cc > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 9, p 0, ethertype ARP, Request who-has 10.103.9.163 tell 10.103.9.1, length 28
13:14:32.048082 fa:16:3e:65:64:38 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.101, length 28
13:14:32.127400 fa:16:3e:30:cb:b5 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 10, p 0, ethertype ARP, Request who-has 10.103.12.160 tell 10.103.12.165, length 28
13:14:32.141982 fa:16:3e:96:cd:b0 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.100, length 28
13:14:32.205327 fa:16:3e:a2:0b:76 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.153, length 28
13:14:32.444142 fa:16:3e:1f:db:ed > 01:00:5e:00:00:12, ethertype 802.1Q (0x8100), length 58: vlan 72, p 0, ethertype IPv4, 192.168.99.212 > 224.0.0.18: VRRPv2, Advertisement, vrid 50, prio 103, authtype none, intvl 1s, length 20
13:14:32.449497 fa:16:3e:1c:24:c0 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.20, length 28
13:14:32.476015 fa:16:3e:f2:3b:97 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.22, length 28
13:14:32.575034 fa:16:3e:44:fe:35 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 10, p 0, ethertype ARP, Request who-has 10.103.12.160 tell 10.103.12.163, length 28
13:14:32.676185 fa:16:3e:1e:92:d7 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 10, p 0, ethertype ARP, Request who-has 10.103.12.160 tell 10.103.12.150, length 28
13:14:32.711755 fa:16:3e:99:6c:c8 > 01:00:5e:00:00:12, ethertype 802.1Q (0x8100), length 62: vlan 10, p 0, ethertype IPv4, 10.103.12.154 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 49, authtype simple, intvl 1s, length 24
13:14:32.711773 fa:16:3e:f5:23:d5 > 01:00:5e:00:00:12, ethertype 802.1Q (0x8100), length 58: vlan 12, p 0, ethertype IPv4, 10.103.15.154 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 49, authtype simple, intvl 1s, length 20

Workaround
----------

We temporarily fixed this issue by forcing the dead VLAN tag at port creation on the compute nodes:

/usr/lib/python2.7/site-packages/vif_plug_ovs/linux_net.py:

 def _create_ovs_vif_cmd(bridge, dev, iface_id, mac,
                         instance_id, interface_type=None,
                         vhost_server_path=None):
+    # ODCN: initialize port as dead
+    # ODCN: TODO: set drop flow
     cmd = ['--', '--if-exists', 'del-port', dev, '--',
            'add-port', bridge, dev,
+           'tag=4095',
            '--', 'set', 'Interface', dev,
            'external-ids:iface-id=%s' % iface_id,
            'external-ids:iface-status=active',
            'external-ids:attached-mac=%s' % mac,
            'external-ids:vm-uuid=%s' % instance_id]
     if interface_type:
         cmd += ['type=%s' % interface_type]
     if vhost_server_path:
         cmd += ['options:vhost-server-path=%s' % vhost_server_path]
     return cmd
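To check that the workaround took effect, a freshly plugged port should carry the dead tag immediately. A small verification helper (hypothetical, shelling out to ovs-vsctl):

    import subprocess

    def ovs_port_tag(dev):
        # Read the Port's tag column from OVSDB; returns e.g. '4095' for
        # a dead port, or '[]' when the port is (dangerously) untagged.
        out = subprocess.check_output(
            ["ovs-vsctl", "get", "Port", dev, "tag"])
        return out.decode().strip()

    # Example, using the port name from the reproduction steps above:
    assert ovs_port_tag("qvoa848520b-08") == "4095"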

https://github.com/openstack/neutron/blob/stable/newton/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L995

     def port_dead(self, port, log_errors=True):
         '''Once a port has no binding, put it on the "dead vlan".

         :param port: an ovs_lib.VifPort object.
         '''
         # Don't kill a port if it's already dead
         cur_tag = self.int_br.db_get_val("Port", port.port_name, "tag",
                                          log_errors=log_errors)
+        # ODCN GM 20170915
+        if not cur_tag:
+            LOG.error('port_dead(): port %s has no tag', port.port_name)
+        # ODCN AJS 20170915
-        if cur_tag and cur_tag != constants.DEAD_VLAN_TAG:
+        if not cur_tag or cur_tag != constants.DEAD_VLAN_TAG:
+            LOG.info('port_dead(): put port %s on dead vlan', port.port_name)
             self.int_br.set_db_attribute("Port", port.port_name, "tag",
                                          constants.DEAD_VLAN_TAG,
                                          log_errors=log_errors)
             self.int_br.drop_port(in_port=port.ofport)

plugins/ml2/drivers/openvswitch/agent/openflow/ovs_ofctl/ovs_bridge.py

     def drop_port(self, in_port):
+        # ODCN AJS 20171004:
-        self.install_drop(priority=2, in_port=in_port)
+        self.install_drop(priority=65535, in_port=in_port)

Regards,

ODC Noord.
Gerhard Muntingh
Albert Siersema
Paul Peereboom

CVE References

CVE-2018-14636

Since this report concerns a possible security risk, an incomplete security advisory task has been added while the core security reviewers for the affected project or projects confirm the bug and discuss the scope of any vulnerability along with potential solutions.

Changed in ossa:
status: New → Incomplete
description: updated

Looking.

My fear is that this vulnerability might be present on releases newer than Newton.

Kevin Benton (kevinbenton) wrote :

The other thing that surprised me about this is that there are rules forwarding traffic to the port when it doesn't have a VLAN tag. Is it the behavior of the NORMAL action to flood to all untagged ports as if they were trunks carrying all VLANs?

The other suggested fixes all seem to be working under the assumption that traffic will flow to ports if no rules are installed. I think it would be safest to ensure that a newly added port on br-int doesn't receive any traffic until flows are set up. Going that route would also avoid having to ban all earlier versions of os-vif, because a VLAN tag would not need to be set on plugging.

Should we subscribe the Nova project for issue #2 (Order of creation)?

Also, isn't instance migration an admin-only operation? If so, can a regular user abuse this bug without an admin's cooperation?

@Tristan: perhaps looping in the nova project is not a bad idea, and yes, you're right that migration can only be performed by an admin according to the default policy.json.

Today I was looking at this a bit to see if 'resizing' would expose the same behavior, as that is available to regular users. I shall be able to find more time tomorrow to assess what's going on in Pike and Ocata.

Thanks Armando for the prompt feedback. I've subscribed nova-coresec to discuss the scope of this vulnerability, in particular issue #2.

@nova-coresec, is it correct to assume migration operations are restricted to admins? If a deployment does authorize regular users to do migration operations, wouldn't it be vulnerable to other unexpected issues anyway?

Paul Peereboom (peereb) wrote :

@Tristan: yes, nova live-migration is by default an admin-only operation. But forcing the port down can be done by a user as well; then it's just a matter of waiting for an admin to live-migrate the instance. We live-migrate instances regularly to empty a compute node and patch it.

Block migration would also be affected by the same behavior.

@Paul, well I tend to agree this is a vulnerability (class A according to the VMT taxonomy: http://security.openstack.org/vmt-process.html#incident-report-taxonomy ). However, since a malicious user can't control the conditions of exploitation, other VMT members may disagree with issuing an advisory for this issue.

I guess it boils down to how likely it is that a user can snoop sensitive traffic.

That's why I am trying to verify that the resize operation has the same behavior, because if that's the case, the user is very much in control!

Paul Peereboom (peereb) wrote :

Agreed, let's see if we can reproduce without admin credentials.

I used a Pike-based deployment as a testing platform and I can confirm that the resize operation exposes the same behavior as live/block migration, i.e. the port connected to the VM shows up without a local VLAN tag on the target host.

I still have to look at the extent of the damage that the user can inflict, but it looks like a malicious user can be in full control of the conditions of exploitation.

Thanks Armando,

so the next question is: can this be fixed across all the supported stable releases (e.g. Pike and Ocata)?
And is there something to fix in the os-vif and nova projects too?

Hi Tristan,

I wanted to look a bit deeper at what happens on the datapath, but I keep getting sidetracked. The report and analysis from Paul have been excellent, and I welcome that level of detail very much, but I want to see this through a bit more before formulating a recommendation.

Thanks to everyone involved!

OK, I finally had some uninterrupted time in which I could look into this more carefully. The TL;DR summary is: the behavior is exposed in live-migrate/block-migrate/resize operations. That said, I don't believe this vulnerability is serious, at least under default conditions. Here's why:

1) live/block migration is an admin operation: if the admin avoids turning a port down, the vulnerability is not exposed, and turning a port down is not strictly necessary. Furthermore, block migration is disruptive, in that users lose console and network connectivity to the instance.

2) resize is a user-allowed operation, but the operation leads to traffic disruption. Even though a user can explicitly turn a port DOWN and resize the instance without admin intervention, there's no way she can keep connectivity to the instance while the operation is in progress. Turning the port back UP will reinstate the local VLAN tag and restore connectivity. The loophole at that point is closed.

For these reasons, I would cautiously say that this vulnerability is not easily exploitable, but we do want to warn the admin of the potential loophole (OSSN B2?), and eventually address the neutron OVS agent codebase to ensure that a port is always tagged even when it's in ADMIN_DOWN after a migration/resize operation.

I'd recommend against changing os-vif to start a port on the dead vlan for two reasons:

a) neutron use cases go beyond just spinning up VMs. So if possible we should find a fix confined within neutron alone.

b) I feel that the os-vif fix as proposed is ineffective in closing any timing window, because while the port is untagged, it's practically unlikely that a user can get hold of the console remotely.

I am open to feedback on my recommendation, but in the meantime I would like to thank the reporter for the very detailed, well-thought-out and challenging report. I had fun cracking this one. Keep them coming!

Needless to say, open to further discussion.

@Tristan: what are the next steps? Is this OSSA then marked CONFIRMED? I would like to involve Ihar and Miguel in what comes next. Unless my rationale is flawed, I consider my triage complete and I'd like to push the ball into someone else's court.

Cheers,
Armando

Changed in neutron:
status: New → Triaged
Changed in nova:
status: New → Confirmed
Changed in neutron:
importance: Undecided → Low

Thank you Armando for this thorough investigation.

You mention console access as a requirement, but couldn't the user run a nohup task in the background to capture the traffic, and then retrieve the data later?

To answer your last question, if this is triaged as B2, then we would subscribe the ossg-coresec group so that an OSSN could be prepared and possibly sent to the downstream stakeholders through the embargo-notice mailing list before disclosing this bug. Though the live migration case may be bad enough (e.g. for operators going through that process often) that we may want to issue an advisory instead.

Please feel free to subscribe more neutron or nova developers to this bug, at least to check for unforeseen scenarios.

Gerhard Muntingh (gerhard-1) wrote :

Hi Armando, I'm also impressed by the thorough investigation. Especially the resize insight.

Please carefully read point two below. I think the issue is more serious because of it.

1) Turning the port down (as a user) is not necessary, but it extends the trunk-mode window from a couple of seconds to indefinite.

2) There's no need to put the port (eth0) back up if the attacker creates a secondary network interface (eth1) to access the instance. No console access is needed. This is the scenario described in the original report.

Thanks,
Gerhard.

I didn't mean that console access would be a requirement for the exploit, and indeed I overlooked the two-NIC scenario. Thanks for your feedback... I feared I was missing something ;)

That said, I have been wondering whether ensuring a tag on the port after resize/migration operations suffices to mitigate the exploitation.

Ihar identified the snippet of code [1] as potentially contentious, and I wonder whether that, along with location [2], would be the only two areas that need to be hardened.

Thoughts?
Thanks,
Armando

[1] https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1426-L1436
[2] https://github.com/openstack/neutron/blob/newton-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L995

Gerhard Muntingh (gerhard-1) wrote :

Hardening both locations doesn't fix the issue. It brings the time window back to a couple of seconds.

Out of curiosity, can't nova-compute skip the os-vif step, and have neutron configure the port entirely?

Looping will also work.

Also, don't forget the less serious issue #4.

I agree with Gerhard that hardening on the neutron side won't completely eliminate the issue in a nova/neutron setup. We should also have a fix on the os-vif side where we enforce the dead tag on newly created ports. AFAIU os-vif may not have been used before recent releases of Nova, so the fix may need to land in both os-vif and nova stable branches.

As for #4, I think it will not be an issue if we enforce the dead tag when putting a port down. (The problem is that currently we don't do it for unbound ports.)

If I try to capture all the things we may need to patch, I get the following:
- (neutron) port_dead to always set dead tag on new unbound ports;
- (os-vif/nova) enforce dead tag for new ovs ports.

The first patch fixes the issue on the neutron side for setups that are not using Nova / os-vif. In those setups, components calling Neutron should make sure newly created ports are dead; otherwise they are still exposed to a (short) vulnerable window. (Should we include this info in a release note / security report? Should we reach out to Neutron API consumers that may be affected?)

The second fix will completely close the short vulnerable window in neutron/nova setups discussed above.

(I actually feel that those two issues are independent and should be treated as two separate CVEs targeting different components.)

Then additional hardening patches could also be:
- (neutron) use drop rule for normal action on br-int; (may be invasive / breaking external code);
(and/or)
- (neutron) bump priority for drop flow rule set for dead ports.

The hardening may happen in public, since neither of those issues should expose anything as long as the os-vif/neutron patches described before are in place.

Does it make sense?

Is anyone from Nova team aware of the issue?

Ihar, I agree with your assessment, thanks for pushing this forward; I am a bit uneasy about the implications of adding a drop rule for normal actions on br-int for the reasons you've stated (maybe that breaks Tap-as-a-Service or SFC or other stuff?). We should also make sure that the change on the os-vif side will not break things like OVN/ODL or other drivers that rely on OVS, because I don't believe there's any local VLAN tagging involved for those. Have you considered these scenarios?

To answer Gerhard's question about os-vif, IIRC the rationale for having nova do the plugging is to enable scenarios where there's no network agent involved on the neutron side and/or to abstract the networking service from hypervisor internals.

Gerhard Muntingh (gerhard-1) wrote :

Thanks, that makes sense.

I suspect adding a drop rule for normal action on br-int will require quite some explicit flows to deal with all the other bridges (br-ex), but we'll see.

On neutron hardening: Bumping priority for drop flow rule set for dead ports will fix issue #4.

Miguel Lavalle (minsel) wrote :

Thanks to everybody who has helped in diagnosing this issue. It seems we have two sets of actions:

1) Fixing: that involves making port_dead always set the dead tag on new unbound ports in Neutron, and enforcing the dead tag for new ovs ports in nova / os-vif

2) Hardening of Neutron code

I suggest that as the very next step we concentrate on the fixing part. For this, the Neutron side seems to be uncontroversial. The only sticking point left is the potential breakage of other back-ends with the os-vif fix. Can we involve someone from OVN to see what the implications might be?

Jeremy Stanley (fungi) wrote :

Please subscribe anyone you think may be able to help identify a safe solution.

I subscribed Miguel Angel Ajo from the networking-ovn team to assess whether the proposed os-vif fix would break them.

(something we discussed with Armando yesterday)

If for some reason the suggested change to os-vif (marking all new ports with the dead tag) breaks other projects, we may roll it in as opt-in, where the consumer passes a flag through the binding profile dictionary that instructs nova to also mark a new port as dead; the flag would then be proxied into the os-vif binding code, which would enact the request. If the flag was not set or passed disabled, then nova would not request port death, and os-vif would do the same thing it does right now.

There are several compatibility issues to be concerned about here.

1. We effectively change the API payload for the binding extension; but because we already treat the profile dict as a large box of random crap to pass around between neutron and nova, unpatched nova should already be prepared to receive a dict with unknown keys.

2. It may be that os-vif is not patched while nova is, which would potentially lead to the binding call into os-vif failing because of the unknown flag passed by nova. I don't think we can bump the minimal supported os-vif version in requirements.txt, so nova will need to find a way to deal with older releases. If nothing else, we can always catch the TypeError: ... expected at least X arguments, got Y error and fall back to a flag-less call to os-vif (plus a warning); see the sketch after this comment.

This ^ would of course imply that instead of having one/two places to patch (os-vif + maybe neutron), we will have three (neutron -> nova -> os-vif).
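A minimal sketch of that fallback, assuming a hypothetical mark_dead keyword that no released os-vif actually accepts:

    import logging

    LOG = logging.getLogger(__name__)

    def plug_vif(plugin, vif, instance_info):
        """Plug a VIF, tolerating an os-vif release without the new flag."""
        try:
            # 'mark_dead' is the hypothetical opt-in flag discussed above;
            # an older os-vif rejects the unknown keyword with a TypeError.
            plugin.plug(vif, instance_info, mark_dead=True)
        except TypeError:
            LOG.warning("os-vif does not support mark_dead; plugging the "
                        "port without dead-VLAN initialization")
            plugin.plug(vif, instance_info)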

Miguel Angel Ajo (mangelajo) wrote :

Hi,

Thank you very much for looping me in,

I'm not sure whether networking-ovn, or other plugins that don't use tags + NORMAL rules in OVS, would work with the os-vif change (which, as far as I understood, creates the ports right away on the dead vlan).

Could we also loop in Numan Siddique to verify how OVN behaves in this case? Tomorrow is a bank holiday in Spain, and I will be physically away from my computer until late.

The question is whether the "vlan tag" will just be ignored given the OpenFlow rules in use, or what happens exactly.

Jeremy Stanley (fungi) wrote :

Miguel: my request from comment #26 definitely applies to you as well. Please subscribe anyone you think may be able to help identify a safe solution.

Miguel Angel Ajo (mangelajo) wrote :

Sorry, I didn't know that I had permissions to add more people. I did it with Numan, and I will verify now that I'm back from PTO.

Miguel Angel Ajo (mangelajo) wrote :

[root@primary ~]# ovs-vsctl show
a6b77a09-2647-47d7-8815-ef4d4f689ce8
    Bridge br-int
        fail_mode: secure
        Port "tap5a27427b-22"
            Interface "tap5a27427b-22"
        Port br-int
            Interface br-int
                type: internal
        Port "patch-br-int-to-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e"
            Interface "patch-br-int-to-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e"
                type: patch
                options: {peer="patch-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e-to-br-int"}
        Port "tapa4e1ef4d-40"
            tag: 4095
            Interface "tapa4e1ef4d-40"
        Port "tap4b56611b-99"
            tag: 4095
            Interface "tap4b56611b-99"
        Port "tap473919fe-31"
            tag: 4095
            Interface "tap473919fe-31"
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "patch-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e-to-br-int"
            Interface "patch-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e-to-br-int"
                type: patch
                options: {peer="patch-br-int-to-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e"}
[root@primary ~]# ip netns exec ovnmeta-a4e1ef4d-47ce-4b92-8043-70d88237eff1 ssh cirros@10.0.0.4
cirros@10.0.0.4's password:

[root@primary ~]# ip netns exec ovnmeta-a4e1ef4d-47ce-4b92-8043-70d88237eff1 ssh cirros@10.0.0.6
The authenticity of host '10.0.0.6 (10.0.0.6)' can't be established.
RSA key fingerprint is SHA256:cDLkQEB0LfxZfIvpd084MucUa4uohUd0COf3ArPa1A0.
RSA key fingerprint is MD5:cd:4e:f7:f5:e1:bb:61:ea:a7:4d:46:f8:67:43:20:00.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.0.6' (RSA) to the list of known hosts.
cirros@10.0.0.6's password:

[root@primary ~]# ip netns exec ovnmeta-a4e1ef4d-47ce-4b92-8043-70d88237eff1 ssh cirros@10.0.0.8
The authenticity of host '10.0.0.8 (10.0.0.8)' can't be established.
RSA key fingerprint is SHA256:XUX8EfLF2oRJLqDChEEw3smHGeHm7zcQapdpayZcb0Y.
RSA key fingerprint is MD5:33:00:41:fb:96:25:a4:79:b5:2e:63:c8:03:8f:2e:be.
Are you sure you want to continue connecting (yes/no)? ^C
[root@primary ~]#
[root@primary ~]#

Sorry for the delay. I've verified that the solution would be fine for OVN (and I suspect also for other OpenFlow-based solutions which don't use the "NORMAL" rule).

Miguel Lavalle (minsel) wrote :

@Tocayo Miguel Ajo,

Thanks for the testing. Much appreciated

Based on this I think we can move ahead with the "fixing" part, which involves:

1) port_dead to always set dead tag on new unbound ports in Neutron
2) Enforce dead tag for new ovs ports in nova / os-vif

What are the mechanics of creating, reviewing and merging these two patches under the embargo conditions of this CVE? I am available to work on this.

I'll still loop in someone from ODL to make sure they are fine with the proposed approach.

Jeremy Stanley (fungi) wrote :

If the issue is considered widespread, easy enough to exploit and able to do a sufficient amount of damage or obtain sensitive information with some certainty, then we would want to keep this report privately embargoed while fixes are developed, attached to this bug and tested/reviewed locally by the bug's subscribers. Once working fixes for master are identified and backports are created and attached to this bug for all supported stable branches, we will get a CVE identifier assigned, propose a disclosure schedule and provide advance notice and copies of the patches privately to downstream stakeholders.

If the risk from this bug is of questionable severity, we should subscribe the ossg-coresec team to get their opinion on the possibility of switching to our (simpler) public workflow for this report instead.

As a reminder, our process for both of these options is described here: https://security.openstack.org/vmt-process.html

I tried to reach out to Matt (Nova PTL) via email, so far no luck. I think their involvement is crucial here.

This is the neutron piece of the puzzle to make the ovs agent mark all untagged ports as dead. I am considering adding a new functional or fullstack test for the scenario too, but I'm posting what I have for now to start review.

In nova, there is also some code in nova/network/linux_net.py (specifically, the LinuxOVSInterfaceDriver class) that has a plug() method that calls add-port and would benefit from dead-tag setting too, but I failed to find a place where this code is hooked into actual services, so I wonder if this code is a dead end that will be cleaned up in due course after the os-vif switch. I will leave it up to the nova team to decide.

Backporting for neutron seems simple: the patch applies cleanly to stable/ocata which is the oldest version upstream supports.

Since I also have an interest in earlier releases that may still be supported by Red Hat, I also checked that the patch applies, with some adjustments, as far back as Liberty. Earlier releases are probably not affected because they don't include https://review.openstack.org/#/q/I5ef9665770df3a9bbaf79049b219fadd73e20309 which made the neutron ovs agent skip tagging ports as dead if they don't have a tag in the first place.

Which makes me think that we should make sure the CVE fix doesn't break the assumptions that the patch linked above made. The concern there, as far as I understand, is that drop flows are left on the bridge even after the port is gone, which may hinder performance, etc.
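A sketch of the cleanup that concern implies, using a neutron OVSBridge-style call (method name assumed, not a tested patch): when the port disappears, its per-port flows, including the port_dead() drop flow, should go with it.

    def cleanup_dead_port(br_int, ofport):
        # Delete all flows matching the vanished port's OpenFlow port
        # number, including the drop flow added by port_dead(), so stale
        # entries don't accumulate on br-int.
        br_int.delete_flows(in_port=ofport)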

As for nova / os-vif backporting, someone from the nova team should assess that too. I believe that, because of the os-vif switch in recent cycles, there may be some work needed to adapt the change I posted to older branches.

Changed in neutron:
importance: Low → High
Matt Riedemann (mriedem) wrote :

Sorry for being late, I just read up on all of this.

Just to clarify from the original description:

> Issue #2 Order of creation.
>
> The instance is actually migrated before the (networking) configuration is completed.
>
> Recommendation: do not finish the live migration until the underlying network configuration has been applied completely.

We don't need to try something here on the nova side ^? And what exactly does "networking configuration is completed" mean? Does it mean waiting for the vif-plugged event from neutron when nova plugs the vifs on the destination compute before starting the live migration in the hypervisor? If so, that's something we've been talking about doing for a while anyway, and it is part of this spec for using the neutron port binding API during live migration to reduce network downtime:

https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/neutron-new-port-binding-api.html

Waiting for vif-plugged events during live migration would be a backportable change, that entire spec would not.

The vif plugging dance is a bit more fuzzy to me wrt cold migrate / resize. I see that the libvirt driver doesn't wait for the vif-plugged event on the destination host:

https://github.com/openstack/nova/blob/bd3da5d763d2dcb0426e9acaa419e9e9574569ca/nova/virt/libvirt/driver.py#L8088-L8098

I haven't seen anyone really touch on this part of the issue since the original description though so if it's not critical to the fix here, ignore my questions.

--

Regarding the changes to os-vif, I'm definitely no expert in that code. I'll add Sean Mooney, Jay Pipes and Dan Smith to this review. I'm not sure what the backport impacts might be, but nova has been using os-vif for several releases now (since Newton: https://review.openstack.org/#/c/269672/ ).

As Ihar noted earlier, we probably can't bump the minimum required version of os-vif in global-requirements on stable branches, so we might have to make the nova code deal with a scenario where the change to os-vif doesn't exist (not sure if that would show up as a TypeError or AttributeError or what from os-vif; it would need to be tested).

sean mooney (sean-k-mooney) wrote :

reading through the recommendations i am not sure that using a drop rule and the null vlan (4095) is the correct approach.

referring to http://openvswitch.org/support/dist-docs/ovs-ofctl.8.txt, using

ovs-ofctl mod-port <bridge> <port> down|up

would appear to be more correct, as it disables the tx and rx queues on the interface.

a drop rule would only prevent the vm from transmitting packets; it would not prevent the vm from receiving traffic, as openflow does not allow you to match on output port.

my only concern with ovs-ofctl mod-port down is that odl/ovn and any other ml2 driver that uses vif_ovs would also need this change, to run ovs-ofctl mod-port up when they are finished.
granted, they would likely also need to be modified to fix the 4095 vlan and remove the drop rule too.

so to resolve Issue #1 (initially creating a trunk port), i would suggest that when we create the port we initially set it down and have the ovs agent set it up as part of wiring up the port.

Issue #2 (order of creation) should be resolved by the multiple port binding spec, i believe.

Issue #3 (not closing the port when it is down):

i think this can be resolved by updating neutron to use ovs-ofctl mod-port down|up to actually disable the port in ovs, instead of using the null vlan.

Issue #4 (putting the port administratively down actually puts the port on a compute-node shared vlan):
issue 4 would be resolved by moving to ovs-ofctl mod-port down|up.
however, if you are currently able to transmit/receive packets in a vm that is attached to vlan 4095, i think you have an ovs bug: vlan 4095 is reserved as the null/dead vlan and ovs should drop all packets to/from interfaces with a tag of 4095.

my main concern though is that solving this issue in os-vif in this way will not solve it for all use cases. when the neutron vif_type is ovs but hybrid_plug is set to false, os-vif is not used to plug the port. this configuration is used when kernel ovs is installed on the host and neutron is configured to use either the noop security group driver or the openvswitch conntrack driver.

when vif_type=ovs and hybrid_plug=false, libvirt is used to plug the vif into ovs directly.
libvirt does not set the state of the port or the mtu (that's a different bug). if we want to cover all edge cases we need to finally convert this codepath to use os-vif to handle the vif plugging when vif_type=ovs and hybrid_plug=false.
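A hedged sketch of the mod-port approach described above, shelling out to ovs-ofctl (note that Miguel Lavalle reports further down that a downed port still received traffic in his test, so this alone may not suffice):

    import subprocess

    def set_ovs_port_admin_state(bridge, port, up):
        # 'ovs-ofctl mod-port <bridge> <port> up|down' toggles the
        # interface's rx/tx queues, per ovs-ofctl(8), rather than merely
        # stopping transmitted packets the way a drop flow does.
        state = "up" if up else "down"
        subprocess.check_call(["ovs-ofctl", "mod-port", bridge, port, state])

    # e.g. set_ovs_port_admin_state("br-int", "qvoa848520b-08", up=False)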

Gerhard Muntingh (gerhard-1) wrote :

any news on this issue?

Jeremy Stanley (fungi) wrote :

Given this discussion has been going on for 5 months already, I've lost much hope of seeing consensus achieved in private across three different involved codebases plus coordinating agreeable backports with stable branch maintainers. I feel like switching to our public workflow and getting more eyes on it is going to be the only way forward.

Gerhard Muntingh (gerhard-1) wrote :

Switching to the public workflow sounds good, but let's make sure the workarounds are good enough and properly documented, so OpenStack admins get a fair chance of closing the vulnerability.

Miguel Angel Ajo (mangelajo) wrote :

Let's move to the public workflow.

There's also a bug in openvswitch that sometimes makes it "peg" the CPU when untagged ports exist on the system...

Jeremy Stanley (fungi) wrote :

Given a week with no objections from the (numerous) subscribers to this bug report, as well as the other bug linked by Miguel which sort of makes this issue public already, I'm going to switch it to public, triaging the vulnerability report as class B2 for now. If it's determined that sufficient fixes can be implemented in the stable branches of the services/libraries/drivers involved, we can revisit it as a possible class A vulnerability with an accompanying advisory.

description: updated
Changed in ossa:
status: Incomplete → Won't Fix
information type: Private Security → Public
tags: added: security
Miguel Lavalle (minsel) wrote :

@Jeremy,

I just went through the entire series of comments in this bug. I also looked at the bug that Miguel Ajo filed a few days ago (https://bugs.launchpad.net/neutron/+bug/1767422) and the associated information in https://bugzilla.redhat.com/show_bug.cgi?id=1558336. My comments / questions are the following:

1) The work that Miguel Ajo is doing as a consequence of https://bugs.launchpad.net/neutron/+bug/1767422 has to do with the agents and performance issues. Although it is related to what we have been discussing in this bug, it doesn't address it

2) Reading Sean's comment in #43 above carefully, I notice that he states the following: "if you are currently able to transmit/receive packets in a vm that is attached to vlan 4095, i think you have an ovs bug: vlan 4095 is reserved as the null/dead vlan and ovs should drop all packets to/from interfaces with a tag of 4095...". So he seems to have written his comments (specifically the recommendation about ovs-ofctl mod-port down|up) under the assumption that one of the problems to be solved is that VMs can receive traffic even when their interfaces have been put on the dead VLAN (4095). As far as I can tell, nobody has reported that here, which means that the patches proposed by Ihar in #37 and #38 are worth trying. Ihar also reports in #40 that the Neutron patch backports as-is to Ocata, and all the way to Liberty with some adjustments. So even though, according to Sean, there are use cases that might not be covered, the patches in #37 and #38 start us in the right direction of mitigating the problem. Once they are in place, we can start nibbling at os-vif to work on the use cases that, according to Sean, are not fixed. I'd be happy to take a stab at that.

3) Am I correct in my understanding that at this point we can submit the patches in #37 and #38 under the public / normal process in Gerrit? As soon as I get confirmation of this, I will submit the patches for review

Miguel Angel Ajo (mangelajo) wrote :

I agree with @Miguel's line of thought here, and in the related patch I'm working on I will add a table 0 rule to drop all VLAN 4095 traffic, although I could keep that as a separate patch if I see that the whole patch doesn't backport far

Since I posted some patches here, to everyone: feel free to take them over and post them to gerrit, with or without attribution. I won't have time to polish and post them for merge. (I am open to doing some reviews, though, if needed.)

Miguel Angel Ajo (mangelajo) wrote :

> I will add a table 0 rule to drop all VLAN 4095 traffic <-- this doesn't work for traffic handled by NORMAL, yikes (thanks Jakub for the pointer).

thanks @Ihar! ;)

Miguel Lavalle (minsel) wrote :

I tested:

sudo ovs-ofctl mod-port br-int tap903894b4-1f down

and it still receives traffic. So I think the route suggested by Miguel Ajo in #53 is the way to go

Gerhard Muntingh (gerhard-1) wrote :

Any progress on this (now public) security issue?

Fix proposed to branch: master
Review: https://review.openstack.org/594118

Changed in os-vif:
assignee: nobody → Slawek Kaplonski (slaweq)
status: New → In Progress

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/594118
Reason: That isn't a good approach for sure. We will have to find something better.

Slawek Kaplonski (slaweq) wrote :

Yesterday I talked with Brian about this issue and we thought about a possible solution confined to Neutron.
Maybe a change to the openflows in br-int can fix this issue.
We can try something like the following (see the sketch after this comment):
1. Set the default action in br-int to DROP.
2. For each lvm id (the local vlan id used for each network on the host), set an OF rule with ACTION=normal to process such packets the same way all rules are processed now.

What do you think about such a solution?
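A rough sketch of that idea with neutron OVSBridge-style flow calls (method names assumed; not a tested patch):

    def install_default_drop(br_int, local_vlan_ids):
        # 1. Lowest-priority catch-all in table 0: drop anything that no
        #    other rule explicitly allows.
        br_int.add_flow(table=0, priority=0, actions="drop")
        # 2. Restore NORMAL switching only for the local VLANs actually
        #    provisioned on this host.
        for lvid in local_vlan_ids:
            br_int.add_flow(table=0, priority=1, dl_vlan=lvid,
                            actions="normal")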

Hi Slawek,

You kind of read my mind, although I was thinking of a somewhat broader solution... If you read the notes in the bug, Sean Mooney left a couple of extensive ones. In a sense, they complicated solving the issue, because he shone a light on the entire problem, i.e., Nova and os-vif. Why don't we have a conversation with him in Denver and try to put together a plan to fix the entire thing?

Best regards

Miguel


Fix proposed to branch: master
Review: https://review.openstack.org/602384

Changed in os-vif:
assignee: Slawek Kaplonski (slaweq) → sean mooney (sean-k-mooney)

Fix proposed to branch: master
Review: https://review.openstack.org/602432

Changed in nova:
assignee: nobody → sean mooney (sean-k-mooney)
status: Confirmed → In Progress

Change abandoned by sean mooney (<email address hidden>) on branch: master
Review: https://review.openstack.org/602432
Reason: this should not be needed

sean mooney (sean-k-mooney) wrote :

Note: a CVE was issued for this bug on 2018-07-27: http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-14636

as a quick status update on this bug:

first, i want to mention that a second variant of this issue has also been highlighted to me. i will be filing a separate private bug for that variant, as it will require changes outside of os-vif to address and will be more involved. the second variant is also not covered by the existing cve description, so i want to track it separately.

regarding the state of this bug, i have proposed https://review.openstack.org/602432 to os-vif master and a stable/rocky backport https://review.openstack.org/609850. i am in the process of rebasing that backport and expect both to be viable to merge later today.

There are some limitations to this fix, however:

- First, for it to work correctly it requires the multiple port binding support added to nova in Rocky. As such, on stable/queens and older branches the backport is only a partial mitigation, as nova will not wait for neutron to signal that it has finished wiring up the port before resuming the guest on the destination node.

- Second, if the neutron ovs ml2 agent crashes after the ml2 driver binds the port on the destination node but before the ml2 agent wires up the port added to ovs by os-vif, then we cannot detect this from nova/os-vif and the mitigation will not be effective. As the neutron control plane would be in an undefined state if its agents crashed/exited on the compute node, i feel it's fair to declare this limitation out of scope of this bug.

It may be possible to address one or both of these limitations as part of the second variant's mitigation; however, that will require more extensive modifications to nova, neutron and os-vif which are unlikely to be easily backportable, as they may require a minor backwards-compatible api extension.

Change abandoned by sean mooney (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/609851
Reason: i'll backport a separate review, as master has changed significantly from this version.

Fix proposed to branch: master
Review: https://review.openstack.org/616609

Changed in neutron:
assignee: nobody → sean mooney (sean-k-mooney)
status: Triaged → In Progress