Eavesdropping private traffic

Bug #1734320 reported by Paul Peereboom
50
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Undecided
sean mooney
OpenStack Security Advisory
Undecided
Unassigned
neutron
High
Rodolfo Alonso
os-vif
High
sean mooney

Bug Description

Eavesdropping private traffic
=============================

Abstract
--------

We've discovered a security issue that allows end users within their own private network to receive from, and send traffic to, other private networks on the same compute node.

Description
-----------

During live-migration there is a small time window where the ports of instances are untagged. Instances have a port trunked to the integration bridge and receive 802.1Q tagged private traffic from other tenants.

If the port is administratively down during live migration, the port will remain in trunk mode indefinitely.

Traffic is possible between ports is that are administratively down, even between tenants self-service networks.

Conditions
----------

The following conditions are necessary.

* Openvswitch Self-service networks
* An Openstack administrator or an automated process needs to schedule a Live migration

We tested this on newton.

Issues
------

This outcome is the result of multiple independent issues. We will list the most important first, and follow with bugs that create a fragile situation.

Issue #1 Initially creating a trunk port

When the port is initially created, it is in trunk mode. This creates a fail-open situation.
See: https://github.com/openstack/os-vif/blob/newton-eol/vif_plug_ovs/linux_net.py#L52
Recommendation: create ports in the port_dead state, don't leave it dangling in trunk-mode. Add a drop-flow initially.

Issue #2 Order of creation.

The instance is actually migrated before the (networking) configuration is completed.

Recommendation: wait with finishing the live migration until the underlying configuration has been applied completely.

Issue #3 Not closing the port when it is down.

Neutron calls the port_dead function to ensure the port is down. It sets the tag to 4095 and adds a "drop" flow if (and only if) there is already another tag on the port. The port_dead function will keep untagged ports untagged.

https://github.com/openstack/neutron/blob/stable/newton/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L995

Recommendation: Make port_dead also shut the port if no tag is found. Log a warning if this happens.

Issue #4 Putting the port administratively down actually puts the port on a compute node shared vlan

Instances from different projects on different private networks can talk to each other if they put their ports down. The code does install an openflow "drop" rule but it has a lower priority (2) than the allow rules.

Recommendation: Increase the port_dead openflow drop rule priority to MAX

Timeline
--------

 2017-09-14 Discovery eavesdropping issue
 2017-09-15 Verify workaround.
 2017-10-04 Discovery port-down-traffic issue
 2017-11-24 Vendor Disclosure to Openstack

Steps to reproduce
------------------

1. Attach an instance to two networks:

admin$ openstack server create --nic net-id=<net-uuid1> --nic net-id=<net-uuid2> --image <image_id> --flavor <flavor_id> instance_temp

2. Attach a FIP to the instance to be able to log in to this instance

3. Verify:

admin$ openstack server show -c name -c addresses fe28a2ee-098f-4425-9d3c-8e2cd383547d

+-----------+-----------------------------------------------------------------------------+
| Field | Value |
+-----------+-----------------------------------------------------------------------------+
| addresses | network1=192.168.99.8, <FIP>; network2=192.168.80.14 |
| name | instance_temp |
+-----------+-----------------------------------------------------------------------------+

4. Ssh to the instance using network1 and run a tcpdump on the other port network2

[root@instance_temp]$ tcpdump -eeenni eth1

5. Get port-id of network2

admin$ nova interface-list fe28a2ee-098f-4425-9d3c-8e2cd383547d
+------------+--------------------------------------+--------------------------------------+---------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+---------------+-------------------+
| ACTIVE | a848520b-0814-4030-bb48-49e4b5cf8160 | d69028f7-9558-4f14-8ce6-29cb8f1c19cd | 192.168.80.14 | fa:16:3e:2d:8b:7b |
| ACTIVE | fad148ca-cf7a-4839-aac3-a2cd8d1d2260 | d22c22ae-0a42-4e3b-8144-f28534c3439a | 192.168.99.8 | fa:16:3e:60:2c:fa |
+------------+--------------------------------------+--------------------------------------+---------------+-------------------+

6. Force port down on network 2

admin$ neutron port-update a848520b-0814-4030-bb48-49e4b5cf8160 --admin-state-up False

7. Port gets tagged with vlan 4095, the dead vlan tag, which is normal:

compute1# grep a848520b-0814-4030-bb48-49e4b5cf8160 /var/log/neutron/neutron-openvswitch-agent.log | tail -1
INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-e008feb3-8a35-4c97-adac-b48ff88165b2 - - - - -] VIF port: a848520b-0814-4030-bb48-49e4b5cf8160 admin state up disabled, putting on the dead VLAN

8. Verify the port is tagged with vlan 4095

compute1# ovs-vsctl show | grep -A3 qvoa848520b-08
      Port "qvoa848520b-08"
          tag: 4095
          Interface "qvoa848520b-08"

9. Now live-migrate the instance:

admin# nova live-migration fe28a2ee-098f-4425-9d3c-8e2cd383547d

10. Verify the tag is gone on compute2, and take a deep breath

compute2# ovs-vsctl show | grep -A3 qvoa848520b-08
      Port "qvoa848520b-08"
          Interface "qvoa848520b-08"
      Port...
compute2# echo "Wut!"

11. Now traffic of all other self-service networks present on compute2 can be sniffed from instance_temp

[root@instance_temp] tcpdump -eenni eth1
13:14:31.748266 fa:16:3e:6a:17:38 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 10, p 0, ethertype ARP, Request who-has 10.103.12.160 tell 10.103.12.152, length 28
13:14:31.804573 fa:16:3e:e8:a2:d2 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.70, length 28
13:14:31.810482 fa:16:3e:95:ca:3a > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.154, length 28
13:14:31.977820 fa:16:3e:6f:f4:9b > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.150, length 28
13:14:31.979590 fa:16:3e:0f:3d:cc > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 9, p 0, ethertype ARP, Request who-has 10.103.9.163 tell 10.103.9.1, length 28
13:14:32.048082 fa:16:3e:65:64:38 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.101, length 28
13:14:32.127400 fa:16:3e:30:cb:b5 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 10, p 0, ethertype ARP, Request who-has 10.103.12.160 tell 10.103.12.165, length 28
13:14:32.141982 fa:16:3e:96:cd:b0 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.100, length 28
13:14:32.205327 fa:16:3e:a2:0b:76 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.153, length 28
13:14:32.444142 fa:16:3e:1f:db:ed > 01:00:5e:00:00:12, ethertype 802.1Q (0x8100), length 58: vlan 72, p 0, ethertype IPv4, 192.168.99.212 > 224.0.0.18: VRRPv2, Advertisement, vrid 50, prio 103, authtype none, intvl 1s, length 20
13:14:32.449497 fa:16:3e:1c:24:c0 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.20, length 28
13:14:32.476015 fa:16:3e:f2:3b:97 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33, p 0, ethertype ARP, Request who-has 10.0.1.9 tell 10.0.1.22, length 28
13:14:32.575034 fa:16:3e:44:fe:35 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 10, p 0, ethertype ARP, Request who-has 10.103.12.160 tell 10.103.12.163, length 28
13:14:32.676185 fa:16:3e:1e:92:d7 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 10, p 0, ethertype ARP, Request who-has 10.103.12.160 tell 10.103.12.150, length 28
13:14:32.711755 fa:16:3e:99:6c:c8 > 01:00:5e:00:00:12, ethertype 802.1Q (0x8100), length 62: vlan 10, p 0, ethertype IPv4, 10.103.12.154 > 224.0.0.18: VRRPv2, Advertisement, vrid 2, prio 49, authtype simple, intvl 1s, length 24
13:14:32.711773 fa:16:3e:f5:23:d5 > 01:00:5e:00:00:12, ethertype 802.1Q (0x8100), length 58: vlan 12, p 0, ethertype IPv4, 10.103.15.154 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 49, authtype simple, intvl 1s, length 20

Workaround
----------

We temporary fixed this issue by forcing the dead vlan tag on port creation on compute nodes:

/usr/lib/python2.7/site-packages/vif_plug_ovs/linux_net.py:

def _create_ovs_vif_cmd(bridge, dev, iface_id, mac,
                        instance_id, interface_type=None,
                        vhost_server_path=None):
+ # ODCN: initialize port as dead
+ # ODCN: TODO: set drop flow
    cmd = ['--', '--if-exists', 'del-port', dev, '--',
            'add-port', bridge, dev,
+ 'tag=4095',
            '--', 'set', 'Interface', dev,
            'external-ids:iface-id=%s' % iface_id,
            'external-ids:iface-status=active',
            'external-ids:attached-mac=%s' % mac,
            'external-ids:vm-uuid=%s' % instance_id]
    if interface_type:
        cmd += ['type=%s' % interface_type]
    if vhost_server_path:
        cmd += ['options:vhost-server-path=%s' % vhost_server_path]
    return cmd

https://github.com/openstack/neutron/blob/stable/newton/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L995

    def port_dead(self, port, log_errors=True):
        '''Once a port has no binding, put it on the "dead vlan".

        :param port: an ovs_lib.VifPort object.
        '''
        # Don't kill a port if it's already dead
        cur_tag = self.int_br.db_get_val("Port", port.port_name, "tag",
                                         log_errors=log_errors)
+ # ODCN GM 20170915
+ if not cur_tag:
+ LOG.error('port_dead(): port %s has no tag', port.port_name)
+ # ODCN AJS 20170915
+ if not cur_tag or cur_tag != constants.DEAD_VLAN_TAG:
- if cur_tag and cur_tag != constants.DEAD_VLAN_TAG:
           LOG.info('port_dead(): put port %s on dead vlan', port.port_name)
           self.int_br.set_db_attribute("Port", port.port_name, "tag",
                                         constants.DEAD_VLAN_TAG,
                                         log_errors=log_errors)
            self.int_br.drop_port(in_port=port.ofport)

plugins/ml2/drivers/openvswitch/agent/openflow/ovs_ofctl/ovs_bridge.py
    def drop_port(self, in_port):
+ # ODCN AJS 20171004:
- self.install_drop(priority=2, in_port=in_port)
+ self.install_drop(priority=65535, in_port=in_port)

Regards,

ODC Noord.
Gerhard Muntingh
Albert Siersema
Paul Peereboom

CVE References

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Since this report concerns a possible security risk, an incomplete security advisory task has been added while the core security reviewers for the affected project or projects confirm the bug and discuss the scope of any vulnerability along with potential solutions.

Changed in ossa:
status: New → Incomplete
description: updated
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Looking.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

My fear is that this vulnerability might be present on releases newer than Newton.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

The other thing that surprised me about this is that there are rules forwarding traffic to the port when it doesn't have a VLAN tag. Is the behavior of the NORMAL action to flood to all untagged ports as if they were trunks carrying all VLANs?

The other suggested fixes all seem to be working under the assumption that traffic will flow to ports if no rules are installed. I think it would be safest to ensure that a newly added port to br-int doesn't receive any traffic until flows are setup. Going that route would also avoid having to ban all earlier versions of os-vif because a VLAN tag would not need to be set on plugging.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Should we subscribe the Nova project for the issue #2 (Order of creation)?

Also, isn't instance migration an admin only operation? If so, can a regular user abuse this bug without an admin cooperation?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Tristan: perhaps looping in the nova project is not a bad idea, and yes, you're right that migration can only be performed by an admin according to the default policy.json.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Today I was looking at this a bit to see if 'resizing' would expose the same behavior, as that is available to regular users. I shall be able to find more time tomorrow to assess what's going on pike and ocata.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Thanks Armando for the prompt feedback. I've subscribed nova-coresec to discuss the scope of this vulnerability, in particular the issue #2.

@nova-coresec, is it correct to assume migration operation are restricted to admin? If a deployment does authorize regular user to do migration operation, wouldn't they be vulnerable to other unexpected issues anyway?

Revision history for this message
Paul Peereboom (peereb) wrote :

@Tristan: yes nova live-migration is by default an admin only operation. But forcing the port-down can be done by a user aswell. Then it's just waiting for an admin to do a live-migration of a instance. We live-migrate instances regularly to empty a compute node and patch the compute node.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Block migration would also be affected by the same behavior.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

@Paul, well I tend to agree this is a vulnerability (class A according to the VMT taxonomy: http://security.openstack.org/vmt-process.html#incident-report-taxonomy ). However, since a malicious user can't control the condition of exploitation, other VMT member may disagree with issuing an advisory for this issue.

I guess it boils down to how likely a user can snoop sensitive traffic.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

That's why I am trying to verify that the resize operation has the same behavior, because if that's the case, the user is very much in control!

Revision history for this message
Paul Peereboom (peereb) wrote :

Agree lets see if we can reproduce without admin credentials.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I used a Pike-based deployment as a testing platform and I can confirm that the resize operation exposes the same behavior as live/block migration (as reported in this report), i.e. the port connected to the VM shows up without a local VLAN tag on the target host.

I still have to look at the extent of the damage that the user can inflict. So, it looks like the malicious user can be in full control of the condition of exploitation.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Thanks Armando,

so the next question is, can this be fixed across all the supported stable release? (e.g.: pike and ocata) ?
And is there something to fix in the os-vif and nova project too?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Hi Tristan,

I wanted to look a bit deeper at what happens on the datapath, but I keep getting sidetracked. The report and analysis from Paul has been excellent, and I welcome that level of detail very much, but I want to see this through a bit more to formulate a recommendation.

Thanks for everyone involved!

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

OK, I finally had some interrupted time in which I could look into this more carefully. the TL;DR summary is: the behavior is exposed in live-migrate/block-migrate/resize operations. That said, I don't believe this vulnerability is serious at least under default conditions. Here's why:

1) live/block migration is an admin operation: if the admin avoids turning a port down, the vulnerability is not exposed. Turning a port down is not strictly necessary. Furthermore block migration is disruptive, in that users loses console and network connectivity to the instance.

2) resize is a user allowed operation, but the operation leads to traffic disruption. Even though a user can explicitly turn a port DOWN and resize the instance without admin intervention, there's no way she can keep connectivity to the instance while the operation is in progress. Turning the port back UP will reinstate the local VLAN tag and restore connectivity. The loophole at that point is closed.

For these reasons, I would cautiously say that this vulnerability is not easily exploitable, but we do want to warn the admin of the potential loophole (OSSN B2?), and eventually address the neutron OVS agent codebase to ensure that a port is always tagged even when it's in ADMIN_DOWN after a migration/resize operation.

I'd recommend against changing os-vif to start a port on the dead vlan for two reasons:

a) neutron use cases go beyond just spinning up VMs. So if possible we should find a fix confined within neutron alone.

b) I feel that the os-vif fix as proposed is ineffective in closing any timing window because while the port is untagged, it's practically unlikely that a user can get hold of the console remotely.

I am open to feedback on my recommendation, but in the meantime I would like to thank the reporter for the very detailed, well thought and challenging report. I had fun cracking this one. Keep them coming!

Needless to say, open to further discussion.

@Tristan: what's next steps? Is this OSSA then marked CONFIRMED? I would like to involve Ihar and Miguel for what comes next. Unless my rationale is flawed, I consider my triage complete and I'd like to push the ball into someone else's court.

Cheers,
Armando

Changed in neutron:
status: New → Triaged
Changed in nova:
status: New → Confirmed
Changed in neutron:
importance: Undecided → Low
Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Thank you Armando for this thorough investigation.

You mention console access as a requirement, but couldn't the user run nohup task in the background to capture the traffic, and then retrieve the data later?

To answer your last question, if this is triaged as B2, then we would subscribe the ossg-coresec group so that an OSSN could be prepared and may be sent to the downstream stakeholders through the embargo-notice mailing list before disclosing this bug. Though the live migration case may be bad enough (e.g.: for operator going through that process often) that we may want to issue an advisory instead.

Please feel free to subscribe more neutron or nova developers to this bug, at least to check for unforeseen scenarios.

Revision history for this message
Gerhard Muntingh (gerhard-1) wrote :

Hi Armando, I'm also impressed by the thorough investigation. Especially the resize insight.

Please carefully read point two below. I think the issue is more serious because of it.

1) Turning the port down (as a user) is not necessary, but it changes the trunk-mode time from a couple of seconds to indefinitely.

2) There's no need to put the port (eth0) back up, if the attacker creates a secondary network interface (eth1) to access the instance. No console access is needed. This is the described scenario in the original report.

Thanks,
Gerhard.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I didn't mean that console would be a requirement for the exploit, and indeed I overlooked the two-NICs scenario. Thanks for your feedback...I feared I was missing something ;)

That said, I have been thinking if ensuring a tag on the port after resize/migration operation suffices to mitigate the exploitation.

Ihar identified snippet of code [1] as potential contentious, and I wonder if along location [2] would be the only two areas required to be hardened.

Thoughts?
Thanks,
Armando

[1] https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1426-L1436
[2] https://github.com/openstack/neutron/blob/newton-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L995

Revision history for this message
Gerhard Muntingh (gerhard-1) wrote :

Hardening both locations doesn't fix the issue. It brings the time window back to a couple of seconds.

Out of curiosity, can't nova-compute skip the os-vif step, and have neutron configure the port entirely?

Looping will also work.

Also, don't forget the less serious issue #4.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I agree with Gerhard on the fact that hardening on neutron side won't completely eliminate the issue in nova/neutron setup. We should also have a fix on os-vif side where we enforce dead tag on newly created ports. AFAIU os-vif may not have been used before recent releases of Nova so the fix may need to be both in os-vif and nova stable branches.

As for #4, I think it will not be an issue if we enforce dead tag when putting a port down. (The problem is currently we don't do it for unbound ports.)

If I try to capture all the things we may need to patch, I get the following:
- (neutron) port_dead to always set dead tag on new unbound ports;
- (os-vif/nova) enforce dead tag for new ovs ports.

The first patch fixes the issue on neutron side for setups that are not using Nova / os-vif. In those setups, components calling to Neutron should make sure newly created ports are dead, otherwise they are still exposed to a (short) vulnerable window. (Should we include this info in a release note / security report? Should we reach out to Neutron API consumers that may be affected?)

The second fix will completely close the short vulnerable window in neutron/nova setups discussed above.

(I actually feel that those two issues are independent and should be treated as two separate CVEs targeting different components.)

Then additional hardening patches could also be:
- (neutron) use drop rule for normal action on br-int; (may be invasive / breaking external code);
(and/or)
- (neutron) bump priority for drop flow rule set for dead ports.

The hardening may happen in public since neither of those issues should expose anything as long as os-vif/neutron patches as described before.

Does it make sense?

Is anyone from Nova team aware of the issue?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Ihar, I agree with your assessment, thanks for pushing this forward; I am a bit uneasy about the implications of adding a drop rule for normal actions on br-int for the reasons you've stated (maybe that breaks Tap as a Service or SFC or other stuff?). We should also make sure that the change on the os-vif side will not break things like OVN/ODL or other drivers that rely on OVS because I don't believe there's any local VLAN tagging involved for those. Have you considered these scenarios?

To answer Gerhard's question about os-vif, iirc the rationale of having nova do the plugging is to enable scenarios where there's no network agent involved on the neutron side and/or to abstract the networking service from hypervisor internals.

Revision history for this message
Gerhard Muntingh (gerhard-1) wrote :

Thanks, that makes sense.

I suspect adding a drop rule for normal action on br-int will require quite some explicit flows to deal with all the other bridges (br-ex), but we'll see.

On neutron hardening: Bumping priority for drop flow rule set for dead ports will fix issue #4.

Revision history for this message
Miguel Lavalle (minsel) wrote :

Thanks to everybody who has helped in diagnosing this issue. It seems we have two sets of actions:

1) Fixing: that involves port_dead to always set dead tag on new unbound ports in Neutron and enforce dead tag for new ovs ports in nova / os-vif

2) Hardening of Neutron code

I suggest that as the very next step we concentrate in the fixing part. For this the Neutron side seems to be uncontroversial. The only sticking point left is the potential breakage of other back-ends with the os-vif fix. Can we involve someone from OVN to see what the implications might be?

Revision history for this message
Jeremy Stanley (fungi) wrote :

Please subscribe anyone you think may be able to help identify a safe solution.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I subscribed Miguel Angel Ajo from networking-ovn team to assess whether proposed os-vif fix would break them.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

(something we discussed with Armando yesterday)

if for some reason the suggested change to os-vif to mark all new ports with dead tag is breaking other projects, we may roll it in as opt-in, where consumer passes some flag through binding profile dictionary that would instruct nova to also mark a new port as dead; then the flag would be proxied into os-vif binding code that would enact the request. If the flag was not set / passed disabled, then nova would not request port death, and os-vif would do the same action as it does right now.

There are several compatibility issues to be concerned about here.

1. we effectively change api payload for binding extension; but because we already treat profile dict as a large box of random crap to pass around between neutron and nova, unpatched nova should already be ready to receive a dict with unknown keys.

2. it may be that os-vif is not patched while nova is, which would potentially lead to binding call to os-vif failing because of unknown flag passed by nova. I don't think we can bump minimal supported os-vif version in requirements.txt, so nova will need to find a way to deal with older releases. If nothing else, we can always catch TypeError: ... expected at least X arguments, got Y error and fall back to flag-less call to os-vif (plus a warning).

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

This ^ would of course imply that instead of having one/two places to patch (os-vif + maybe neutron), we will have three (neutron -> nova -> os-vif).

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Hi,

Thank you very much for looping me in,

I'm not sure if networking-ovn, or other plugins that don't use tags+NORMAL-rules in OVS would work with the os-vif change (which as far as I undertood creates the ports right away in the dead vlan).

Could we also loop Numan Siddique to verify how ovn behaves in this case?, tomorrow we have a bank holiday in Spain, and I will be physically away from my computer until late.

The question is if the "vlan tag" will be just ignored based on the used OpenFlow rules, or what happens exactly.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Miguel: my request from comment #26 definitely applies to you as well. Please subscribe anyone you think may be able to help identify a safe solution.

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Sorry, I didn't know that I had permissions to add more people, I did it with numan, but I will verify now that I'm back from PTO

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

[root@primary ~]# ovs-vsctl show
a6b77a09-2647-47d7-8815-ef4d4f689ce8
    Bridge br-int
        fail_mode: secure
        Port "tap5a27427b-22"
            Interface "tap5a27427b-22"
        Port br-int
            Interface br-int
                type: internal
        Port "patch-br-int-to-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e"
            Interface "patch-br-int-to-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e"
                type: patch
                options: {peer="patch-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e-to-br-int"}
        Port "tapa4e1ef4d-40"
            tag: 4095
            Interface "tapa4e1ef4d-40"
        Port "tap4b56611b-99"
            tag: 4095
            Interface "tap4b56611b-99"
        Port "tap473919fe-31"
            tag: 4095
            Interface "tap473919fe-31"
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "patch-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e-to-br-int"
            Interface "patch-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e-to-br-int"
                type: patch
                options: {peer="patch-br-int-to-provnet-d556080a-799f-4621-bb2d-d4ac9b8bb32e"}
[root@primary ~]# ip netns exec ovnmeta-a4e1ef4d-47ce-4b92-8043-70d88237eff1 ssh cirros@10.0.0.4
cirros@10.0.0.4's password:

[root@primary ~]# ip netns exec ovnmeta-a4e1ef4d-47ce-4b92-8043-70d88237eff1 ssh cirros@10.0.0.6
The authenticity of host '10.0.0.6 (10.0.0.6)' can't be established.
RSA key fingerprint is SHA256:cDLkQEB0LfxZfIvpd084MucUa4uohUd0COf3ArPa1A0.
RSA key fingerprint is MD5:cd:4e:f7:f5:e1:bb:61:ea:a7:4d:46:f8:67:43:20:00.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.0.6' (RSA) to the list of known hosts.
cirros@10.0.0.6's password:

[root@primary ~]# ip netns exec ovnmeta-a4e1ef4d-47ce-4b92-8043-70d88237eff1 ssh cirros@10.0.0.8
The authenticity of host '10.0.0.8 (10.0.0.8)' can't be established.
RSA key fingerprint is SHA256:XUX8EfLF2oRJLqDChEEw3smHGeHm7zcQapdpayZcb0Y.
RSA key fingerprint is MD5:33:00:41:fb:96:25:a4:79:b5:2e:63:c8:03:8f:2e:be.
Are you sure you want to continue connecting (yes/no)? ^C
[root@primary ~]#
[root@primary ~]#

Sorry for the delay, I've verified that the solution would be fine for OVN (and I suspect that also for other openflow based solutions which don't use the "NORMAL" rule).

Revision history for this message
Miguel Lavalle (minsel) wrote :

@Tocayo Miguel Ajo,

Thanks for the testing. Much appreciated

Based on this I think we can move ahead with the "fixing" part, which involves:

1) port_dead to always set dead tag on new unbound ports in Neutron
2) Enforce dead tag for new ovs ports in nova / os-vif

What are the mechanics of creating, reviewing and merging these two patches under the embargo conditions of this CVE? I am available to work on this

I'll still loop in someone from ODL to make sure they are fine with the proposed approach

Revision history for this message
Jeremy Stanley (fungi) wrote :

If the issue is considered widespread, easy enough to exploit and able to do a sufficient amount of damage or obtain sensitive information with some certainty, then we would want to keep this report privately embargoed while fixes are developed, attached to this bug and tested/reviewed locally by the bug's subscribers. Once working fixes for master are identified and backports are created and attached to this bug for all supported stable branches, we will get a CVE identifier assigned, propose a disclosure schedule and provide advance notice and copies of the patches privately to downstream stakeholders.

If the risk from this bug is of questionable severity, we should subscribe the ossg-coresec team to get their opinion on the possibility of switching to our (simpler) public workflow for this report instead.

As a reminder, our process for both of these options is described here: https://security.openstack.org/vmt-process.html

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I tried to reach out to Matt (Nova PTL) via email, so far no luck. I think their involvement is crucial here.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

This is the neutron piece of the puzzle to make ovs agent mark all untagged ports as dead. I am considering adding a new functional or fullstack test for the scenario too, but posting what I have for now to start review.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

In nova, there is also some code in nova/network/linux_net.py (specifically, class LinuxOVSInterfaceDriver) that seems to have plug() method that calls add-port and would benefit from dead tag setting too, but I failed to find a place where this code is hooked into actual services, so I wonder if this code is not some dead end that will be cleaned up in due course after os-vif switch. I will leave it up to nova team to decide.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Backporting for neutron seems simple: the patch applies cleanly to stable/ocata which is the oldest version upstream supports.

Since I also have an interest in earlier releases that may still be supported by Red Hat, I also checked that the patch applies with some adjustments up to liberty. As for earlier releases, they are probably not affected because they don't include https://review.openstack.org/#/q/I5ef9665770df3a9bbaf79049b219fadd73e20309 that made neutron ovs agent skip tagging ports as dead if they don't have a tag in the first place.

Which makes me think that we should make sure that the CVE fix doesn't break the assumptions that the patch I linked to above made. The concern there is, as far as I understand, is that drop flows are left on the bridge even after port is gone, which may hinder performance etal.

Changed in neutron:
importance: Low → High
Jeremy Stanley (fungi)
description: updated
Changed in ossa:
status: Incomplete → Won't Fix
information type: Private Security → Public
tags: added: security
Changed in os-vif:
assignee: nobody → Slawek Kaplonski (slaweq)
status: New → In Progress
Changed in os-vif:
assignee: Slawek Kaplonski (slaweq) → sean mooney (sean-k-mooney)
Changed in nova:
assignee: nobody → sean mooney (sean-k-mooney)
status: Confirmed → In Progress
Changed in neutron:
assignee: nobody → sean mooney (sean-k-mooney)
status: Triaged → In Progress
Changed in os-vif:
importance: Undecided → High
status: In Progress → Fix Committed
31 comments hidden view all 111 comments
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/616609
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e3dc447b908f57e9acc0378111b8e09cbd88ddc5
Submitter: Zuul
Branch: master

commit e3dc447b908f57e9acc0378111b8e09cbd88ddc5
Author: Sean Mooney <email address hidden>
Date: Thu Nov 8 16:07:55 2018 +0000

    raise priority of dead vlan drop

    - This change adds a max priority flow to drop
      all traffic that is associated with the
      DEAD VLAN 4095.
    - This change is part of a partial mitigation of
      bug 1734320. Without this change vlan 4095 traffic
      will be dropped via a low priority flow after being
      processed by part/all of the openflow pipeline.
      By raising the priorty and droping in table 0
      we drop invalid packets as soon as they enter
      the pipeline.

    Change-Id: I3482c7c4f00942828cc9396cd2f3d646c9e8c9d1
    Partial-Bug: #1734320

tags: added: neutron-proactive-backport-potential
Changed in os-vif:
status: Fix Committed → Fix Released
Changed in neutron:
status: In Progress → Fix Committed
Changed in nova:
status: In Progress → Won't Fix
Revision history for this message
sean mooney (sean-k-mooney) wrote :

hi just updating this.
the os-vif fix was released in 1.13.0 but that had a file Handel leak so
1.13.1 is the minium version that should be used to receive this fix.

the neutron change has been commited to master but not released.
both the neutron and os-vif change should be backported.

the nova change was not required so i have set it to wont fix.

the code that is currently commited should be suffient to resovle the bug however there are a number of related changes that i have not written yet that would make this more robost in general.
those changes will likely have to wait until Train to implement as i likely wont have time in stein.

the two addtional changes in neutron would be extending neutron to record the ml2 diriver that bound the port in then neutron port binding_detils and recorded in the ml2 driver send vif plug enent on port bind or on interface plug.

in nova i would like to modify it to pass this info to os-vif so that it can be used by the os-vif plugins to dynamically enable isolation when ml2/ovs is used.
i would also like to use use the info regarding when vif plug events are sent so we can more correctly Handel them in nova.

none of these additional changes are required to fix this bug however the will harden the interface between nova and neutron and should help prevent related bugs from being introduced in the future.

Changed in nova:
status: Won't Fix → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/635083

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/640258

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/640258
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on os-vif (stable/queens)

Change abandoned by sean mooney (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/609851

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on os-vif (stable/rocky)

Change abandoned by sean mooney (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/609850

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by sean mooney (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/616285

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by sean mooney (<email address hidden>) on branch: master
Review: https://review.opendev.org/635083
Reason: we have changed the direction somewhat so this is not needed in it current form.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/686345

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/686346

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/686347

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/686345
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ffee956d44da34e38e62f82a3bc676cdd6179f48
Submitter: Zuul
Branch: stable/rocky

commit ffee956d44da34e38e62f82a3bc676cdd6179f48
Author: Sean Mooney <email address hidden>
Date: Thu Nov 8 16:07:55 2018 +0000

    raise priority of dead vlan drop

    - This change adds a max priority flow to drop
      all traffic that is associated with the
      DEAD VLAN 4095.
    - This change is part of a partial mitigation of
      bug 1734320. Without this change vlan 4095 traffic
      will be dropped via a low priority flow after being
      processed by part/all of the openflow pipeline.
      By raising the priorty and droping in table 0
      we drop invalid packets as soon as they enter
      the pipeline.

    Change-Id: I3482c7c4f00942828cc9396cd2f3d646c9e8c9d1
    Partial-Bug: #1734320
    (cherry picked from commit e3dc447b908f57e9acc0378111b8e09cbd88ddc5)

tags: added: in-stable-rocky
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Should this bug be reopened for os-vif now when revert https://review.opendev.org/#/c/631829 was merged?

@sean-k-mooney can you please share that status of this bug? Is there anything except neutron https://review.opendev.org/#/c/640258 and nova https://review.opendev.org/#/c/602432/ changes pending?

Revision history for this message
sean mooney (sean-k-mooney) wrote :

there is a mitagation for this bug in all cases except when the ovs firewall driver is used with kernel ovs.
https://review.opendev.org/#/c/631829 has not been reverted and there is no plans to revert it.
it.

https://review.opendev.org/#/c/612534/ was intoduced to add a config option to enable isolation.
https://review.opendev.org/#/c/636061/ allows the caller of os-vif to determin if os-vif should plug the interface to the network backend.

the nova change uses that abilty to delegate the plugging to os-vif instead of libvirt
https://review.opendev.org/#/c/602432/13/nova/network/os_vif_util.py

but we cannot do that until the neutron change is merged https://review.opendev.org/#/c/640258

i am not really activly workign on either patch right now.
i tried to repoduce the dvr failutre locally but in my env it seams to work fine.
we know from the upstream testing that in a non dvr env this seams to work fine.
if some neutron dvr folks can try and fix the dvr issue that woudl move things forward
but a few neutorn review have looked and we are not sure why this is broken.

i think the issue is here https://review.opendev.org/#/c/640258/15/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py@404
we are not looking at the correct field however i have not had tiem to actully debug this as i have been working on other issue.

so to summarize the status
form an os-vif point of view i consider this bug to be fixed.
the nova fix is currently blocked by dvr support in the neutron patch.

if you are using a configuration other than kernel ovs with the ovs firewall driver we believe this bug is fixed.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/686346
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9b0919e64809729f4c0186e9be63b1082510bbb1
Submitter: Zuul
Branch: stable/queens

commit 9b0919e64809729f4c0186e9be63b1082510bbb1
Author: Sean Mooney <email address hidden>
Date: Thu Nov 8 16:07:55 2018 +0000

    raise priority of dead vlan drop

    - This change adds a max priority flow to drop
      all traffic that is associated with the
      DEAD VLAN 4095.
    - This change is part of a partial mitigation of
      bug 1734320. Without this change vlan 4095 traffic
      will be dropped via a low priority flow after being
      processed by part/all of the openflow pipeline.
      By raising the priorty and droping in table 0
      we drop invalid packets as soon as they enter
      the pipeline.

    Conflicts:
        neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/openflow/native/test_br_int.py

    Change-Id: I3482c7c4f00942828cc9396cd2f3d646c9e8c9d1
    Partial-Bug: #1734320
    (cherry picked from commit e3dc447b908f57e9acc0378111b8e09cbd88ddc5)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.opendev.org/686347
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=797a379c51a05fca356d2f2575bfc386b0287afd
Submitter: Zuul
Branch: stable/pike

commit 797a379c51a05fca356d2f2575bfc386b0287afd
Author: Sean Mooney <email address hidden>
Date: Thu Nov 8 16:07:55 2018 +0000

    raise priority of dead vlan drop

    - This change adds a max priority flow to drop
      all traffic that is associated with the
      DEAD VLAN 4095.
    - This change is part of a partial mitigation of
      bug 1734320. Without this change vlan 4095 traffic
      will be dropped via a low priority flow after being
      processed by part/all of the openflow pipeline.
      By raising the priorty and droping in table 0
      we drop invalid packets as soon as they enter
      the pipeline.

    Conflicts:
        neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/openflow/native/test_br_int.py

    Change-Id: I3482c7c4f00942828cc9396cd2f3d646c9e8c9d1
    Partial-Bug: #1734320
    (cherry picked from commit e3dc447b908f57e9acc0378111b8e09cbd88ddc5)

tags: added: in-stable-pike
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Copying here my comment on neutron patch https://review.opendev.org/640258/: not sure the failures are related to DVR: it seems we only run full tempest set with dvr enabled (with OVS). Test failures are mostly about VM not becoming ACTIVE (stuck in BUILD), I guess due to neutron VM port not becoming ACTIVE. This should not be affected by DVR in any way. Also in ovs agent logs I see errors on ovs-vsctl operations with ports having ofport -1.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

yes so as i commented before
the issue is here https://review.opendev.org/#/c/640258/15/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py@404
We are not reading the correct field so once we fix that we expect it to work in the dvr case.
that is noted in the Gerrit review. the failures are all cased by the fact we try to read a filed that
does not exist this raise an error and the port does not become acitve. we the reach the timout on the nova side and role back the vm.

it is not an error for a port to have -1 on the ovs side.
that is what we want to allow.
a ofport id of -1 mean the port is declare in the control plane e.g.
in the ovs-db but has not yet been attach to the data plane.
when the tap deive is actully created in the kernel it will automaticaly be
added to the ovs dataplane and a ofport id will be assigned.

so do be clear we expect ove to assing a port id of -1 and it may also raise a waring in the ovs-db
but that is normal and the intended behavior. this is what happens with vhost-user by the way as the ofport id does not get asigned until the vm startrs and until that point it will have a ofportid of -1 not []

Revision history for this message
Paul Peereboom (peereb) wrote :
Download full text (4.1 KiB)

Hi Sean,

I've finally managed to setup a Stein environment to test the isolate_vif option. I tried to set isolate_vif in nova.conf but instances than fails to spawn with the following in the nova-compute.log:

2019-12-18 13:32:07.180 30660 ERROR vif_plug_ovs.ovsdb.impl_vsctl [req-303dc31c-3c1e-446e-8f1e-83b604cee0c6 68910e67a8cb49e1af2d2bb3b4e3fa49 d5d22a5d7e194a0789a07a3c4d2affd6 - default default] Unable to e
xecute ['ovs-vsctl', '--timeout=120', '--oneline', '--format=json', '--db=tcp:127.0.0.1:6640', '--', '--may-exist', 'add-port', 'br-int', 'qvo5639e4bf-52', '--', 'set', 'Interface', 'qvo5639e4bf-52', 'ext
ernal_ids:iface-id=5639e4bf-52d7-4674-9084-e258acb47fae', 'external_ids:iface-status=active', 'external_ids:attached-mac=fa:16:3e:d8:ff:c8', 'external_ids:vm-uuid=1a6cfd1e-7414-4ac3-8c5d-81918a445ab5', 't
ag=4095']. Exception: Unexpected error while running command.
Command: ovs-vsctl --timeout=120 --oneline --format=json --db=tcp:127.0.0.1:6640 -- --may-exist add-port br-int qvo5639e4bf-52 -- set Interface qvo5639e4bf-52 external_ids:iface-id=5639e4bf-52d7-4674-9084
-e258acb47fae external_ids:iface-status=active external_ids:attached-mac=fa:16:3e:d8:ff:c8 external_ids:vm-uuid=1a6cfd1e-7414-4ac3-8c5d-81918a445ab5 tag=4095
Exit code: 1
Stdout: ''
Stderr: 'ovs-vsctl: Interface does not contain a column whose name matches "tag"\n': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2019-12-18 13:32:07.254 30660 INFO os_vif [req-303dc31c-3c1e-446e-8f1e-83b604cee0c6 68910e67a8cb49e1af2d2bb3b4e3fa49 d5d22a5d7e194a0789a07a3c4d2affd6 - default default] Successfully plugged vif VIFBridge(
active=False,address=fa:16:3e:d8:ff:c8,bridge_name='qbr5639e4bf-52',has_traffic_filtering=True,id=5639e4bf-52d7-4674-9084-e258acb47fae,network=Network(c4197b2d-f568-4bd2-9b0a-eb0fa05e7428),plugin='ovs',po
rt_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap5639e4bf-52')
2019-12-18 13:32:07.275 30660 ERROR vif_plug_ovs.ovsdb.impl_vsctl [req-be4f125b-e4e5-427d-ba84-b0fb86478807 68910e67a8cb49e1af2d2bb3b4e3fa49 d5d22a5d7e194a0789a07a3c4d2affd6 - default default] Unable to e
xecute ['ovs-vsctl', '--timeout=120', '--oneline', '--format=json', '--db=tcp:127.0.0.1:6640', '--', '--may-exist', 'add-port', 'br-int', 'qvo8e2e0cb9-27', '--', 'set', 'Interface', 'qvo8e2e0cb9-27', 'ext
ernal_ids:iface-id=8e2e0cb9-2785-4d09-835d-f20897035cf2', 'external_ids:iface-status=active', 'external_ids:attached-mac=fa:16:3e:41:40:43', 'external_ids:vm-uuid=4773263b-d538-4e1b-b405-7c39c96eb7e2', 't
ag=4095']. Exception: Unexpected error while running command.
Command: ovs-vsctl --timeout=120 --oneline --format=json --db=tcp:127.0.0.1:6640 -- --may-exist add-port br-int qvo8e2e0cb9-27 -- set Interface qvo8e2e0cb9-27 external_ids:iface-id=8e2e0cb9-2785-4d09-835d
-f20897035cf2 external_ids:iface-status=active external_ids:attached-mac=fa:16:3e:41:40:43 external_ids:vm-uuid=4773263b-d538-4e1b-b405-7c39c96eb7e2 tag=4095
Exit code: 1
Stdout: ''
Stderr: 'ovs-vsctl: Interface does not contain a column whose name matches "tag"\n': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

Open...

Read more...

Revision history for this message
Paul Peereboom (peereb) wrote :

Aha think I got it, looks like the tag order is wrong:

Fails:
ovs-vsctl --timeout=120 --oneline --format=json --db=tcp:127.0.0.1:6640 -- --may-exist add-port br-int qvo8e2e0cb9-27 -- set Interface qvoc94a40f4-a3 external_ids:iface-id=8e2e0cb9-2785-4d09-835d-f20897035cf2 external_ids:iface-status=active external_ids:attached-mac=fa:16:3e:41:40:43 external_ids:vm-uuid=4773263b-d538-4e1b-b405-7c39c96eb7e2 tag=4095

Works:
ovs-vsctl --timeout=120 --oneline --format=json --db=tcp:127.0.0.1:6640 -- --may-exist add-port br-int qvo8e2e0cb9-27 tag=4095 -- set Interface qvoc94a40f4-a3 external_ids:iface-id=8e2e0cb9-2785-4d09-835d-f20897035cf2 external_ids:iface-status=active external_ids:attached-mac=fa:16:3e:41:40:43 external_ids:vm-uuid=4773263b-d538-4e1b-b405-7c39c96eb7e2

tags: removed: neutron-proactive-backport-potential
ABDULLAH (l7kx)
Changed in neutron:
status: Fix Committed → Confirmed
status: Confirmed → New
Changed in nova:
status: In Progress → Fix Released
Changed in neutron:
status: New → Incomplete
status: Incomplete → Confirmed
Changed in neutron:
assignee: sean mooney (sean-k-mooney) → Rodolfo Alonso (rodolfo-alonso-hernandez)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/640258
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7fd2725cb14b81f442eb57a38755829270ff2c43
Submitter: Zuul
Branch: master

commit 7fd2725cb14b81f442eb57a38755829270ff2c43
Author: Sean Mooney <email address hidden>
Date: Fri Mar 1 04:43:20 2019 +0000

    Do not skip ports with ofport unset or invalid

    This change removes the "_check_ofport" function and its use form
    the ovs_lib.py file.

    By skipping ports without a unique ofport in the "get_vifs_by_ids"
    and "get_vifs_by_id" functions, the OVS agent incorrectly treated
    newly added port with an ofport of -1 as removed ports in the
    "treat_devices_added_or_updated" function.

    Co-Authored-By: Rodolfo Alonso Hernandez <email address hidden>

    Change-Id: I79158baafbb99bee99a1d687039313eb454d3a9b
    Partial-Bug: #1734320
    Partial-Bug: #1815989

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/748296

tags: added: neutron-proactive-backport-potential
Revision history for this message
sean mooney (sean-k-mooney) wrote :

by the way while the neutron patch is now merged the nova patch to delegate the pluggin to os-vif is blocked as the neutron api seams to be sending network-vif-plugged events at the wrong time when we are evacuating nad during some other lifecycle operations like revert. so untill we figure out that the backport fo the most resent neturon patch will do nothing help adress this.

the way nova is using neutron has not changed so libvirt will still delete the port on ovs and recreated it defeating the work that has been done to validate the netowrking is set up before we migrate.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

with that said it is safe to backport the neutron patch that recnetly merged and it will have to be backported before the nova patch is but we were recommending waiting for the nova patch and any other neutron patche that might be required to merge first before starting the backports
we only want to backport a fully working solution

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by sean mooney (<email address hidden>) on branch: master
Review: https://review.opendev.org/748296

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/754475

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/753314
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c8a819aff4db62e58192af0a272a7f1ce7923146
Submitter: Zuul
Branch: master

commit c8a819aff4db62e58192af0a272a7f1ce7923146
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Sep 24 09:44:47 2020 +0000

    Filter out port with invalid ofport in OVS firewall

    Since [1], "get_vif_port_by_id" is also returning ports with an
    invalid ofport. OVS firewall cannot set an OpenFlow rule for a port
    without a valid ofport. "get_ovs_port" should filter out those ports.

    Related-Bug: #1815989
    Related-Bug: #1734320

    [1]https://review.opendev.org/#/c/640258/

    Change-Id: Id12486b3127ab4ac8ad9ef2b3641da1b79a25a50

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/756478

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/756478
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=24dd977c2217bb10baf2cf226eab2bd0e5b47d62
Submitter: Zuul
Branch: stable/victoria

commit 24dd977c2217bb10baf2cf226eab2bd0e5b47d62
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Sep 24 09:44:47 2020 +0000

    Filter out port with invalid ofport in OVS firewall

    Since [1], "get_vif_port_by_id" is also returning ports with an
    invalid ofport. OVS firewall cannot set an OpenFlow rule for a port
    without a valid ofport. "get_ovs_port" should filter out those ports.

    Related-Bug: #1815989
    Related-Bug: #1734320

    [1]https://review.opendev.org/#/c/640258/

    Change-Id: Id12486b3127ab4ac8ad9ef2b3641da1b79a25a50
    (cherry picked from commit c8a819aff4db62e58192af0a272a7f1ce7923146)

tags: added: in-stable-victoria
Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/602432
Committed: https://opendev.org/openstack/nova/commit/a62dd42c0dbb6b2ab128e558e127d76962738446
Submitter: "Zuul (22348)"
Branch: master

commit a62dd42c0dbb6b2ab128e558e127d76962738446
Author: Stephen Finucane <email address hidden>
Date: Fri Apr 30 12:51:35 2021 +0100

    libvirt: Delegate OVS plug to os-vif

    os-vif 1.15.0 added the ability to create an OVS port during plugging
    by specifying the 'create_port' attribute in the 'port_profile' field.
    By delegating port creation to os-vif, we can rely on it's 'isolate_vif'
    config option [1] that will temporarily configure the VLAN to 4095
    (0xfff), which is reserved for implementation use [2] and is used by
    neutron to as a dead VLAN [3]. By doing this, we ensure VIFs are plugged
    securely, preventing guests from accessing other tenants' networks
    before the neutron OVS agent can wire up the port.

    This change requires a little dance as part of the live migration flow.
    Since we can't be certain the destination host has a version of os-vif
    that supports this feature, we need to use a sentinel to indicate when
    it does. Typically we would do so with a field in
    'LibvirtLiveMigrateData', such as the 'src_supports_numa_live_migration'
    and 'dst_supports_numa_live_migration' fields used to indicate support
    for NUMA-aware live migration. However, doing this prevents us
    backporting this important fix since o.vo changes are not backportable.
    Instead, we (somewhat evilly) rely on the free-form nature of the
    'VIFMigrateData.profile_json' string field, which stores JSON blobs and
    is included in 'LibvirtLiveMigrateData' via the 'vifs' attribute, to
    transport this sentinel. This is a hack but is necessary to work around
    the lack of a free-form "capabilities" style dict that would allow us do
    backportable fixes to live migration features.

    Note that this change has the knock on effect of modifying the XML
    generated for OVS ports: when hybrid plug is false will now be of type
    'ethernet' rather than 'bridge' as before. This explains the larger than
    expected test damage but should not affect users.

    [1] https://opendev.org/openstack/os-vif/src/tag/2.4.0/vif_plug_ovs/ovs.py#L90-L93
    [2] https://en.wikipedia.org/wiki/IEEE_802.1Q#Frame_format
    [3] https://answers.launchpad.net/neutron/+question/231806

    Change-Id: I11fb5d3ada7f27b39c183157ea73c8b72b4e672e
    Depends-On: Id12486b3127ab4ac8ad9ef2b3641da1b79a25a50
    Closes-Bug: #1734320
    Closes-Bug: #1815989

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/790447

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/797142

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/797144

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/797291

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/797316

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/797428

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/797142
Committed: https://opendev.org/openstack/nova/commit/99cf5292c782828ea33a3448ae0e7da028d5d176
Submitter: "Zuul (22348)"
Branch: master

commit 99cf5292c782828ea33a3448ae0e7da028d5d176
Author: Stephen Finucane <email address hidden>
Date: Fri Jun 18 17:55:50 2021 +0100

    objects: Fix VIFMigrateData.supports_os_vif_delegation setter

    We're using a 'profile' getter/setter in the 'VIFMigrateData' object to
    ensure we transform the JSON encoded string we store in the database to
    an actual dictionary on load and vice versa on save. However, because
    the getter is returning a new object (constructed from 'json.loads')
    rather than a reference to the original data (which is a string),
    modifying this object doesn't actually change the underlying data in the
    object. We were relying on this broken behavior to set the
    'supports_os_vif_delegation' attribute of the 'VIFMigrateData' object
    and trigger the delegated port creation introduced in change
    I11fb5d3ada7f27b39c183157ea73c8b72b4e672e, which means that code isn't
    actually doing anything yet. Resolve this.

    Change-Id: I362deb1088c88cdcd8219922da9dc9a01b10a940
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1734320
    Related-Bug: #1815989

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/797428
Committed: https://opendev.org/openstack/nova/commit/fa0fb2fe3d61de1cb871c48ee97053cf2fb5827a
Submitter: "Zuul (22348)"
Branch: master

commit fa0fb2fe3d61de1cb871c48ee97053cf2fb5827a
Author: Stephen Finucane <email address hidden>
Date: Tue Jun 22 11:37:22 2021 +0100

    libvirt: Always delegate OVS plug to os-vif

    In change I11fb5d3ada7f27b39c183157ea73c8b72b4e672e, we started
    delegating plugging of OVS ports to os-vif to work around a number of
    bugs. However, this was only introduced for live migration. Plugging is
    still handled by libvirt for spawn. This results in an odd situation,
    whereby an interface of type 'bridge' will be use when creating the
    instance initially, only for this to change to 'ethernet' on live
    migration. Resolve this by *always* delegating plugging to os-vif. This
    is achieved by consistently setting the 'delegate_create' attribute of
    'nova.network.model.VIF' to 'True', which will later get transformed to
    the 'create_port' attribute of the 'os_vif.objects.vif.VIFOpenVSwitch'
    object(s) created in 'nova.network.os_vif_util._nova_to_osvif_vif_ovs'
    and ultimately result in delegate port creation.

    Note that we don't need to worry about making the setting of
    'delegate_create' conditional on whether we're looking at an OVS port or
    not: this will be handled by '_nova_to_osvif_vif_ovs'. We also don't
    need to worry about unsetting this attribute before a live migration:
    the 'delegate_create' attribute is always overridden as part of
    'nova.objects.migrate_data.VIFMigrateData.get_dest_vif'.

    Change-Id: I014c5a81752f86c6b99d19d769c42f318e18e676
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-Bug: #1734320
    Related-Bug: #1815989

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.0.0.0rc1

This issue was fixed in the openstack/nova 24.0.0.0rc1 release candidate.

Displaying first 40 and last 40 comments. View all 111 comments or add a comment.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.