The first VM of a network on a compute node cannot send RARP packets during KVM live migration in a neutron ML2 hierarchical port binding environment whose second mechanism driver is configured as the existing OVS driver "openvswitch"

Bug #1671379 reported by Zhipeng Shen
Affects                   Status       Importance  Assigned to  Milestone
OpenStack Compute (nova)  In Progress  Undecided   Unassigned
neutron                   New          Undecided   Unassigned

Bug Description

Description
===========
Normally, after the following bug was fixed, a VM that migrates to the destination node can send several RARP packets during KVM live migration in a simple OVS + VLAN environment.
The OVS + VLAN bug URL:
https://bugs.launchpad.net/neutron/+bug/1414559

In a neutron ML2 hierarchical port binding environment,
I find that no RARP packets can be captured on the physical port attached to the VLAN physical provider's OVS bridge on the destination node when the VM migrates there.

Steps to reproduce
==================
1. create a VXLAN-type network: netA
2. create a subnet for netA: subA
3. create a VM on the compute1 node: vmA
4. tcpdump the physical port attached to the OVS bridge on the compute2 node:
  tcpdump -i ens33 -w ens33.pcap
5. live-migrate the VM to the other compute node: compute2
6. open ens33.pcap in Wireshark
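
To check the capture from step 4 without opening Wireshark, a minimal stdlib sketch can count RARP frames in a classic pcap file (the function name and the 802.1Q handling below are illustrative, and pcapng captures are not handled):

```python
import struct

RARP_ETHERTYPE = 0x8035

def count_rarp_frames(pcap_path):
    """Count Ethernet frames carrying RARP in a classic (non-pcapng) capture."""
    count = 0
    with open(pcap_path, "rb") as f:
        header = f.read(24)  # global pcap file header
        if len(header) < 24:
            return 0
        magic = struct.unpack("<I", header[:4])[0]
        # 0xa1b2c3d4 / 0xa1b23c4d => little-endian file, otherwise big-endian
        endian = "<" if magic in (0xa1b2c3d4, 0xa1b23c4d) else ">"
        while True:
            record = f.read(16)  # per-packet record header
            if len(record) < 16:
                break
            _, _, incl_len, _ = struct.unpack(endian + "IIII", record)
            data = f.read(incl_len)
            if len(data) < 14:
                continue
            ethertype = struct.unpack("!H", data[12:14])[0]
            if ethertype == 0x8100 and len(data) >= 18:  # skip one 802.1Q tag
                ethertype = struct.unpack("!H", data[16:18])[0]
            if ethertype == RARP_ETHERTYPE:
                count += 1
    return count

# Usage on the capture from step 4:
#   print(count_rarp_frames("ens33.pcap"))
```

On an affected destination node this count stays at zero, matching the actual result below.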

Expected result
===============
several RARP packets are captured

Actual result
=============
no RARP packets are captured

Environment
===========
OpenStack: Kilo 2015.1.2
OS: CentOS 7.1.1503
Libvirt: 1.2.17

Logs & Configs
==============
hierarchical port binding configuration:
controller node:
#neutron
/etc/neutron/plugins/ml2/ml2_conf.ini
[ml2]
type_drivers = vxlan,vlan
tenant_network_types = vxlan,vlan
mechanism_drivers=ml2_h3c,openvswitch
# ml2_h3c, a mechanism driver owned by New H3C Group (a provider of new IT
# solutions), allocates a dynamic VLAN segment for the existing mechanism
# driver "openvswitch"

[ml2_type_vlan]
network_vlan_ranges = compute1_physicnet1:100:1000, compute2_physicnet1:100:1000
[ml2_type_vxlan]
vni_ranges=1:500

compute1 node:
#neutron
/etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
[ovs]
bridge_mappings=compute1_physicnet1:br-ens33

compute2 node:
#neutron
/etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
[ovs]
bridge_mappings=compute2_physicnet1:br-ens33
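
The mismatch these mappings produce during migration can be sketched in a few lines of Python (the function and variable names are illustrative, not the actual agent code):

```python
# Illustrative sketch of the lookup that fails on the destination node.

def find_physical_bridge(bridge_mappings, port_details):
    """Return the local OVS bridge for the port's physical network, if any."""
    return bridge_mappings.get(port_details["physical_network"])

# compute2's bridge_mappings from ovs_neutron_plugin.ini above
compute2_mappings = {"compute2_physicnet1": "br-ens33"}

# During live migration the port details still carry the VLAN segment that
# was dynamically allocated for the source node, so physical_network is stale.
stale_details = {"physical_network": "compute1_physicnet1", "segmentation_id": 150}

print(find_physical_bridge(compute2_mappings, stale_details))  # None
# No matching bridge -> the tap port gets no VLAN tag and br-ens33
# drops the RARP packets from the migrated VM.
```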

Analysis
==============
After reading the live-migration-related code of nova, neutron-server, and neutron-openvswitch-agent, I think this may be a bug.

The brief relevant process:

1. source compute node(nova-compute) compute1 node
  self.driver(libvirt).live_migration
   dom.migrateToURI2 --------------- Execute migration to dest node
    self._live_migration_monitor------------------ Monitor migration finished
     self._post_live_migration ---------------- Migration finished
      self.compute_rpcapi.post_live_migration_at_destination --- Notify
                                                           destination node

2.1. destination compute node (neutron-openvswitch-agent) compute2 node
   rpc_loop ------ monitor vm's tapxxxx port plug
    self.process_network_ports
     self.treat_devices_added_or_updated
      self.plugin_rpc.get_devices_details_list ------- The port details show
                 that the port is still bound to
                 "compute1_physicnet1", not to the physical network
                 provider "compute2_physicnet1" that exists on the
                 destination compute node.
        self.treat_vif_port
         self.port_bound
          self.provision_local_vlan --- There is no matching physical bridge
                 at that time. As a result, no VLAN tag can be set on
                 the tap port. Eventually br-ens33, the physical
                 bridge, drops the RARP packets from the starting VM.

2.2 destination compute node (nova-compute) compute2 node
   post_live_migration_at_destination nova/compute/manager.py
    self.network_api.migrate_instance_finish
     self._update_port_binding_for_instance ------------ Notify neutron to
                               update the port's binding:host_id

3. controller node(neutron-server)
   ml2_h3c: fill self._new_bound_segment and self._next_segments_to_bind with
        compute2_physicnet1 for openvswitch driver
   openvswitch: bind port with compute2_physicnet1's allocated segment from level 0
          driver ml2_h3c

In the current Kilo flow, the ML2 drivers only finish binding the port in step 3,
the last step. By the time nova notifies neutron-server that the port's
binding:host_id has changed, it is too late for neutron-openvswitch-agent to fetch
suitable port details from neutron-server, set the correct VLAN tag on the VM port,
and add the relevant flows to the OVS bridges in the ML2 hierarchical port binding
case.
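
The ordering problem can be reduced to a small model (purely illustrative; the real flow spans nova-compute, neutron-server, and the OVS agent): RARP frames only get through if the rebind to the destination segment completes before the agent wires up the tap port.

```python
# Illustrative model of the ordering problem; the event names are made up.

def rarp_forwarded(event_order):
    """True if the port is rebound to the destination segment before the
    OVS agent wires the newly plugged tap port (and tags it with a VLAN)."""
    return (event_order.index("rebind_to_dest_segment")
            < event_order.index("agent_wires_port"))

# Order observed on Kilo (steps 2.1 and 3 above): the agent wires the port
# first, then nova triggers the rebind -> the VM's RARP frames are dropped.
kilo_order = ["agent_wires_port", "nova_updates_host_id", "rebind_to_dest_segment"]
print(rarp_forwarded(kilo_order))   # False

# Desired order: rebind before the agent processes the plugged port.
fixed_order = ["nova_updates_host_id", "rebind_to_dest_segment", "agent_wires_port"]
print(rarp_forwarded(fixed_order))  # True
```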

Liberty and Mitaka appear to have the same problem.

Zhipeng Shen (gooduone)
tags: added: live-migration
Revision history for this message
Kevin Benton (kevinbenton) wrote :

This sounds like something that wouldn't be resolved until we have the multiple port bindings spec in place to get the switch mech driver to wire up the additional vlan.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Can you see if the solution in https://bugs.launchpad.net/neutron/+bug/1414559 would have helped?

Zhipeng Shen (gooduone)
description: updated
Revision history for this message
Zhipeng Shen (gooduone) wrote :

https://bugs.launchpad.net/neutron/+bug/1414559

I read that bug's description and its fix last week.

The fix only resolves the problem of a single ML2 driver, "openvswitch", being unable to send RARP packets during live migration; it does not address the multiple-ML2-driver port binding problem I reported.

Revision history for this message
Sivasathurappan Radhakrishnan (siva-radhakrishnan) wrote :

It might be related to https://bugs.launchpad.net/nova/+bug/1605016. I have a patch for it where port binding is done based on libvirt events instead of happening in the 3rd step as mentioned above: https://review.openstack.org/#/c/434870/. I will try to replicate this issue on top of my patch and see what happens.

Revision history for this message
Timofey Durakov (tdurakov) wrote :

Hello, Zhipeng Shen, have you tested the patch mentioned in comment #4?

Revision history for this message
Zhipeng Shen (gooduone) wrote :

Hello, Timofey Durakov. I'm sorry to reply so late.

I've been busy with some things recently, and I should be able to test it next month.

Revision history for this message
Zhipeng Shen (gooduone) wrote :

Hello, Timofey Durakov, Sivasathurappan Radhakrishnan
I haven't tested the patch mentioned in comment #4 yet.
The libvirt version in this bug's environment (https://bugs.launchpad.net/nova/+bug/1671379) is 1.2.17, which is lower than the minimum libvirt version involved in the post-copy bug (https://bugs.launchpad.net/neutron/+bug/1414559).

I'm sure that the bug I reported is related to ml2-hierarchical-port-binding.

ml2-hierarchical-port-binding pages:
Specification:
http://specs.openstack.org/openstack/neutron-specs/specs/kilo/ml2-hierarchical-port-binding.html

Launchpad blueprint:
https://blueprints.launchpad.net/neutron/+spec/ml2-hierarchical-port-binding

Revision history for this message
Sean Dague (sdague) wrote :

Automatically discovered version kilo in description. If this is incorrect, please update the description to include 'nova version: ...'

tags: added: openstack-version.kilo
Sean Dague (sdague)
Changed in nova:
status: New → In Progress