l2 population fails when bulk live migrating VMs

Bug #1483601 reported by shihanzhang
Affects: neutron
Status: Fix Released
Importance: Undecided
Assigned to: shihanzhang
Milestone: (none)

Bug Description

When we bulk live migrate VMs, l2 population may (not always) fail on the destination compute node. When nova migrates a VM to the destination compute node, it only updates the port's binding:host; the port's status stays active. From neutron's perspective, the port status then progresses as: active -> build -> active.
In the case below, l2 population fails:
1. nova successfully live migrates VM A and VM B from compute A to compute B.
2. port A and port B are both active, and their binding:host is compute B.
3. the l2 agent scans these two ports, then handles them one by one.
4. neutron-server handles port A first: its status becomes build (remember that port B's status is still active). It then runs the l2 population check below, and the check fails because port B still counts as an active port on compute B:

def _update_port_up(self, context):
    ......
    if agent_active_ports == 1 or (self.get_agent_uptime(agent) < cfg.CONF.l2pop.agent_boot_time):
        # First port activated on current agent in this network,
        # we have to provide it with the whole list of fdb entries
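
The scenario above can be reproduced with a small toy model. This is only a sketch, not neutron code: the `Port` class and the helpers `count_active_ports` and `port_up_sends_full_fdb` are hypothetical names, and the agent uptime clause of the real check is left out on the assumption that the agent has been running for longer than `agent_boot_time`.

```python
# Toy model of the l2pop "first active port" check; not neutron code.
from dataclasses import dataclass

ACTIVE, BUILD = "ACTIVE", "BUILD"


@dataclass
class Port:
    name: str
    host: str
    status: str


def count_active_ports(ports, host):
    """Count ACTIVE ports bound to host (stand-in for agent_active_ports)."""
    return sum(1 for p in ports if p.host == host and p.status == ACTIVE)


def port_up_sends_full_fdb(ports, port, host):
    """Mark port ACTIVE and report whether the catch-up branch would run."""
    port.status = ACTIVE
    return count_active_ports(ports, host) == 1


# Both ports were just migrated: binding host is compute B, status still ACTIVE.
ports = [Port("A", "compute-B", ACTIVE), Port("B", "compute-B", ACTIVE)]

# The agent handles port A first (ACTIVE -> BUILD -> ACTIVE); B stays ACTIVE.
ports[0].status = BUILD
sent_for_a = port_up_sends_full_fdb(ports, ports[0], "compute-B")

# Then port B goes through the same cycle while A is already ACTIVE again.
ports[1].status = BUILD
sent_for_b = port_up_sends_full_fdb(ports, ports[1], "compute-B")

print(sent_for_a, sent_for_b)
```

In this model neither port ever sees an active-port count of 1 on compute B, so the branch that sends the whole list of fdb entries is not taken for either of them.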

description: updated
description: updated
yalei wang (yalei-wang)
Changed in neutron:
assignee: nobody → yalei wang (yalei-wang)
Revision history for this message
shihanzhang (shihanzhang) wrote :

@yalei wang, I have an idea to fix this bug, do you want to fix it?

Revision history for this message
yalei wang (yalei-wang) wrote :

hi hanzhang, pls assign to yourself.

Changed in neutron:
assignee: yalei wang (yalei-wang) → nobody
Revision history for this message
shihanzhang (shihanzhang) wrote :

@yalei wang, I will submit the patch as soon as possible, please help to review :)

Changed in neutron:
assignee: nobody → shihanzhang (shihanzhang)
Revision history for this message
Kevin Benton (kevinbenton) wrote :

I think the right thing to do here is to adjust ML2 to set the port status to DOWN on any ports that have the host_id updated.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Here is a patch that looks like it should work:

diff --git a/neutron/plugins/ml2/plugin.py b/neutron/plugins/ml2/plugin.py
index 904abe9..816e762 100644
--- a/neutron/plugins/ml2/plugin.py
+++ b/neutron/plugins/ml2/plugin.py
@@ -1153,8 +1153,12 @@ class Ml2Plugin(db_base_plugin_v2.NeutronDbPluginV2,
                 original_port=original_port)
             new_host_port = self._get_host_port_if_changed(
                 mech_context, attrs)
-            need_port_update_notify |= self._process_port_binding(
+            binding_changed = self._process_port_binding(
                 mech_context, attrs)
+            if binding_changed:
+                need_port_update_notify = True
+                self.update_port_status(context, id, const.PORT_STATUS_DOWN)
+                updated_port['status'] = const.PORT_STATUS_DOWN
             # For DVR router interface ports we need to retrieve the
             # DVRPortbinding context instead of the normal port context.
             # The normal Portbinding context does not have the status
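
To see why resetting the status helps, here is a minimal sketch of the idea in the patch, using plain dicts instead of ML2 objects. The helpers `update_binding_host` and `active_ports_on` are hypothetical and only mimic the behaviour described in this report; they are not part of neutron.

```python
# Toy model of "set the port to DOWN when its binding host changes".
ACTIVE, DOWN = "ACTIVE", "DOWN"


def update_binding_host(port, new_host):
    """Rebind a port and mark it DOWN if the host actually changed."""
    if port["host"] != new_host:
        port["host"] = new_host
        port["status"] = DOWN
    return port


def active_ports_on(ports, host):
    """Count ACTIVE ports bound to host (stand-in for agent_active_ports)."""
    return sum(1 for p in ports if p["host"] == host and p["status"] == ACTIVE)


ports = [
    {"name": "A", "host": "compute-A", "status": ACTIVE},
    {"name": "B", "host": "compute-A", "status": ACTIVE},
]

# nova updates the binding host of both ports after the live migration.
for p in ports:
    update_binding_host(p, "compute-B")

# The agent on compute B wires port A first; port B is still DOWN.
ports[0]["status"] = ACTIVE
first_port_check = active_ports_on(ports, "compute-B") == 1

# Port B comes up afterwards; compute B has already been caught up by then.
ports[1]["status"] = ACTIVE
second_port_check = active_ports_on(ports, "compute-B") == 1

print(first_port_check, second_port_check)
```

Because both ports start from DOWN on the new host, the first one to come up is seen as the only active port there, which is the condition the l2pop driver needs in order to send the full set of fdb entries.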

Revision history for this message
shihanzhang (shihanzhang) wrote :

hi kevin, thx for your nice patch!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/215467

Changed in neutron:
status: New → In Progress
tags: added: l2-pop
Revision history for this message
Mathieu Rohon (mathieu-rohon) wrote :

I would not go for a change to DOWN state when the port is migrated. If the migration fails, the port wouldn't be available during the live migration process. I'm not even sure that the port would move back to UP after a failure. It depends on what nova does with port attachment during live migration.

A new state such as "MIGRATING" seems consistent with the nova state during live migration.

Revision history for this message
Mathieu Rohon (mathieu-rohon) wrote :

here is a call flow that I did a few months ago that might help:

http://paste.openstack.org/show/198298/

Revision history for this message
shihanzhang (shihanzhang) wrote :

@Mathieu Rohon, your suggestion is very good, but right now nova only updates the port's host_id when it finishes the live migration (as far as I know, nova updates the port's host_id in its post_migrate steps), so I think the problems you mentioned do not occur.

Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

Mathieu,

i really think we should have a similar diagram in devref!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/215467
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c5fa665de3173f3ad82cc3e7624b5968bc52c08d
Submitter: Jenkins
Branch: master

commit c5fa665de3173f3ad82cc3e7624b5968bc52c08d
Author: shihanzhang <email address hidden>
Date: Fri Aug 21 09:51:59 2015 +0800

    ML2: update port's status to DOWN if its binding info has changed

    This fixes the problem that when two or more ports in a network
    are migrated to a host that did not previously have any ports in
    the same network, the new host is sometimes not told about the
    IP/MAC addresses of all the other ports in the network. In other
    words, initial L2population does not work, for the new host.

    This is because the l2pop mechanism driver only sends catch-up
    information to the host when it thinks it is dealing with the first
    active port on that host; and currently, when multiple ports are
    migrated to a new host, there is always more than one active port so
    the condition above is never triggered.

    The fix is for the ML2 plugin to set a port's status to DOWN when
    its binding info changes.

    This patch also fixes the bug when nova thinks it should not wait
    for any events from neutron because all ports are already active.

    Closes-bug: #1483601
    Closes-bug: #1443421
    Closes-Bug: #1522824
    Related-Bug: #1450604

    Change-Id: I342ad910360b21085316c25df2154854fd1001b2
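
The last point of the commit message (nova deciding not to wait for any events because all ports are already active) can be illustrated with a short sketch. The helper `events_to_wait_for` is hypothetical and only loosely modelled on that behaviour; the event name `network-vif-plugged` is the one nova uses for VIF plug notifications, while the port dicts are an assumption made for the example.

```python
# Hypothetical helper: wait only for ports that are not ACTIVE yet.
def events_to_wait_for(ports):
    """Return (event name, port id) pairs for ports that still need plugging."""
    return [
        ("network-vif-plugged", p["id"])
        for p in ports
        if p["status"] != "ACTIVE"
    ]


# Before the fix: the migrated port still reports ACTIVE, so nothing is awaited.
before_fix = events_to_wait_for([{"id": "port-a", "status": "ACTIVE"}])

# After the fix: the binding change reset the port to DOWN, so nova has an
# event to wait for until the agent on the new host wires the port.
after_fix = events_to_wait_for([{"id": "port-a", "status": "DOWN"}])

print(before_fix, after_fix)
```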

Changed in neutron:
status: In Progress → Fix Released
Assaf Muller (amuller)
tags: added: liberty-backport-potential
tags: added: kilo-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/300539

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/300559

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/300539
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a38cb93dde1633005e9e66e6b7ecec9e726304bb
Submitter: Jenkins
Branch: stable/liberty

commit a38cb93dde1633005e9e66e6b7ecec9e726304bb
Author: venkata anil <email address hidden>
Date: Fri Apr 1 14:52:01 2016 +0000

    ML2: update port's status to DOWN if its binding info has changed

    This fixes the problem that when two or more ports in a network
    are migrated to a host that did not previously have any ports in
    the same network, the new host is sometimes not told about the
    IP/MAC addresses of all the other ports in the network. In other
    words, initial L2population does not work, for the new host.

    This is because the l2pop mechanism driver only sends catch-up
    information to the host when it thinks it is dealing with the first
    active port on that host; and currently, when multiple ports are
    migrated to a new host, there is always more than one active port so
    the condition above is never triggered.

    The fix is for the ML2 plugin to set a port's status to DOWN when
    its binding info changes.

    This patch also fixes the bug when nova thinks it should not wait
    for any events from neutron because all ports are already active.

    Closes-bug: #1483601
    Closes-bug: #1443421
    Closes-Bug: #1522824
    Related-Bug: #1450604
    (cherry picked from commit c5fa665de3173f3ad82cc3e7624b5968bc52c08d)

    Conflicts: neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Change-Id: I342ad910360b21085316c25df2154854fd1001b2

tags: added: in-stable-liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/306300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/kilo)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/300559

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Dave Walker (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/306300
Reason:
stable/kilo closed for 2015.1.4

This release is now pending its final release and no freeze exception has
been seen for this changeset. Therefore, I am now abandoning this change.

If this is not correct, please urgently raise a thread on openstack-dev.

More details at: https://wiki.openstack.org/wiki/StableBranch

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 7.1.0

This issue was fixed in the openstack/neutron 7.1.0 release.

tags: removed: kilo-backport-potential liberty-backport-potential