Connection to an instance with floating IP breaks during block migration when using DVR

Bug #1456073 reported by Itzik Brown on 2015-05-18
This bug affects 4 people
Affects: OpenStack Compute (nova)
  Importance: High
  Assigned to: Swaminathan Vasudevan
Affects: Mitaka
  Importance: Undecided
  Assigned to: Unassigned
Affects: neutron
  Importance: High
  Assigned to: Swaminathan Vasudevan

Bug Description

When an instance with a floating IP is migrated using block migration and its router is a DVR router, existing connections to the instance break (for example, an open SSH session).
Reconnecting to the instance succeeds.

Version
======
RHEL 7.1
python-nova-2015.1.0-3.el7ost.noarch
python-neutron-2015.1.0-1.el7ost.noarch

How to reproduce
==============
1. Create a distributed router and attach an internal and an external network to it.
    # neutron router-create --distributed True router1
    # neutron router-interface-add router1 <subnet1 id>
    # neutron router-gateway-set router1 <external network id>

2. Launch an instance and associate it with a floating IP.
    # nova boot --flavor m1.small --image fedora --nic net-id=<internal network id> vm1

3. SSH into the instance that will be migrated and run the command: while true; do echo "Hello"; sleep 1; done

4. Migrate the instance using block migration
     # nova live-migration --block-migrate <instance id>

5. Verify that the connection to the instance is lost.
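
To put a number on the interruption in step 5, a small probe can be run from a machine that reaches the floating IP. The following is only a sketch, not part of the original report: the function names, the port (22) and the one-second interval are assumptions. It measures whether new TCP connections can be opened, which is related to but not the same as whether an already established SSH session survives.

```python
import socket
import time


def probe(host, port=22, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples):
    """Return the longest unreachable period, in seconds.

    samples is a list of (timestamp, reachable) tuples in time order.
    An outage runs from the first failed probe to the next successful
    one, or to the last sample if the host never comes back.
    """
    longest = 0.0
    outage_start = None
    for stamp, reachable in samples:
        if not reachable and outage_start is None:
            outage_start = stamp
        elif reachable and outage_start is not None:
            longest = max(longest, stamp - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


def monitor(host, duration=120.0, interval=1.0):
    """Probe host once per interval for roughly duration seconds."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(host)))
        time.sleep(interval)
    return samples
```

Start monitor() with the floating IP just before step 4, then pass its result to longest_outage() once the migration has finished.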

shihanzhang (shihanzhang) wrote :

I think the reason is that after a VM is live-migrated from compute node A to compute node B, the l2 agent on compute node B needs a little time to scan this port and get its info from neutron-server through RPC, so the connection breaks.

Gary Kotton (garyk) on 2015-11-19
tags: added: l3-dvr-backlog

Right now Nova does not send Neutron any pre-live-migration event notification, so Neutron cannot pre-emptively deploy the router and the additional hooks.

We need a Nova-to-Neutron handshake at pre-live-migration: Nova should migrate only after Neutron notifies it with a success message.
Otherwise we will keep seeing this problem.

Hardik Italia (hardik-italia) wrote :

Able to reproduce the issue only with a DVR router.
Marking this bug as confirmed.

Changed in neutron:
status: New → Confirmed
Oleg Bondarev (obondarev) wrote :

I'm not sure that having the pre-live-migration step wait for neutron to schedule and apply the DVR router on the destination node would avoid the connection loss. With legacy routers, floating ips are assigned on the controllers, and this doesn't change when migrating. In the DVR case, the floating ip is actually reassigned from one compute node to another during migration.

SSH is based on TCP, so the connection breaks only if the disruption lasts long enough to cause a timeout. I wonder whether extending the timeout would make the disruption invisible to the application. If that were the case, then yes... this is most likely caused by the time it takes Neutron to react to the live migration event and reprovision resources on the landing host.

@Oleg: Ensuring that DVR namespaces are on the 'landing' host before the VM gets actually migrated (what Swami is referring to as 'pre-live-migration-step') would indeed help the DVR case, because that's instrumental for the floating IP traffic to flow through the VM's fixed IP. Wouldn't you agree?

Changed in neutron:
importance: Undecided → High

I do believe this needs some extra coordination with Nova, but I don't quite have the full state machine picture of the live migration action.

Changed in nova:
status: New → Confirmed
Oleg Bondarev (obondarev) wrote :

@Armando, I'm sure it will improve things if we wait while dvr namespaces are created on the destination host, I'm just not sure it would be enough to avoid connection loss. Anyway it's worth a try

Oleg Bondarev (obondarev) wrote :

Some questions about floating ip migration which I think are worth discussing within neutron team before starting a cross-project discussion on this:
 - what is the right time for a floating ip to be created on the destination host, if there is one in the DVR scenario?
 - at the pre-migration step the Floating IP should be active on the source host, where the VM is still located, so we can't prepare everything on the destination host at pre-migration (even if nova would wait for it). Or can we?
 - can we have the same Floating IP on two nodes at the same time, and what are the side effects?
 - if we can, will it help (and how) to not lose the ssh connection during live migration?
 - if we can't, what is the right moment for the floating ip transition?
 - if we have a notification from nova to neutron at the right moment (when the VM is actually "migrating" to the destination), will it be enough to preserve the ssh connection? (in this case the neutron server should notify the l3 agent and the agent needs to process the router, which currently means going back to the server for router info)

Oleg, in my opinion the right time to create the floatingip infrastructure is while the VM is planning to migrate, before it actually migrates.

1. If we get the "future_host" for the migration from nova, we can prepare that host for the fip migration:
       Create the Router namespace
       Create the FIP namespace
       Associate the Router and FIP namespaces.
   I have made some headway on this in this patch:
   https://review.openstack.org/#/c/259171/

2. For this to work, we have to track the port with respect to the "old_host", "cur_host" and "new_host" (or "future_host").
   For this I would suggest that we change the port-binding table to handle all "host" changes.
   In this case the old_host and the cur_host can be the same. The new_host denotes where the port is intended to move. Once we get this information, the server can pre-populate the details and send them to the agent to create the fip namespace.
   I have already created a patch to address this:
   https://review.openstack.org/#/c/259299/

3. One open question is whether we need a different type of "event_notifier" for the port, such as "MIGRATE_START" or "MIGRATE_END", or whether we reuse the existing "UPDATE_PORT" and "BEFORE_UPDATE" for this. -- This should be considered.

4. With all this infrastructure in place, when NOVA notifies us before "pre-migration" to set up the L3, we can go ahead and create it.

5. If there are any issues on the neutron side, we can notify NOVA that the network is not ready for migration, and NOVA should take the necessary action.

6. If everything is fine, we send an "OK" message, and NOVA proceeds with the migration.

7. If NOVA errors out, it should send a reply back to us about its state, and we should revert the state on our side.

Please let me know if you have any other questions.
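
The handshake proposed in items 1 to 7 can be sketched as a small state model. This is only an illustration of the proposal as described in this comment: PortBinding, NeutronSide and live_migrate are hypothetical names, not Nova or Neutron classes, and the "OK" / "NOT_READY" strings stand in for whatever message format would actually be used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortBinding:
    """Hypothetical port-binding record that tracks host changes (item 2)."""
    port_id: str
    cur_host: str
    old_host: Optional[str] = None
    new_host: Optional[str] = None  # the "future_host" reported by Nova


class NeutronSide:
    """Stand-in for the Neutron server; not a real Neutron class."""

    def __init__(self, fail_hosts=()):
        self.fail_hosts = set(fail_hosts)  # hosts where setup is made to fail
        self.prepared = set()  # (port_id, host) pairs with namespaces ready

    def prepare(self, binding, future_host):
        # Items 1 and 4: record the destination and pre-create the router
        # and FIP namespaces there (modelled here as a set entry).
        if future_host in self.fail_hosts:
            return "NOT_READY"  # item 5: network is not ready
        binding.new_host = future_host
        self.prepared.add((binding.port_id, future_host))
        return "OK"  # item 6

    def revert(self, binding):
        # Item 7: Nova reported a failure, so undo the preparation.
        self.prepared.discard((binding.port_id, binding.new_host))
        binding.new_host = None

    def complete(self, binding):
        # The port has landed on the destination host.
        binding.old_host, binding.cur_host = binding.cur_host, binding.new_host
        binding.new_host = None


def live_migrate(neutron, binding, dest, migrate):
    """Nova-side flow: ask Neutron first, migrate only on "OK"."""
    if neutron.prepare(binding, dest) != "OK":
        return "aborted"
    try:
        migrate()
    except RuntimeError:
        neutron.revert(binding)
        return "reverted"
    neutron.complete(binding)
    return "migrated"
```

The point of the model is the ordering: the destination is prepared while cur_host still names the source, and the binding only switches hosts after the migration callable returns.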

Here is the event flow diagram between Nova and Neutron for Notification on Live Migration.

https://drive.google.com/file/d/0B4kh-7VVPWlPZkpHMExMMHBYTTA/view?usp=sharing

Paul Murray (pmurray) wrote :

Nova already has a method to set up networks on the destination host during the pre-live-migration phase. See the pre_live_migration() method in nova/compute/manager.py.

For neutron the pre_live_migration() method is empty - but for nova networking it is used to set up the networks on the destination in advance of the migration. I think this method should be defined to call neutron to set up the networks as you want for DVR. So long as this is not a very long operation it would be appropriate for nova to block at this point waiting for the network setup to be done.

If the pre_live_migration() method throws an exception the migration is rolled back. I think that would provide all that is needed for the nova side.

Paul Murray (pmurray) wrote :

Oops, correction to above comment:

The pre_live_migration() method calls setup_networks_on_host() in nova's internal network api. It is the version of that method in the api specialised for neutron that is empty. The nova network version of that method does something.

We need to populate setup_networks_on_host() with code that will call neutron. That method is in the file: nova/network/neutronv2/api.py

Paul thanks for your suggestions.
Seems more promising.

Paul, I was just looking at "setup_networks_on_host" in nova/network/neutronv2/api.py. That is where the api is not implemented.

Do you still think it should be handled through an API instead of a notifier?
Today we don't expose network readiness through an API. Instead of adding an API, it would be cleaner to have a notifier message similar to the one we already have from Neutron to Nova.

Please let me know your thoughts.

Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
status: Confirmed → In Progress
Changed in nova:
status: Confirmed → In Progress
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)

Change abandoned by Swaminathan Vasudevan (<email address hidden>) on branch: master
Review: https://review.openstack.org/259299
Reason: Abandon this patch, since I have an alternate one to address the same issue.

Changed in neutron:
milestone: none → mitaka-3
Matt Riedemann (mriedem) on 2016-02-25
tags: added: live-migration network neutron
Changed in neutron:
milestone: mitaka-3 → mitaka-rc1
tags: added: mitaka-rc-potential
Matt Riedemann (mriedem) wrote :

This is not a regression in mitaka, it's just a latent bug for a feature that's already existed.

tags: removed: mitaka-rc-potential
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Armando Migliaccio (armando-migliaccio)
Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → Swaminathan Vasudevan (swaminathan-vasudevan)

Reviewed: https://review.openstack.org/275420
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f0bdb798fa14b7bd5649d98789e71803127dd9f7
Submitter: Jenkins
Branch: master

commit f0bdb798fa14b7bd5649d98789e71803127dd9f7
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 2 12:44:16 2016 -0800

    DVR:Pro-active router creation with live migration

    Today DVR routers are created after a dvr service port is
    seen on a given node. But in the case of instance live
    migration, the creation of l3 routed networks on the
    destination node is delayed since we react to the event.

    This patch tries to proactively create routers on the
    destination node based on the portbinding profile info
    updated by the nova when the instance is on a pre-migration
    state.

    Nova calls setup_networks_on_host during the pre-migration
    phase and we update the portbinding profile dict with
    an attribute 'migrating_to' as shown below

    port:{'binding:profile':{'migrating_to': 'host'}}

    where 'host' points to the 'destination' of the port.

    L3 plugin will verify the migration profile for the port on
    any port update and then take action to create routers in the
    respective agents if routers have not been created.

    If the live migration fails or if reverted, then the port
    binding profile attribute 'migrating_to' will be cleared from
    the port profile. In this case, the router and the fip namespace
    may be created on the destination node, but since the VM did
    not land on the destination node, it would not cause any issues,
    since the traffic will still be flowing out from the origination
    node, except for the existence of the router and fip namespace.

    For some reason if the creation of the router namespace and fip
    namespace fails, then the live migration may still proceed as
    it is now and the agent will create the router namespace and fip
    namespace reactively.

    The case where we report status back to Nova and Nova reacting
    to the setup_networks_on_host status will be handled in the
    upcoming release.

    This patch should not affect any upgrades with respect to the
    agent or server.

    Change-Id: Ibb62f012333cfdfd468bafdc0b4501aa46b4b54d
    Related-Bug: #1456073
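
The server-side check described in this commit message can be illustrated as follows. This is a sketch over plain dicts, not the code from the patch: both function names and the routers_on_host mapping are hypothetical, and skipping the case where 'migrating_to' names the port's current host is an assumption.

```python
def migration_destination(port):
    """Return the host named by 'migrating_to', or None.

    port is a plain dict shaped like a Neutron port, for example
    {'binding:host_id': 'nodeA',
     'binding:profile': {'migrating_to': 'nodeB'}}.
    """
    profile = port.get('binding:profile') or {}
    dest = profile.get('migrating_to')
    # Nothing to do when the key is absent, empty, or (assumption)
    # names the host the port is already bound to.
    if not dest or dest == port.get('binding:host_id'):
        return None
    return dest


def routers_to_create(port, router_ids, routers_on_host):
    """Return the router IDs still missing on the migration destination.

    routers_on_host maps a host name to the set of router IDs already
    present there (a hypothetical stand-in for the scheduler state).
    """
    dest = migration_destination(port)
    if dest is None:
        return []
    present = routers_on_host.get(dest, set())
    return [router for router in router_ids if router not in present]
```

Once the 'migrating_to' attribute is cleared, for example after a failed migration, routers_to_create() returns an empty list and any namespaces already created on the destination are simply left in place, which matches the behaviour the commit message describes.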

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/260738
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c04587110fef24c189d439bacbbbc7105085cbe1
Submitter: Jenkins
Branch: master

commit c04587110fef24c189d439bacbbbc7105085cbe1
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Dec 22 13:40:33 2015 -0800

    DVR: Agent side change for live migration with floatingip

    During live migration when an instance is in a pre-migration
    state, and if the fixed_ip of the port has an associated
    floatingip, the floatingip namespace should be created in
    the destination node, before the VM instance lands on the
    node.

    The server side code will handle the pre-live migration case
    and initiate the router creation on the destination node and
    also will provide the 'dest_host' as an additional attribute to
    the floatingips dictionary that is being passed to the agent.

    So this patch reads the 'dest_host' and the 'host' variable
    and if any of the two matches with the host, it will allow
    the floatingip to be processed.

    This will be an agent side change for addressing the vm
    migration with Floatingip enabled.

    Closes-Bug: #1456073
    Change-Id: Idfbea7f3c66d6a1df5d3050912d620591c69b614
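
The agent-side rule in this commit message (process a floating IP when either its 'host' or its 'dest_host' matches the agent's host) can be sketched in a few lines. The function name is hypothetical and the floating IPs are modelled as plain dicts; only the 'host' and 'dest_host' keys come from the commit message.

```python
def floating_ips_for_host(floating_ips, agent_host):
    """Keep the floating IPs whose 'host' or 'dest_host' is agent_host."""
    return [
        fip for fip in floating_ips
        if agent_host in (fip.get('host'), fip.get('dest_host'))
    ]
```

During a migration from nodeA to nodeB, a floating IP carrying host 'nodeA' and dest_host 'nodeB' passes the filter on both agents, so the floating IP namespace can exist on the destination before the VM lands there.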

Reviewed: https://review.openstack.org/292916
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ccbef43f5bd1f7c393d7ccc90e2ba79f42f3e1e6
Submitter: Jenkins
Branch: master

commit ccbef43f5bd1f7c393d7ccc90e2ba79f42f3e1e6
Author: Armando Migliaccio <email address hidden>
Date: Tue Mar 15 07:31:39 2016 -0700

    Improve release notes for dvr fixes

    Change-Id: Ida1165add974207a4ea25696d26e1daae7914288
    Related-bug: #1456073

This issue was fixed in the openstack/neutron 8.0.0.0rc1 release candidate.

I believe Live migration fixes are a priority for Mitaka

tags: added: mitaka-rc-potential
Changed in nova:
importance: Undecided → High
Matt Riedemann (mriedem) wrote :

Live migration is a priority, but this is an old latent bug and we could consider backporting the fix as long as there aren't RPC API changes required, which I don't think there are.

tags: added: mitaka-backport-potential
removed: mitaka-rc-potential
Paul Murray (pmurray) on 2016-09-07
tags: added: newton-rc-potential

Reviewed: https://review.openstack.org/275073
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=55f3d476a12dce8a70d3e485f0f2f9c752cf0b3d
Submitter: Jenkins
Branch: master

commit 55f3d476a12dce8a70d3e485f0f2f9c752cf0b3d
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 2 00:35:24 2016 -0800

    Implement setup_networks_on_host for Neutron networks

    setup_networks_on_host has not been implemented for
    neutron networks and only implemented for nova net.

    In order to address the L3 network issues related to the
    live-migration in neutron, 'setup_networks_on_host' should
    be implemented in the neutronv2/api.

    This patch implements the function and updates the portbinding
    profile dictionary with the 'migrating_to' key pointing to the
    destination host in pre-live migration phase.

    port:{'binding:profile':{'migrating_to': 'host'}}

    When migrate_instance_finish() is called, it should clear
    the migration profile before binding the host to the destination
    port to prevent neutron from taking any action when the port-update
    happens after the port migrates.

    Based on the port profile update with the destination host,
    the neutron will be able to create any associated L3 networks
    on the destination host.

    Further work is planned to issue a status update notification
    to nova during the pre-live migration phase after the
    L3 networks have been created on the destination host and
    before the port lands on the destination host. This will
    be addressed in a different patch, since we don't have such
    wait state in nova at this time.

    The neutron side changes are handled in different patch sets shown
    below [1] server side and [2] agent side.

    [1] https://review.openstack.org/#/c/275420/
    [2] https://review.openstack.org/#/c/260738/

    NOTE: Older versions of neutron may ignore the new port binding
    migrating profile information.

    Change-Id: Ib1cc44bf9d01baf4d1f1d26c2a368a5ca7c6ab68
    Partial-Bug: #1456073
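
The two port updates described in this commit message can be illustrated by building the request bodies as plain dicts. This is a sketch, not the code from the patch: the function names are hypothetical, and sending the cleared profile and the new host binding in a single body is an assumption, since the commit message does not say whether they are one request or two.

```python
import copy


def port_update_for_pre_migration(port, dest_host):
    """Build an update body that sets 'migrating_to' to dest_host."""
    profile = copy.deepcopy(port.get('binding:profile') or {})
    profile['migrating_to'] = dest_host
    return {'port': {'binding:profile': profile}}


def port_update_for_finish(port, dest_host):
    """Build an update body that clears 'migrating_to' and binds the port.

    Assumption: the cleared profile and the new 'binding:host_id' travel
    in one update body.
    """
    profile = copy.deepcopy(port.get('binding:profile') or {})
    profile.pop('migrating_to', None)
    return {'port': {'binding:host_id': dest_host,
                     'binding:profile': profile}}
```

Both helpers copy the existing profile first, so unrelated keys already stored in 'binding:profile' are preserved and the caller's port dict is not modified.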

Matt Riedemann (mriedem) on 2016-09-12
tags: removed: mitaka-backport-potential newton-rc-potential

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/367646
Reason: Hello Swaminathan,

stable/mitaka has now entered phase II support [1][2], only accepting critical bugfixes and security patches. As this review does not meet these criteria it is being abandoned at this time.

However please reopen this review if you feel it is still suitable for stable/mitaka and the nova-stable-maint team will revisit this decision.

[1] http://docs.openstack.org/project-team-guide/stable-branches.html#support-phases
[2] https://releases.openstack.org/#release-series

Jay Pipes (jaypipes) wrote :

The fix for this was merged in Newton. Marking as Fix Released.

Changed in nova:
status: In Progress → Fix Released