floating ip not reachable after vm migration

Bug #1585165 reported by Rossella Sblendido
This bug affects 8 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Rossella Sblendido

Bug Description

On a cloud running Liberty, a VM is assigned a floating IP. The VM is live migrated and the floating IP is no longer reachable from outside the cloud. Steps to reproduce:

1) spawn a VM
2) assign a floating IP
3) live migrate the VM
4) ping the floating IP from outside the cloud

The problem seems to be that both the node that was hosting the VM before the migration and the node that hosts it now answer the ARP request:

admin:~ # arping -I eth0 10.127.128.12
ARPING 10.127.128.12 from 10.127.0.1 eth0
Unicast reply from 10.127.128.12 [FA:16:3E:C8:E6:13] 305.145ms
Unicast reply from 10.127.128.12 [FA:16:3E:45:BF:9E] 694.062ms
Unicast reply from 10.127.128.12 [FA:16:3E:45:BF:9E] 0.964ms

On the compute node that was hosting the VM:

root:~ # sudo ip netns exec fip-c622fafe-c663-456a-8549-ebd3dbed4792 ip route
default via 10.127.0.1 dev fg-c100b010-af
10.127.0.0/16 dev fg-c100b010-af proto kernel scope link src 10.127.128.3
10.127.128.12 via 169.254.31.28 dev fpr-7d1a001a-9

On the node that is now hosting the VM:

root:~ # sudo ip netns exec fip-c622fafe-c663-456a-8549-ebd3dbed4792 ip route
default via 10.127.0.1 dev fg-e532a13f-35
10.127.0.0/16 dev fg-e532a13f-35 proto kernel scope link src 10.127.128.8 9
10.127.128.12 via 169.254.31.28 dev fpr-7d1a001a-9

the entry "10.127.128.12" is present in both nodes. That happens because when the VM is migrated no clean up is triggered on the source host. Restarting the l3 agent fixes the problem because the stale entry is removed.

Changed in neutron:
importance: Undecided → High
assignee: nobody → Rossella Sblendido (rossella-o)
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

From the description, I know that DVR is involved here. This is a known issue that Swami has been working on. I know that it was discussed in the Nova mid-cycle in January and at a few other times. I'm going to assign this bug to Swami for inspection and triage. I suspect he'll mark it as a duplicate of some bug.

Changed in neutron:
assignee: Rossella Sblendido (rossella-o) → Swaminathan Vasudevan (swaminathan-vasudevan)
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

This bug seems to me a duplicate of https://bugs.launchpad.net/neutron/+bug/1456073.

There were a couple of patches for this bug that were merged in Mitaka.

https://review.openstack.org/#/c/275420/
https://review.openstack.org/#/c/260738/

There is still a patch in Nova addressing this issue that has not merged:
https://review.openstack.org/#/c/275073/

I still need to triage this bug and see whether it is still exhibiting the same behavior. The issue I noticed with live migration is that we send the GARP from the floating IP namespace created on the new node; the external entity trying to reach the floating IP will then try both nodes, and the attempt against the first node should fail (since the floating IP has already been removed there) while the attempt against the second node succeeds.

Revision history for this message
Rossella Sblendido (rossella-o) wrote :

Swami, I don't think it's a duplicate. I can reproduce it every time on Liberty.
I don't think the GARP from the new node is to blame.
I get ARP replies for the same floating IP from both nodes (the new one and the old one) even long after the migration.

Removing the following route from the fip namespace of the old host solves the issue:

10.127.128.12 via 169.254.31.28 dev fpr-7d1a001a-9

This is the route for the floating IP of the VM that migrated.

NOTE: pinging the FIP from nodes inside the cloud always works. To reproduce, you need to ping from an external host.

Revision history for this message
Rossella Sblendido (rossella-o) wrote :

Swami, to fix this I think we need to detect that the migration has completed and implement a cleanup. What do you think?

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Rossella, you are right.
https://github.com/openstack/neutron/blob/master/neutron/db/l3_dvrscheduler_db.py#L420
This is the line where we check whether the port host binding has changed.

But currently this is checked only for ports that do not have a device_owner, and only router-removal conditions are checked.

We might have to check whether the associated port has floating IPs on the original host and, if so, remove them from there.

It should probably call 'l3plugin.delete_floatingip()' in order to notify the right host that is hosting the floating IP.
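
A minimal sketch of the kind of check being discussed, assuming it hooks into the host-binding comparison referenced above. The callable passed in as notify_fip_cleanup_on_host is a hypothetical placeholder for whatever notification mechanism the L3 plugin ends up using; get_floatingips() with a port_id filter is the existing L3 DB query:

def _clean_up_fips_after_migration(l3plugin, context, original_port, new_port,
                                   notify_fip_cleanup_on_host):
    """Sketch only: if the port's host binding changed, ask the source
    host to drop the floating IP state it still holds for that port."""
    old_host = original_port.get('binding:host_id')
    new_host = new_port.get('binding:host_id')
    if not old_host or old_host == new_host:
        return  # not a migration, nothing to clean up

    # Floating IPs currently associated with the migrated port.
    fips = l3plugin.get_floatingips(
        context, filters={'port_id': [original_port['id']]})
    for fip in fips:
        # Hypothetical notification: tell the L3 agent on the old host that
        # this floating IP moved, so the stale fip namespace route is removed.
        notify_fip_cleanup_on_host(context, fip, old_host)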

tags: added: l3-dvr-backlog
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/327551

Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Rossella Sblendido (rossella-o)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/327551
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a1f06fd707ffe663e09f2675316257c8dc528d47
Submitter: Jenkins
Branch: master

commit a1f06fd707ffe663e09f2675316257c8dc528d47
Author: rossella <email address hidden>
Date: Wed Jun 8 17:18:51 2016 +0200

    After a migration clean up the floating ip on the source host

    When a VM is migrated that has a floating IP associated, the L3
    agent on the source host should be notified when the migration
    is over. If the router on the source host is not going to be
    removed (there are other ports using it) then we should notify
    that the floating IP needs to be cleaned up.

    Change-Id: Iad6fbad06cdd33380ef536e6360fd90375ed380d
    Closes-bug: #1585165

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/328464

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/329107

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/328464
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=83ca27f52ae994660eb9cb9cd040dc409b417452
Submitter: Jenkins
Branch: stable/mitaka

commit 83ca27f52ae994660eb9cb9cd040dc409b417452
Author: rossella <email address hidden>
Date: Wed Jun 8 17:18:51 2016 +0200

    After a migration clean up the floating ip on the source host

    When a VM is migrated that has a floating IP associated, the L3
    agent on the source host should be notified when the migration
    is over. If the router on the source host is not going to be
    removed (there are other ports using it) then we should notify
    that the floating IP needs to be cleaned up.

    Change-Id: Iad6fbad06cdd33380ef536e6360fd90375ed380d
    Closes-bug: #1585165
    (cherry picked from commit a1f06fd707ffe663e09f2675316257c8dc528d47)

tags: added: in-stable-mitaka
tags: added: in-stable-liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/329107
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a6e020cc199df0eafe7b4fdcd92fa74881d27dfa
Submitter: Jenkins
Branch: stable/liberty

commit a6e020cc199df0eafe7b4fdcd92fa74881d27dfa
Author: rossella <email address hidden>
Date: Wed Jun 8 17:18:51 2016 +0200

    After a migration clean up the floating ip on the source host

    When a VM is migrated that has a floating IP associated, the L3
    agent on the source host should be notified when the migration
    is over. If the router on the source host is not going to be
    removed (there are other ports using it) then we should notify
    that the floating IP needs to be cleaned up.

    Closes-bug: #1585165
    (cherry picked from commit a1f06fd707ffe663e09f2675316257c8dc528d47)

    Conflicts:
     neutron/db/l3_dvrscheduler_db.py
     neutron/tests/unit/scheduler/test_l3_agent_scheduler.py
    Change-Id: Iad6fbad06cdd33380ef536e6360fd90375ed380d

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b2

This issue was fixed in the openstack/neutron 9.0.0.0b2 development milestone.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.1.2

This issue was fixed in the openstack/neutron 7.1.2 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 8.2.0

This issue was fixed in the openstack/neutron 8.2.0 release.

Revision history for this message
Luan Nguyen (nguyenhuuluan434-w) wrote :

Hi All,
I have the same problem even though I have applied the patch on the neutron 7.1.2 stable/liberty branch. When I live-migrate, resize, or migrate a VM and dump the traffic on the physical port, I only see ARP reply packets from the source and destination hosts after the operation:

14:01:52.781722 ARP, Reply 62.28.227.53 is-at fa:16:3e:da:0f:b3 (oui Unknown), length 28

I think the packet sent out by the destination host needs to be a gratuitous ARP request (broadcast), like:

14:05:39.165559 ARP, Request who-has 62.28.227.53 (Broadcast) tell 62.28.227.53, length 46

I run the command below manually to send a broadcast GARP so the physical switch updates its ARP cache; after that, connectivity to the VM returns to normal:

ip netns exec fip-fd01c38f-3604-4302-b5c8-6354522eb030 arping -U -I fg-51508611-a5 -c 1 62.28.227.53

Thanks for helping me.
Luan.

Revision history for this message
Luan Nguyen (nguyenhuuluan434-w) wrote :

Hi all,
Some more info about this problem. Looking at the code, the function that sends the GARP when a VM move finishes (migrate, live-migrate, or resize) runs the arping command with the -A option:

def _arping(ns_name, iface_name, address, count):
    # Pass -w to set timeout to ensure exit if interface removed while running
    arping_cmd = ['arping', '-A', '-I', iface_name, '-c', count,
                  '-w', 1.5 * count, address]
    try:
        ip_wrapper = IPWrapper(namespace=ns_name)
        ip_wrapper.netns.execute(arping_cmd, check_exit_code=True)
    except Exception:
        msg = _LE("Failed sending gratuitous ARP "
                  "to %(addr)s on %(iface)s in namespace %(ns)s")
        LOG.exception(msg, {'addr': address,
                            'iface': iface_name,
                            'ns': ns_name})

With this option, the destination host (and every L3 agent in the system) sends ARP reply packets like this:
14:01:52.781722 ARP, Reply 62.28.227.53 is-at fa:16:3e:da:0f:b3 (oui Unknown), length 28

but my physical switch does not update its ARP table on these.

I think that after the VM move finishes, the L3 agent needs to send two types of ARP:
 + first, arping with the -A option, to send ARP reply packets;
 + second, arping with the -U option, to send broadcast ARP requests;
because which type of ARP is accepted for updating the ARP table depends on the physical switch; a sketch of this follows below.
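
A minimal sketch of that idea, reusing the same IPWrapper and LOG helpers as the snippet quoted above (this is only an illustration of the suggestion, not the actual neutron patch): run arping twice, once with -A for unsolicited replies and once with -U for broadcast requests, so a switch that ignores one form still learns the new location from the other.

def _arping_reply_and_broadcast(ns_name, iface_name, address, count):
    # Sketch: send the gratuitous ARP both as unsolicited replies (-A) and
    # as broadcast requests (-U); which form refreshes the ARP/CAM table
    # depends on the physical switch.
    for mode in ('-A', '-U'):
        arping_cmd = ['arping', mode, '-I', iface_name, '-c', count,
                      '-w', 1.5 * count, address]
        try:
            ip_wrapper = IPWrapper(namespace=ns_name)
            ip_wrapper.netns.execute(arping_cmd, check_exit_code=True)
        except Exception:
            LOG.exception("Failed sending gratuitous ARP (%s) to %s on %s "
                          "in namespace %s", mode, address, iface_name, ns_name)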

This problem also occurs with the steps below:
 + attach a floating IP to instance A
 + ping the floating IP: OK
 + detach the floating IP
 + attach the floating IP to instance B (on a different compute host than instance A)
 + ping the floating IP: it fails until the physical switch is updated (either by manually running arping with the -U option or when the switch's ARP table entry expires)

Thanks for helping,
Luan.

description: updated