live-migration causes VM network disconnected forever

Bug #1367999 reported by Li Ma
This bug affects 1 person
Affects   Status         Importance   Assigned to        Milestone
neutron   Fix Released   High         Darragh O'Reilly
Juno      Fix Released   High         Arnaud Morin

Bug Description

OS: RHEL 6.5
OpenStack: RDO icehouse and master
Neutron: Linuxbridge + VxLAN + L2pop
Testbed: 1 controller node + 2 compute nodes + 1 network node

Reproduction procedure:

1. Start pinging the VM from the qrouter namespace using its fixed IP.
    Start pinging the VM from outside using its floating IP.

2. Live-migrate the VM from compute1 to compute2.

3. The VM loses network connectivity after several seconds.

4. Even after Nova reports that the migration has finished,
ping still does not work.

Debug Info on network node:

Command: ['sudo', 'bridge', 'fdb', 'add', 'fa:16:3e:b3:fd:27', 'dev', 'vxlan-1', 'dst', '192.168.2.103']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: File exists\n'

Cause:
Before migration, the original fdb entry is already present. After migration, l2pop updates the VM's fdb entry
by adding a new one, and the 'add' fails with the error above because the old entry still exists.

The correct operation is 'replace', not 'add'.

Note that 'replace' also safely adds the new entry if no old entry exists.
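
The difference between the two operations, as described above, can be sketched with a toy in-memory model. This is only an illustration of the semantics reported in this bug, not the kernel's or Neutron's implementation: the class `FdbModel` is hypothetical, and the pre-migration address 192.168.2.102 is an assumed value (only 192.168.2.103 appears in the debug output).

```python
class FdbModel:
    """Toy stand-in for the forwarding database of one VXLAN device."""

    def __init__(self):
        self.entries = {}  # MAC address -> remote VTEP IP

    def add(self, mac, dst):
        # Modelled on 'bridge fdb add': refuses to overwrite an existing entry.
        if mac in self.entries:
            raise FileExistsError("RTNETLINK answers: File exists")
        self.entries[mac] = dst

    def replace(self, mac, dst):
        # Modelled on 'bridge fdb replace': creates the entry or overwrites it.
        self.entries[mac] = dst


fdb = FdbModel()
mac = "fa:16:3e:b3:fd:27"

# Before migration: the entry points at the old compute node (assumed IP).
fdb.add(mac, "192.168.2.102")

# After migration: a second 'add' for the same MAC is rejected.
try:
    fdb.add(mac, "192.168.2.103")
    add_failed = False
except FileExistsError:
    add_failed = True

# 'replace' succeeds whether or not an entry already exists.
fdb.replace(mac, "192.168.2.103")
```

In this model the failed 'add' leaves the stale destination in place, which matches the symptom that traffic keeps going to the old host.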

I think this bug can be marked as High.

Li Ma (nick-ma-z)
Changed in neutron:
assignee: nobody → Li Ma (nick-ma-z)
description: updated
Li Ma (nick-ma-z)
description: updated
Eugene Nikanorov (enikanorov) wrote :

Please attach more logs from the agent side to give more context.

tags: added: ovs
Changed in neutron:
importance: Undecided → High
tags: added: lb
removed: ovs
Changed in neutron:
status: New → Incomplete
Mathieu Rohon (mathieu-rohon) wrote :

Hi Li Ma,

Do you still have the bug in Juno?

This should not happen, because the fdb entry is removed when the new lb_agent requests the port details (the port moves to the BUILD state). The new fdb entry is added when the new lb_agent reports that the port is UP again.

tags: added: l2-pop
Darragh O'Reilly (darragh-oreilly) wrote :

Sometimes I see fdb entries being learned before the static entry is added. The presence of the learned entry then causes the addition of the static entry to fail with a message like the one above. Things work for a while, until the learned entry gets removed.

Mathieu Rohon (mathieu-rohon) wrote :

Hmm, interesting!

Which version are you using? Do you have logs?
In your case, does it also happen at live migration?

A possible scenario is that your bridge performs MAC learning because it receives a packet before receiving the fdb message from the l2pop RPC. The learned entry then ages out, because a learned entry is not a permanent one.
But since linuxbridge activates the ARP responder, MAC learning won't happen again... and your bridge no longer has the fdb entry.
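
The scenario above can be walked through with a small simulation. It is a sketch under simplifying assumptions, not the bridge's real behaviour: the `Fdb` class, its `age_out` method, and the address values are hypothetical, and aging is reduced to "drop every non-permanent entry".

```python
MAC = "fa:16:3e:b3:fd:27"
VTEP = "192.168.2.103"


class Fdb:
    """Toy forwarding database that distinguishes learned and permanent entries."""

    def __init__(self):
        self.entries = {}  # MAC -> (remote VTEP IP, is_permanent)

    def learn(self, mac, dst):
        # A packet arrives before the l2pop message: a dynamic entry is created.
        self.entries.setdefault(mac, (dst, False))

    def add(self, mac, dst):
        # Modelled on 'bridge fdb add': fails if any entry already exists.
        if mac in self.entries:
            raise FileExistsError("RTNETLINK answers: File exists")
        self.entries[mac] = (dst, True)

    def replace(self, mac, dst):
        # Modelled on 'bridge fdb replace': overwrites with a permanent entry.
        self.entries[mac] = (dst, True)

    def age_out(self):
        # Simplified aging: every learned (non-permanent) entry disappears.
        self.entries = {m: e for m, e in self.entries.items() if e[1]}


def entry_survives(operation):
    """Run learn -> l2pop update -> aging; report whether the MAC is still known."""
    fdb = Fdb()
    fdb.learn(MAC, VTEP)
    try:
        getattr(fdb, operation)(MAC, VTEP)
    except FileExistsError:
        pass  # the agent only logs the error; the entry stays dynamic
    fdb.age_out()
    return MAC in fdb.entries


survives_with_add = entry_survives("add")
survives_with_replace = entry_survives("replace")
```

With 'add', the model ends up with no entry once the learned one ages out, which is the "works for a while, then stops" behaviour described in this thread; with 'replace', the permanent entry remains.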

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/165137

Changed in neutron:
assignee: Li Ma (nick-ma-z) → Darragh O'Reilly (darragh-oreilly)
status: Incomplete → In Progress
tags: added: icehouse-backport-potential juno-backport-potential
tags: added: kilo-rc1
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/165137
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1476ee63ad9251d8956f6c3c9aece18c351a3ba9
Submitter: Jenkins
Branch: master

commit 1476ee63ad9251d8956f6c3c9aece18c351a3ba9
Author: Darragh O'Reilly <email address hidden>
Date: Tue Mar 17 16:03:51 2015 +0000

    lb-agent: use 'replace' instead of 'add' with 'bridge fdb'

    l2pop on the linuxbridge agent can fail to add permanent entries
    because the 'bridge fdb add' command fails if a temporary entry
    exists. This patch uses 'replace' which always works.

    Closes-Bug: 1367999
    Change-Id: I4371f508ad23d96de950634b4a90218ea474f3f0
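
In terms of the command shown in the bug description, the change described in this commit message amounts to swapping one word. A minimal sketch, assuming a hypothetical helper named `build_fdb_cmd` (this is not the agent's actual code, only an illustration of the resulting argument list):

```python
def build_fdb_cmd(mac, dev, dst, operation="replace"):
    """Build the argument list for a 'bridge fdb' call (hypothetical helper)."""
    if operation not in ("add", "replace"):
        raise ValueError("unsupported fdb operation: %s" % operation)
    return ["sudo", "bridge", "fdb", operation, mac, "dev", dev, "dst", dst]


# The failing command from the bug description used 'add' ...
old_cmd = build_fdb_cmd("fa:16:3e:b3:fd:27", "vxlan-1", "192.168.2.103",
                        operation="add")

# ... while the patched behaviour corresponds to 'replace'.
new_cmd = build_fdb_cmd("fa:16:3e:b3:fd:27", "vxlan-1", "192.168.2.103")
```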

Changed in neutron:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/167948

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/juno)

Reviewed: https://review.openstack.org/167948
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9e01647ad328301b6e028752636080efa1936bf4
Submitter: Jenkins
Branch: stable/juno

commit 9e01647ad328301b6e028752636080efa1936bf4
Author: Darragh O'Reilly <email address hidden>
Date: Tue Mar 17 16:03:51 2015 +0000

    lb-agent: use 'replace' instead of 'add' with 'bridge fdb'

    l2pop on the linuxbridge agent can fail to add permanent entries
    because the 'bridge fdb add' command fails if a temporary entry
    exists. This patch uses 'replace' which always works.

    Closes-Bug: 1367999
    Change-Id: I4371f508ad23d96de950634b4a90218ea474f3f0
    (cherry picked from commit 1476ee63ad9251d8956f6c3c9aece18c351a3ba9)

tags: added: in-stable-juno
Kyle Mestery (mestery)
Changed in neutron:
milestone: none → kilo-rc1
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-rc1 → 2015.1.0