live migration does not coordinate VM resume with network readiness

Bug #1511430 reported by Miguel Angel Ajo
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

When live-migrating a VM from one host to another in a deployment that uses neutron, the VM can resume on the destination host before its network is ready (a race condition).

QEMU has a mechanism to send a few RARPs once migration is done and before resuming.

Nova needs to coordinate with QEMU and neutron (through the nova/neutron notification mechanism) so that the VM is resumed on the destination host only after its networking has been properly wired. Otherwise the RARPs are lost, and connectivity to the VM is disrupted until the VM sends a broadcast message.
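
The ordering requested here can be sketched in a few lines of Python. This is only an illustration of the idea, not nova code: the class `ResumeGate` and its methods `on_network_ready` and `resume_when_ready` are hypothetical names, and a `threading.Event` stands in for the nova/neutron notification.

```python
import threading


class ResumeGate:
    """Hypothetical helper: keep the guest paused until networking is wired."""

    def __init__(self):
        self._network_ready = threading.Event()
        self.events = []  # ordered record of what happened, for inspection

    def on_network_ready(self):
        # Stand-in for the point where neutron's notification reaches nova.
        self.events.append("network-ready")
        self._network_ready.set()

    def resume_when_ready(self, resume_fn, timeout):
        # Block until the network is reported ready, or give up after
        # `timeout` seconds without resuming the guest.
        if not self._network_ready.wait(timeout):
            self.events.append("timeout")
            return False
        resume_fn()  # only now would the guest run and send its RARPs
        self.events.append("resumed")
        return True
```

With this ordering the RARPs are sent only once the port is wired. What should happen on timeout (resume anyway, or abort the migration) is a policy decision this sketch leaves open.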

Log detail (merged from the logs and tcpdumps of both hosts):

Migration from host 29 to host 30:

2015-10-29 10:54:27.592000 [VMLIFE30] 21476 INFO nova.compute.manager [-] [instance: a18a5824-4215-4e24-bcfc-cb9f89f6bcbd] VM Resumed (Lifecycle Event)
2015-10-29 10:54:27.609000 [VMLIFE29] 29022 INFO nova.compute.manager [-] [instance: a18a5824-4215-4e24-bcfc-cb9f89f6bcbd] VM Paused (Lifecycle Event)
2015-10-29 10:54:27.636000 [TAP30] tcpdump DEBUG 10:54:27.632047 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46
2015-10-29 10:54:27.656000 [TAP29] tcpdump DEBUG tcpdump: pcap_loop: The interface went down

2015-10-29 10:54:27.787000 [TAP30] tcpdump DEBUG 10:54:27.783353 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46

2015-10-29 10:54:27.818000 [FDB30] ovs-fdb DEBUG 62 0 fa:16:3e:50:a3:46 0 # port is associated with VLAN 0 but should be on VLAN 1: it is still untagged, and the entry is not propagated to other hosts because VLAN 0 is invalid in the OVS implementation

2015-10-29 10:54:28.037000 [TAP30] tcpdump DEBUG 10:54:28.033259 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46

2015-10-29 10:54:28.387000 [TAP30] tcpdump DEBUG 10:54:28.383211 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46

2015-10-29 10:54:28.969000 [VMLIFE29] 29022 INFO nova.compute.manager [-] [instance: a18a5824-4215-4e24-bcfc-cb9f89f6bcbd] VM Stopped (Lifecycle Event)

2015-10-29 10:54:29.803000 [OVS30] 21310 DEBUG neutron.agent.linux.utils [req-a33468a6-f259-4324-a132-ab0dd025eeec None]
                                        Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'set', 'Port', 'qvo2e6d0f35-cb', 'tag=1'] # wiring is now ready, and after this neutron-openvswitch-agent will notify neutron-server which could notify nova about readiness...
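
To put a number on the race, the timestamps in the excerpt above can be compared directly. The sketch below copies the merged-log timestamps by hand and assumes, as the annotation on the last entry does, that the `tag=1` command marks the moment the wiring is ready; the variable names are illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def ts(text):
    """Parse a merged-log timestamp."""
    return datetime.strptime(text, FMT)


# Timestamps copied from the merged log excerpt.
vm_resumed = ts("2015-10-29 10:54:27.592000")   # [VMLIFE30] VM Resumed
port_tagged = ts("2015-10-29 10:54:29.803000")  # [OVS30] ovs-vsctl ... tag=1
rarp_times = [
    ts("2015-10-29 10:54:27.636000"),
    ts("2015-10-29 10:54:27.787000"),
    ts("2015-10-29 10:54:28.037000"),
    ts("2015-10-29 10:54:28.387000"),
]

# Time the guest ran on host 30 before its port was tagged.
gap_seconds = (port_tagged - vm_resumed).total_seconds()

# RARPs emitted while the port was still untagged.
lost_rarps = [t for t in rarp_times if t < port_tagged]

print(f"resume-to-wiring gap: {gap_seconds:.3f} s")
print(f"RARPs sent before wiring: {len(lost_rarps)} of {len(rarp_times)}")
```

For this excerpt the gap is 2.211 seconds, and all four RARPs were sent before the port was tagged.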

An Ansible script that reproduces the problem is available here:

https://github.com/mangelajo/oslogmerger/blob/master/contrib/debug-live-migration/debug-live-migration.yaml

The complete merged output produced with oslogmerger can be found here:
https://raw.githubusercontent.com/mangelajo/oslogmerger/master/contrib/debug-live-migration/logs/mergedlogs-packets-ovs.log

Changed in nova:
status: New → Confirmed
Michael Still (mikal)
tags: added: live-migration
Changed in nova:
importance: Undecided → Low
Changed in nova:
assignee: nobody → Mohammed Ashraf (mohammed-asharaf)
status: Confirmed → In Progress
Changed in nova:
assignee: Mohammed Ashraf (mohammed-asharaf) → nobody
Changed in nova:
status: In Progress → Confirmed