VMs inaccessible after live migration on certain Arista VXLAN Flood and Learn fabrics
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Undecided | Unassigned |
Bug Description
Description
===========
This is not a Nova bug per se, but rather an issue with Arista and potentially other network fabrics.
I have observed a case where VMs are unreachable over the network after live migrating on certain fabrics, in this case Arista VXLAN, despite the hypervisor sending out a number of GARP (gratuitous ARP) packets following the live migration.
This was observed on an Arista VXLAN fabric when live migrating a VM between hypervisors on two different switches. A live migration between two hypervisors on the same switch is not affected.
In both cases, I can see GARPs on the wire triggered by a VM being live migrated; these packets have been observed from other hypervisors and even from other VMs in the same VLAN on different hypervisors.
The VM becomes accessible again after a period of time, at the point where the switch's ARP/MAC aging timer expires and the MAC is re-learnt on the correct switch.
This occurs on any VM - even a simple c1.m1 with no active workload, backed by Ceph storage.
Steps to Reproduce
===========
To try to prevent this from happening, I have tested the "libvirt: Add announce-self post live-migration" workaround patch [0] - despite this, the issue was still observed.
Create VM: c1.m1 or similar, CentOS 7 or CentOS 8, Ceph storage, no active or significant load on the VM.
Run:
`ping VM_IP | while read ping; do echo "$(date): $ping"; done`
Then:
`openstack server migrate --live TARGET_HOST VM_INSTANCE`
Expected result
===============
VM live migrates and is accessible within a reasonable (<10 second) timeframe
Actual result
=============
VM live migrates successfully, but ping fails until the switch ARP/MAC aging timer expires (60-180 seconds in our environment)
Despite efforts from us and our network team, we have been unable to determine why the VM is unreachable. What we have noticed is that sending a further number of announce_self commands to the QEMU monitor, triggering more GARPs, gets the VM into an accessible state in an acceptable time of <5 seconds.
Environment
=============
Arista EOS 4.26M VXLAN fabric
OpenStack Nova Train, Ussuri, Victoria (with and without the announce-self patch [0])
Ceph Nautilus
OpenStack provider networking, using VLANs
Patch/Workaround
=============
I have prepared a follow-up workaround patch, building on the announce-self patch, which we have been running in our production deployment.
This patch adds two configurable options and the associated code:
`enable_
`enable_
My tests of nearly 5000 live migrations show that the optimal settings in our environment are 3 additional calls to qemu_announce_self with a 1 second delay - this gets our VMs accessible in 2 or 3 seconds in the vast majority of cases, and 99% within 5 seconds after they stop responding to ping (the point at which we determine they are inaccessible).
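The shape of the workaround's post-migration loop can be sketched as follows. This is a minimal illustration, not the actual patch: `send_announce_self` stands in for whatever mechanism actually talks to the QEMU monitor (e.g. via libvirt), and the parameter names `announce_count` and `announce_pause` are illustrative, not the option names the patch adds.

```python
import time


def post_migration_announce(send_announce_self,
                            announce_count=3,
                            announce_pause=1.0):
    """Send extra announce_self commands after a live migration completes.

    send_announce_self: callable that issues one announce_self to the
        QEMU monitor, causing the guest NICs to emit a GARP burst so the
        fabric re-learns the MAC on the destination switch.
    announce_count: number of additional announcements to send.
    announce_pause: seconds to wait between announcements.
    Returns the number of announcements sent.
    """
    sent = 0
    for _ in range(announce_count):
        send_announce_self()
        sent += 1
        time.sleep(announce_pause)
    return sent
```

With the settings described above (3 calls, 1 second apart), this adds roughly 3 seconds of announcements after migration, which in our testing was enough for the fabric to re-learn the MAC within 5 seconds in 99% of cases.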
I shall be submitting this patch for review by the Nova community in the next few days.
0: https:/
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/867324