Live migrate of iscsi-backed VM loses internal network connectivity
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Opinion | Undecided | Unassigned |
neutron | New | Undecided | Unassigned |
Bug Description
Description
===========
Note that this may be a Neutron issue, but since it is happening during live migration, I wanted to point it out to the Nova group first, and let them decide whether to include the Neutron group on this ticket.
Also note that this may not be related to iSCSI at all - I just don't have access to Ceph-backed VMs at the moment to test.
Live migration of a VM that boots from an iSCSI-backed volume (no other disks attached) completes correctly: the volume is migrated, and DVR router functionality with floating IPs continues to work. However, internal network connectivity is lost (pings between VMs on the same Neutron network fail).
After live migrating the "bad" VM back to the original host, internal networking works again!
NOTE - this seems to be reproducible only if you deploy the VMs, do "not" ping between the VMs, migrate one of the VMs, and "then" ping between the VMs. The ping fails in this case. If pings are performed "prior" to migration, the pings succeed!
So, it appears that something in Neutron isn't being migrated.
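One way to narrow down what Neutron state is missing would be to inspect the agents and OVS bridges on the destination hypervisor. The following is a diagnostic sketch, not something from the original report; the host name `compute003` comes from the reproduction steps below, and the bridge names `br-int`/`br-tun` are the default Neutron OVS bridge names, which may differ in other deployments.

```shell
# Run on/against the destination hypervisor after the failed migration.

# Confirm the OVS and L3 agents on the destination host are alive:
openstack network agent list --host compute003

# Check that the migrated instance's tap device was plugged into br-int:
sudo ovs-vsctl show

# Dump the tunnel-bridge flows; if something wasn't migrated, the
# forwarding entries for the migrated VM's MAC may be missing or stale:
sudo ovs-ofctl dump-flows br-tun
```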
I had tested this configuration back in the Liberty days and ran into the same issue, and thought it was possibly a bug that was fixed by now, but it looks like the problem still exists.
Note that I'm still looking at logs to determine whether there is good evidence for why/when this happens, but wanted to get a bug report placed in case it was a known issue.
Steps to reproduce
==================
Deploy 2 VMs with an internal network, each with floating IPs, with security groups that are not very restrictive (allow everything including pings between VMs and the Internet).
In our case, the two VMs were deployed on separate physical hosts.
If VM #2 resides on physical host compute002 after deployment, live migrate this VM to physical host compute003 with:
openstack server migrate --live compute003 d3d45afb-
From VM #2, ping VM #1. There is no ping response.
If you perform all of the above, but ping between the VMs "prior" to migration, pings work fine after migrations (hiding the issue).
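The steps above can be condensed into a shell sketch. The image, flavor, and network names (`cirros`, `m1.small`, `int-net`, `ext-net`) and `<vm2-uuid>` are placeholders, not values from the original report:

```shell
# Deploy two VMs on the same internal network (on separate physical hosts):
openstack server create --image cirros --flavor m1.small --network int-net vm1
openstack server create --image cirros --flavor m1.small --network int-net vm2
openstack floating ip create ext-net   # create and associate one per VM

# Do NOT ping between the VMs yet - doing so hides the issue.

# Live migrate VM #2 from compute002 to compute003:
openstack server migrate --live compute003 <vm2-uuid>

# From VM #2, ping VM #1's internal address: the ping gets no response.
```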
Expected result
===============
Network should function correctly after a migration - pings should work, for example, between VMs.
Actual result
=============
Testing with VM-to-VM pings: pings are lost and connectivity "never" resumes. I deployed the 2 VMs, migrated one of them, started a ping from one VM to the other, and waited over 16 minutes; the pings were still failing.
Perform a live migrate of VM #2 back to the original host using:
openstack server migrate --live compute002 d3d45afb-
and pings start to work again.
Perform a live migrate of VM #2 to the same host as VM #1 and pings between VMs "also" work!
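To confirm placement before and after each migration, the instance's host can be read from the admin-only `OS-EXT-SRV-ATTR:host` attribute; `<vm2-uuid>` is a placeholder:

```shell
# Check which hypervisor VM #2 is currently on (admin credentials required):
openstack server show -f value -c OS-EXT-SRV-ATTR:host <vm2-uuid>

# Migrate VM #2 back to its original host; pings then start working again:
openstack server migrate --live compute002 <vm2-uuid>
```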
Environment
===========
stable/rocky deployment with Kolla-Ansible 7.0.0.0rc3devXX (the latest as of October 15th, 2018) and Kolla 7.0.0.0rc3devXX
CentOS 7.5 with latest updates as of October 15, 2018.
Kernel: Linux 4.18.14-
Hypervisor: KVM
Storage: Blockbridge (unsupported, but functions the same as other iSCSI based backends)
Networking: DVR with Open vSwitch
tags added: live-migration neutron
Shortly after submitting this, I ran into the same situation even though I "had" initially started a ping between the VMs. So, the problem is more severe than described above. I will provide logs shortly, after I have some time to review them.