Live migration of an iSCSI-backed VM loses internal network connectivity

Bug #1798690 reported by Eric Miller
Affects                    Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)   Opinion  Undecided   Unassigned
neutron                    New      Undecided   Unassigned

Bug Description

Description
===========

Note that this may be a Neutron issue, but since it is happening during live migration, I wanted to point it out to the Nova group first, and let them decide whether to include the Neutron group on this ticket.

Also note that this may not be related to iSCSI at all - I just don't have access to Ceph-backed VMs at the moment to test.

Live migration of a VM that boots from an iSCSI-backed volume (no other disks attached) completes correctly: the volume migrates, and DVR router functionality with floating IPs continues to work. However, internal network connectivity is lost afterwards (pings between VMs on the same Neutron network fail).

After live migrating the "bad" VM back to the original host, internal networking works again!

NOTE - this seems to be reproducible only if you deploy the VMs, do "not" ping between the VMs, migrate one of the VMs, and "then" ping between the VMs. The ping fails in this case. In the case where pings are performed "prior" to migration, the pings succeed!

So, it appears that something in Neutron isn't being migrated.
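
For reference, a first sanity check would be whether the Neutron agents on the target host are still reported as alive after the migration; a command along these lines should work (the host name is the one used in the steps below):
openstack network agent list --host compute003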

I had tested this configuration back in the Liberty days and ran into the same issue; I thought it was possibly a bug that had been fixed by now, but it looks like the problem still exists.

Note that I'm still looking at logs to determine whether there is good evidence for why/when this happens, but I wanted to get a bug report filed in case this is a known issue.

Steps to reproduce
==================

Deploy 2 VMs on an internal network, each with a floating IP, using security groups that are not very restrictive (allow everything, including pings between the VMs and to the Internet).

In our case, the two VMs were deployed on separate physical hosts.

If VM #2 resides on physical host compute002 after deployment, live migrate this VM to physical host compute003 with:
openstack server migrate --live compute003 d3d45afb-e913-4cb7-89df-a1c1d51d6339
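
To confirm which compute host each VM is on before and after the migration, a command along these lines should work (admin credentials required; the UUID is the one used above):
openstack server show d3d45afb-e913-4cb7-89df-a1c1d51d6339 -c status -c OS-EXT-SRV-ATTR:host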

From VM #2, ping VM #1. There is no ping response.

If you perform all of the above but ping between the VMs "prior" to migration, pings work fine after the migration (hiding the issue).

Expected result
===============

Networking should function correctly after a migration; pings should work, for example, between VMs.

Actual result
=============

Testing with VM-to-VM pings: pings are lost and connectivity "never" resumes. I deployed the 2 VMs, migrated one of them, started a ping from one VM to the other, waited more than 16 minutes, and the pings were still failing.

Perform a live migrate of VM #2 back to the original host using:
openstack server migrate --live compute002 d3d45afb-e913-4cb7-89df-a1c1d51d6339

and pings start to work again.

Perform a live migrate of VM #2 to the same host as VM #1 and pings between VMs "also" work!

Environment
===========

stable/rocky deployment with Kolla-Ansible 7.0.0.0rc3devXX (the latest as of October 15th, 2018) and Kolla 7.0.0.0rc3devXX

CentOS 7.5 with latest updates as of October 15, 2018.

Kernel: Linux 4.18.14-1.el7.elrepo.x86_64

Hypervisor: KVM

Storage: Blockbridge (unsupported, but functions the same as other iSCSI-based backends)

Networking: DVR with Open vSwitch

Revision history for this message
Eric Miller (erickmiller) wrote :

Shortly after submitting this, I ran into the same situation even though I had initially started a ping between the VMs, so the problem is more severe than described above. I will provide logs shortly, after I have had some time to review them.

Revision history for this message
Eric Miller (erickmiller) wrote :

Looking at the OVS configuration (with ovs-vsctl show), I can see that the flows configured on the source host still exist and that there are "no" flows configured on the target host.

So, this is definitely a case where Neutron is not creating the DVR configuration on the target host, much less removing it from the source host.
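
For reference, a sketch of the commands that can be used for this comparison, run on both the source and the target compute host; the openvswitch_vswitchd container name and the br-int/br-tun bridge names are assumptions based on a default Kolla-Ansible OVS/DVR deployment:
docker exec openvswitch_vswitchd ovs-vsctl show
docker exec openvswitch_vswitchd ovs-ofctl dump-flows br-int
docker exec openvswitch_vswitchd ovs-ofctl dump-flows br-tun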

tags: added: live-migration neutron
Revision history for this message
Eric Miller (erickmiller) wrote :

Just a quick note that I have confirmed that this issue happens on Ceph-backed volumes as well, so this is not iSCSI backend related.

Also, I am now faced with an issue where, after migration, the script that destroys my test environment has left the boot volume attached to a deleted VM. The server delete succeeded, but apparently it did not detach the volume. "openstack volume list" shows the volume as "in-use" and "Attached to 2ca87594-f5b5-438e-a81c-5a87c7b4a917 on /dev/vda", even though this VM no longer exists.

So, maybe there is also a volume-detach issue after live migration to a host that is not the VM's original power-on host? The volume is not attached to any compute node, at least as far as I can tell. I ran "sudo rbd showmapped" in the nova_libvirt container on all hosts and it returned nothing. Is this the correct procedure? The volume still exists in Ceph ("sudo rbd list --pool volumes" shows the one volume).
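
A sketch of the checks described above; the volume ID is a placeholder, and the nova_libvirt container name is the one mentioned in this comment:
openstack volume show <volume-id> -c status -c attachments
docker exec nova_libvirt rbd showmapped            # run on each compute host
docker exec nova_libvirt rbd list --pool volumes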

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

I ran a few tests, trying to reproduce the problem - see below.

I tried 2 flavors, thinking that a VM with few vCPUs might behave differently, since all of my previous tests were with large VMs. But this doesn't appear to have any effect, since Tests #2 and #3 below both show the issue.

Note that Test #1 succeeded, behaving exactly as I would expect.

Note that in Test #3, the floating IP stopped responding to pings for over 200 seconds. I didn't have a timer for this, so it could have been 300 seconds, or however long typical ARP entries take to expire. Maybe a gratuitous ARP is getting missed somewhere? Since internal traffic is tunneled, though, that seems unlikely.
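
One way to verify whether a gratuitous ARP is actually sent during the migration would be a capture on the external interface of the destination compute host (DVR handles floating IPs locally); the interface name and floating IP below are placeholders:
tcpdump -n -e -i <external-interface> arp and host <floating-ip>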

Eric

VM flavors:
  c5.xlarge = 4 vCPUs and 8GiB of RAM
  c5.8xlarge = 32 vCPUs and 64GiB of RAM
VM operating system: CentOS 7.5 (1808)
Bash scripts are used for creation and destruction of the environment (a sketch of the migrate-and-ping loop is shown after Test #1 below)

Test #1 (on Ceph storage) - using c5.xlarge
-------------------------------------------
Create 2 volumes from the CentOS 7.5 (1808) image marked as bootable
Deploy 2 VMs in a domain in a project with the above volumes, respectively
Create an internal network and router and attach router to internal and external network
Assign security groups and a floating IP to both VMs, respectively
Power on 2 VMs
Verify that both VMs are on different physical hosts
Login to VM #1 using floating IP
From VM #1, ping VM #2 on internal network continuously
Ping succeeds
Live migrate VM #2 to a 3rd host
Ping succeeds
Live migrate VM #2 back to original host
Ping succeeds
Live migrate VM #2 back to the 3rd host
Ping succeeds
Ping VM #1 floating IP from outside host
Live migrate VM #2 back to original host
Ping succeeds
Destroy test environment
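
For reference, a minimal sketch of how the migrate-and-ping sequence in these tests could be automated; the VM UUID is the one from this report, while the host names, the floating IP of VM #2, and the internal IP of the peer VM are placeholders:

VM2=d3d45afb-e913-4cb7-89df-a1c1d51d6339   # VM #2
VM1_INT_IP=<vm1-internal-ip>               # placeholder
VM2_FIP=<vm2-floating-ip>                  # placeholder

for TARGET in compute003 compute002 compute003; do
    openstack server migrate --live "$TARGET" "$VM2"
    sleep 10   # crude: give the migration time to start
    # wait for the migration to finish
    until [ "$(openstack server show "$VM2" -f value -c status)" = "ACTIVE" ]; do
        sleep 5
    done
    # ping the peer VM over the internal network from inside the migrated VM
    if ssh centos@"$VM2_FIP" "ping -c 5 -W 2 $VM1_INT_IP" >/dev/null; then
        echo "internal ping OK after migration to $TARGET"
    else
        echo "internal ping FAILED after migration to $TARGET"
    fi
done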

Test #2 (on Ceph storage) - using c5.8xlarge
--------------------------------------------
Create 2 volumes from the CentOS 7.5 (1808) image marked as bootable
Deploy 2 VMs in a domain in a project with the above volumes, respectively
Create an internal network and router and attach router to internal and external network
Assign security groups and a floating IP to both VMs, respectively
Power on 2 VMs
Verify that both VMs are on different physical hosts
Ping VM #2 floating IP from outside host
Login to VM #1 using floating IP
From VM #1, ping VM #2 on internal network continuously
Live migrate VM #2 to a 3rd host
ERROR: internal ping fails, floating IP ping succeeds
Live migrate VM #2 back to original host
Internal ping resumes, floating IP ping resumes
Live migrate VM #2 to the 3rd host
ERROR: internal ping fails, floating IP ping succeeds
Live migrate VM #2 back to original host
Internal ping resumes, floating IP ping resumes
Destroy test environment

Test #3 (on Ceph storage) - using c5.xlarge (again)
---------------------------------------------------
Create 2 volumes from the CentOS 7.5 (1808) image marked as bootable
Deploy 2 VMs in a domain in a project with the above volumes, respectively
Create an internal network and router and attach router to internal and external network
Assign security groups and a floating IP to both VMs, respectively
Power on 2 VMs
Verify that both VMs are on different physical hosts
Ping VM #2 floating IP from outside host
Login to VM #1 using ...


Revision history for this message
Eric Miller (erickmiller) wrote :

Added to Neutron group. This is easily reproducible, but if more information is needed, please let me know.

Thanks!

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

Just checking in, and also wondering whether anyone else has had success with live migrations. I'm trying to narrow down whether we have an environment issue here that might not affect anyone else.

Thanks!

Eric

Revision history for this message
sean mooney (sean-k-mooney) wrote :

I have not looked into this closely, but my guess is that this could be related to the ARP suppression rules used in the DVR case not being updated correctly. That said, it is just a guess, so there may be something more going on here.

Eric, can you try doing a hard reboot of the migrated instance and see if that corrects the connectivity to the internal network IPs?

It would also be helpful to know whether you are using the iptables firewall or the openvswitch firewall, and whether, following the migration, the port status and port admin status are active/up.

I may not have time to help further, but I'll try to check in on this bug again in a few days. From a Nova perspective I don't think this is a Nova bug, or an os-vif one for that matter, but I have been investigating live-migration-related issues this cycle and this is yet another edge case that appears to need fixing.
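
(For reference, the checks asked for above could be run roughly as follows; the port UUID is a placeholder, and the config directory is an assumption based on a typical Kolla-Ansible compute host:)
openstack server reboot --hard d3d45afb-e913-4cb7-89df-a1c1d51d6339
openstack port show <port-uuid> -c status -c admin_state_up
grep -r firewall_driver /etc/kolla/neutron-openvswitch-agent/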

Changed in nova:
status: New → Opinion
Revision history for this message
Eric Miller (erickmiller) wrote :

Thanks Sean! I will have to look at the hosts again, but it appears the OVS configuration does not get migrated from the source to the target host; instead, it remains on the source host. When I migrated the VM "back" to the source, internal network connectivity started working again, and looking at the OVS rules, the flows still existed in the vswitch on the source host.

I will try the hard reboot and see if it does anything.

Kolla-Ansible, by default, installs OVS, but I'm not positive whether OVS is used for security groups. I seem to remember looking at this a while back and believe it configures Neutron to use iptables. Is there anything specific in the configuration that I can check for you?

I will check the port status and admin status, but I'm almost positive the port (in the guest) is "up up". I have seen floating IPs work while internal IPs did "not", and both use the same port, which is why I say this; I will confirm exactly in my next round of tests.

I have submitted another bug about a timing issue during cold migration - see here:
https://bugs.launchpad.net/nova/+bug/1799309

So, maybe there is a timing issue during live migration too, one that produces no error but fails to migrate the OVS flows?

Eric
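
For reference, one way to look for such a timing issue would be to grep the OVS agent log on the target host for the VM's port UUID around the migration window; the port UUID is a placeholder and the log path is an assumption based on a default Kolla-Ansible layout:
grep <port-uuid> /var/log/kolla/neutron/neutron-openvswitch-agent.log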

Revision history for this message
ignazio (cassano) wrote :

Hello, I have the same issue migrating from Queens to Rocky.
It works fine on Queens.
It does not work on Rocky.
It does not work on Stein.
Ignazio
