Containers: VM unreachable for minutes after live migration or VM reboot

Bug #1818118 reported by Yang Liu on 2019-02-28
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
The VM cannot be reached for minutes after a live migration or reboot.

Severity
--------
Major

Steps to Reproduce
------------------
1. Launch a VM and verify it is reachable from an external host
2. Live migrate the VM
3. Soft reboot or cold migrate the VM (see the CLI sketch below)
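
For reference, a minimal CLI sketch of these steps (assuming admin credentials are sourced; the image, flavor, network, and VM_IP below are placeholders for the lab's actual values):

# 1. Launch a VM and confirm it is reachable externally
nova boot --image cirros --flavor m1.small --nic net-name=tenant-net test-vm
ping -c 3 "$VM_IP"
# 2. Live migrate the VM
nova live-migration test-vm
# 3. Soft reboot (or cold migrate) the VM
nova reboot test-vm          # or: nova migrate --poll test-vm
ping -c 3 "$VM_IP"           # should recover within seconds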

Expected Behavior
------------------
2. The networking outage for the VM during live migration should be very short, a few seconds at most.
3. The VM should be reachable shortly after the reboot completes.

Actual Behavior
----------------
2. The VM is unreachable for a few minutes.
3. After the reboot completes, it can take more than 5 minutes for the VM to get an IP again (see the timing sketch below). Once the VM has an IP, it becomes reachable fairly quickly.
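
A simple way to quantify the outage (a sketch; VM_IP is a placeholder for the VM's external IP):

# Time how long the VM stays unreachable after the operation completes
start=$(date +%s)
until ping -c 1 -W 1 "$VM_IP" >/dev/null 2>&1; do sleep 1; done
echo "VM reachable again after $(( $(date +%s) - start )) seconds"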

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
f/stein as of 2019-02-16

Timestamp/Logs
--------------
# Cold migrate:
[2019-02-27 21:34:23,886] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne migrate --poll 9ca94fb1-e2c6-4cc0-9425-0939d2985691'
[2019-02-27 21:35:08,159] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne resize-confirm 9ca94fb1-e2c6-4cc0-9425-0939d2985691'

# Live migrate:
[2019-02-27 21:41:53,883] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration f3f604f4-63fd-4107-b6b9-b5a5d237866e'

Ghada Khalil (gkhalil) wrote :

Marking as release gating; possibly a neutron or vswitch issue. Needs further investigation.
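
A few checks that may help narrow this down after a migration (a sketch only; <vm-uuid> and <target-compute> are placeholders, and an OVS-backed vswitch is assumed):

# Is the neutron port ACTIVE and bound to the new host?
openstack port list --server <vm-uuid> --long
# Are the neutron agents on the target compute alive?
openstack network agent list --host <target-compute>
# Does the vswitch on the target compute have the expected ports and flows?
ovs-vsctl show
ovs-ofctl dump-flows br-int | head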

Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
importance: Undecided → High
status: New → Triaged
tags: added: stx.2019.05 stx.containers stx.networking
Ghada Khalil (gkhalil) wrote :

Currently, this issue seems to be reproducible on only one system.
The recipe to reproduce is a VM live migration followed by a reboot.

Ghada Khalil (gkhalil) wrote :

Reducing the priority to medium since this issue has only been reported on one system.

Changed in starlingx:
importance: High → Medium
Ken Young (kenyis) on 2019-04-05
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil) on 2019-04-09
tags: added: stx.retestneeded
Joseph Richard (josephrichard) wrote :

When was this behaviour last observed?

Frank Miller (sensfan22) on 2019-07-16
tags: removed: stx.containers
Yang Liu (yliu12) wrote :

I'm seeing a similar issue in recent nova regressions, where the VM becomes unreachable after live migration.
It was last seen on dedicated storage system wcp113-121 after live migrating an Ubuntu guest on load 20190720T013000Z.

Ghada Khalil (gkhalil) wrote :

@Yang, there are reports that live migration is not working:
https://bugs.launchpad.net/starlingx/+bug/1837759
https://bugs.launchpad.net/starlingx/+bug/1830915

Are you not hitting these issues in your testing? Does the VM actually successfully live migrate to another host?

Yang Liu (yliu12) wrote :

I did not see consistent live-migration failures in the latest nova regression last weekend, which performed 25+ live migrations on various guests on standard and storage systems with local_image and remote instance backends.

However, when I searched for results on the same system where 1837759 is seen (wcp92-98), it seems live migration has been failing and the VMs never moved to another host.
Also, for reference, the same system (NVMe with journal disk) suffers from 1837242, but when the live-migration failure happened in 1837759, the system seemed to be in a healthy state.
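
To separate this report from outright migration failures, it helps to confirm that the instance actually changed hosts (admin credentials assumed; <vm-uuid> is a placeholder):

# Hosting compute before and after the live migration
openstack server show <vm-uuid> -f value -c OS-EXT-SRV-ATTR:host
nova live-migration <vm-uuid>
openstack server show <vm-uuid> -f value -c OS-EXT-SRV-ATTR:host
# The migration record itself
nova migration-list --instance-uuid <vm-uuid>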

tags: added: stx.regression
Ghada Khalil (gkhalil) wrote :

Please attach the logs from the most recent reproduction of this issue.

Ghada Khalil (gkhalil) wrote :

It would also be good to know, of the 25+ live migrations in the nova regression, how many hit this connectivity issue.

Yang Liu (yliu12) wrote :

Adding logs from the latest nova regression on 2+3 system wolfpass3-7 with the 20190727T013000Z load.
About 9 out of 60 live migrations hit ping failures.
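
The regression is roughly equivalent to the following loop (a simplified sketch; VM_UUID and VM_IP are placeholders):

fail=0
for i in $(seq 1 60); do
    nova live-migration "$VM_UUID"
    sleep 120                                 # allow the migration to settle
    ping -c 3 "$VM_IP" || fail=$((fail + 1))
done
echo "$fail of 60 migrations hit ping failures"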

1st failure:
[2019-07-28 01:36:59,804] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration 0b9a3e2a-a3f7-49f0-9fc4-411bd0da8f66'

[2019-07-28 01:40:43,773] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.85.67'
PING 192.168.85.67 (192.168.85.67) 56(84) bytes of data.

--- 192.168.85.67 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2015ms

last failure:
[2019-07-28 17:57:00,609] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration 977d71b8-dff6-4dd5-a6b1-f75ec7ae1ddf'

[2019-07-28 18:00:32,098] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.85.74'
PING 192.168.85.74 (192.168.85.74) 56(84) bytes of data.

--- 192.168.85.74 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000ms

Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Ghada Khalil (gkhalil) wrote :

Requesting a re-test of this issue with OpenStack Train.

Changed in starlingx:
status: Triaged → Incomplete
Peng Peng (ppeng) wrote :

The issue was not reproduced in 35 runs on:
Lab: WCP_3_6
Load: 2019-12-01_20-00-00

Ghada Khalil (gkhalil) wrote :

Closing; issue is not reproducible

Changed in starlingx:
status: Incomplete → Invalid