Containers: vm unreachable for minutes after live migration or vm reboot

Bug #1818118 reported by Yang Liu
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
The VM cannot be reached for several minutes after a live migration or reboot.

Severity
--------
Major

Steps to Reproduce
------------------
1. Launch a VM and ensure it is reachable externally
2. Live migrate the VM
3. Soft reboot or cold migrate the VM

Expected Behavior
------------------
2. The networking outage for the VM during live migration should be very short, a few seconds at most.
3. The VM should be reachable shortly after the reboot completes.

Actual Behavior
----------------
2. The VM is unreachable for a few minutes.
3. After the reboot completes, it can take more than 5 minutes for the VM to get an IP address again. Once the VM has an IP address, it becomes reachable fairly quickly.
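
For anyone retesting, the outage length can be quantified instead of estimated. The following is a minimal Python sketch, not taken from the test framework used in this report. It assumes the tester already has a chronological list of (timestamp, reachable) probe results, for example from a ping loop started before the migration; the function name longest_outage is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest continuous unreachable interval, in seconds.

    samples: (timestamp_seconds, reachable) pairs in chronological order.
    An outage runs from the first failed probe to the next successful one.
    An outage still open at the end is measured up to the last sample.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    last_ts: Optional[float] = None

    for ts, reachable in samples:
        last_ts = ts
        if not reachable and outage_start is None:
            # First failed probe of a new outage.
            outage_start = ts
        elif reachable and outage_start is not None:
            # Connectivity is back: close the current outage.
            longest = max(longest, ts - outage_start)
            outage_start = None

    if outage_start is not None and last_ts is not None:
        # The VM never came back within the sampled window.
        longest = max(longest, last_ts - outage_start)
    return longest


if __name__ == "__main__":
    # Hypothetical probe results: unreachable from t=1 until t=5.
    probes = [(0, True), (1, False), (2, False), (5, True), (6, True)]
    print(longest_outage(probes))
```

The result can then be compared against the expected behavior above (a few seconds at most for live migration).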

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
f/stein as of 2019-02-16

Timestamp/Logs
--------------
# Cold migrate:
[2019-02-27 21:34:23,886] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne migrate --poll 9ca94fb1-e2c6-4cc0-9425-0939d2985691'
[2019-02-27 21:35:08,159] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne resize-confirm 9ca94fb1-e2c6-4cc0-9425-0939d2985691'

# Live migrate:
[2019-02-27 21:41:53,883] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration f3f604f4-63fd-4107-b6b9-b5a5d237866e'

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; possibly a neutron or vswitch issue. Needs further investigation.

Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
importance: Undecided → High
status: New → Triaged
tags: added: stx.2019.05 stx.containers stx.networking
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Currently, this issue seems to be reproducible on only one system.
The recipe to reproduce is to do a VM live migration, followed by a reboot.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Reducing the priority to medium since this issue is reported on only one system.

Changed in starlingx:
importance: High → Medium
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Joseph Richard (josephrichard) wrote :

When was the last time this behaviour was observed?

Frank Miller (sensfan22)
tags: removed: stx.containers
Revision history for this message
Yang Liu (yliu12) wrote :

I'm seeing a similar issue in recent nova regressions, where a VM becomes unreachable after live migration.
It was last seen on dedicated storage system wcp113-121 after live migrating an Ubuntu guest on load 20190720T013000Z.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Yang, There are reports that live migration is not working:
https://bugs.launchpad.net/starlingx/+bug/1837759
https://bugs.launchpad.net/starlingx/+bug/1830915

Are you not hitting these issues in your testing? Does the VM actually successfully live migrate to another host?

Revision history for this message
Yang Liu (yliu12) wrote :

I did not see consistent live-migration failures in the latest nova regression last weekend, which performed 25+ live migrations on various guests on standard and storage systems with local_image and remote instance backends.

However, when I searched for results on the same system where 1837759 is seen (wcp92-98), it seems live migration has been failing and VMs never moved to another host.
Also, just for reference, the same system (NVMe with journal disk) suffers from 1837242, but when the live-migration failure happened in 1837759, the system seemed to be in a healthy state.

tags: added: stx.regression
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Please attach the logs from the most recent reproduction of this issue.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

It would also be good to know how many of the 25+ live migrations in the nova regression hit this connectivity issue.

Revision history for this message
Yang Liu (yliu12) wrote :

Adding logs from the latest nova regressions on 2+3 system wolfpass3-7 with the 20190727T013000Z load.
We have ping failures in about 9 out of 60 cases after live migrations.

1st failure:
[2019-07-28 01:36:59,804] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration 0b9a3e2a-a3f7-49f0-9fc4-411bd0da8f66'

[2019-07-28 01:40:43,773] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.85.67'
PING 192.168.85.67 (192.168.85.67) 56(84) bytes of data.

--- 192.168.85.67 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2015ms

Last failure:
[2019-07-28 17:57:00,609] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration 977d71b8-dff6-4dd5-a6b1-f75ec7ae1ddf'

[2019-07-28 18:00:32,098] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.85.74'
PING 192.168.85.74 (192.168.85.74) 56(84) bytes of data.

--- 192.168.85.74 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000ms
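
The ping summaries and log timestamps above can be checked mechanically. This is a minimal Python sketch under stated assumptions, not part of the framework that produced these logs: the function names are hypothetical, the timestamp format is assumed to be the bracketed "[YYYY-MM-DD HH:MM:SS,mmm]" prefix seen in the log lines, and the ping summary is assumed to follow the Linux ping format shown above.

```python
import re
from datetime import datetime

# Matches the summary line printed by Linux ping, e.g.
# "3 packets transmitted, 0 received, 100% packet loss, time 2015ms"
PING_SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received,"
    r".*?(\d+(?:\.\d+)?)% packet loss"
)

# Matches the bracketed timestamp at the start of each log line.
LOG_TS = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def packet_loss_percent(ping_output: str) -> float:
    """Extract the packet loss percentage from ping output."""
    match = PING_SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    return float(match.group(3))


def log_timestamp(line: str) -> datetime:
    """Parse the bracketed timestamp from a log line."""
    match = LOG_TS.search(line)
    if match is None:
        raise ValueError("no timestamp found in log line")
    return datetime.strptime(match.group(1), TS_FORMAT)


def seconds_between(first_line: str, second_line: str) -> float:
    """Elapsed seconds from the first log line to the second."""
    delta = log_timestamp(second_line) - log_timestamp(first_line)
    return delta.total_seconds()


if __name__ == "__main__":
    migrate = "[2019-07-28 01:36:59,804] 301 DEBUG MainThread ssh.send :: Send 'nova ... live-migration ...'"
    ping = "[2019-07-28 01:40:43,773] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.85.67'"
    summary = "3 packets transmitted, 0 received, 100% packet loss, time 2015ms"
    print(seconds_between(migrate, ping))
    print(packet_loss_percent(summary))
```

Applied to the two excerpts above, the gap between the live-migration command and the failing ping shown is roughly 224 seconds for the first failure and roughly 211 seconds for the last one. The excerpts do not show when connectivity returned, so these are lower bounds on the outage rather than its full length.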

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Requesting a re-test of this issue with OpenStack Train.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was not reproduced in 35 runs on
Lab: WCP_3_6
Load: 2019-12-01_20-00-00

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing; issue is not reproducible

Changed in starlingx:
status: Incomplete → Invalid
Yang Liu (yliu12)
tags: removed: stx.retestneeded