Containers: VM unreachable for minutes after live migration or VM reboot

Bug #1818118 reported by Yang Liu on 2019-02-28
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
The VM cannot be reached for minutes after a live migration or reboot.

Severity
--------
Major

Steps to Reproduce
------------------
1. Launch a VM and verify it is reachable from an external host
2. Live migrate the VM
3. Soft reboot or cold migrate the VM (see the CLI sketch below)
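
For reference, a minimal CLI sketch of these steps (assuming admin credentials are sourced; the image, flavor, network, and VM_IP below are placeholders for the lab's actual values):

# 1. Launch a VM and confirm it is reachable externally
nova boot --image cirros --flavor m1.small --nic net-name=tenant-net test-vm
ping -c 3 "$VM_IP"
# 2. Live migrate the VM
nova live-migration test-vm
# 3. Soft reboot (or cold migrate) the VM
nova reboot test-vm          # or: nova migrate --poll test-vm
ping -c 3 "$VM_IP"           # should recover within seconds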

Expected Behavior
------------------
2. The networking outage for the VM during live migration should be very short, a few seconds at most.
3. The VM should be reachable shortly after the reboot completes.

Actual Behavior
----------------
2. The VM is unreachable for a few minutes.
3. After the reboot completes, it can take more than 5 minutes for the VM to get an IP again (see the timing sketch below). Once the VM has an IP, it becomes reachable fairly quickly.
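
A simple way to quantify the outage (a sketch; VM_IP is a placeholder for the VM's external IP):

# Time how long the VM stays unreachable after the operation completes
start=$(date +%s)
until ping -c 1 -W 1 "$VM_IP" >/dev/null 2>&1; do sleep 1; done
echo "VM reachable again after $(( $(date +%s) - start )) seconds"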

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
f/stein as of 2019-02-16

Timestamp/Logs
--------------
# Cold migrate:
[2019-02-27 21:34:23,886] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne migrate --poll 9ca94fb1-e2c6-4cc0-9425-0939d2985691'
[2019-02-27 21:35:08,159] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne resize-confirm 9ca94fb1-e2c6-4cc0-9425-0939d2985691'

# Live migrate:
[2019-02-27 21:41:53,883] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration f3f604f4-63fd-4107-b6b9-b5a5d237866e'

Ghada Khalil (gkhalil) wrote :

Marking as release gating; possibly a neutron or vswitch issue. Needs further investigation.
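
A few checks that may help narrow this down after a migration (a sketch only; <vm-uuid> and <target-compute> are placeholders, and an OVS-backed vswitch is assumed):

# Is the neutron port ACTIVE and bound to the new host?
openstack port list --server <vm-uuid> --long
# Are the neutron agents on the target compute alive?
openstack network agent list --host <target-compute>
# Does the vswitch on the target compute have the expected ports and flows?
ovs-vsctl show
ovs-ofctl dump-flows br-int | head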

Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
importance: Undecided → High
status: New → Triaged
tags: added: stx.2019.05 stx.containers stx.networking
Ghada Khalil (gkhalil) wrote :

Currently, this issue seems to be reproducible on only one system.
The recipe to reproduce is a VM live migration followed by a reboot.

Ghada Khalil (gkhalil) wrote :

Reducing the priority to medium since this issue has only been reported on one system.

Changed in starlingx:
importance: High → Medium
Ken Young (kenyis) on 2019-04-05
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil) on 2019-04-09
tags: added: stx.retestneeded
Joseph Richard (josephrichard) wrote :

When was this behaviour last observed?

Frank Miller (sensfan22) on 2019-07-16
tags: removed: stx.containers
Yang Liu (yliu12) wrote :

I'm seeing a similar issue in recent nova regressions, where the VM becomes unreachable after live migration.
It was last seen on dedicated storage system wcp113-121 after live migrating an Ubuntu guest on load 20190720T013000Z.

Ghada Khalil (gkhalil) wrote :

@Yang, there are reports that live migration is not working:
https://bugs.launchpad.net/starlingx/+bug/1837759
https://bugs.launchpad.net/starlingx/+bug/1830915

Are you not hitting these issues in your testing? Does the VM actually successfully live migrate to another host?

Yang Liu (yliu12) wrote :

I did not see consistent live-migration failures in the latest nova regression last weekend, which performed 25+ live migrations on various guests on standard and storage systems with local_image and remote instance backends.

However, when I searched for results on the same system where 1837759 is seen (wcp92-98), it seems live migration has been failing and the VMs never moved to another host.
Also, for reference, the same system (NVMe with journal disk) suffers from 1837242, but when the live-migration failure happened in 1837759, the system seemed to be in a healthy state.
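
To separate this report from outright migration failures, it helps to confirm that the instance actually changed hosts (admin credentials assumed; <vm-uuid> is a placeholder):

# Hosting compute before and after the live migration
openstack server show <vm-uuid> -f value -c OS-EXT-SRV-ATTR:host
nova live-migration <vm-uuid>
openstack server show <vm-uuid> -f value -c OS-EXT-SRV-ATTR:host
# The migration record itself
nova migration-list --instance-uuid <vm-uuid>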

tags: added: stx.regression
Ghada Khalil (gkhalil) wrote :

Please attach the logs from the most recent reproduction of this issue.

Ghada Khalil (gkhalil) wrote :

It would also be good to know, of the 25+ live migrations in the nova regression, how many hit this connectivity issue.

Yang Liu (yliu12) wrote :

Adding logs from the latest nova regression on 2+3 system wolfpass3-7 with the 20190727T013000Z load.
About 9 out of 60 live migrations hit ping failures.
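
The regression is roughly equivalent to the following loop (a simplified sketch; VM_UUID and VM_IP are placeholders):

fail=0
for i in $(seq 1 60); do
    nova live-migration "$VM_UUID"
    sleep 120                                 # allow the migration to settle
    ping -c 3 "$VM_IP" || fail=$((fail + 1))
done
echo "$fail of 60 migrations hit ping failures"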

1st failure:
[2019-07-28 01:36:59,804] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration 0b9a3e2a-a3f7-49f0-9fc4-411bd0da8f66'

[2019-07-28 01:40:43,773] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.85.67'
PING 192.168.85.67 (192.168.85.67) 56(84) bytes of data.

--- 192.168.85.67 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2015ms

last failure:
[2019-07-28 17:57:00,609] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration 977d71b8-dff6-4dd5-a6b1-f75ec7ae1ddf'

[2019-07-28 18:00:32,098] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.85.74'
PING 192.168.85.74 (192.168.85.74) 56(84) bytes of data.

--- 192.168.85.74 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000ms

Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Ghada Khalil (gkhalil) wrote :

Requesting a re-test of this issue with OpenStack Train.

Changed in starlingx:
status: Triaged → Incomplete
Peng Peng (ppeng) wrote :

The issue was not reproduced in 35 runs on:
Lab: WCP_3_6
Load: 2019-12-01_20-00-00

Ghada Khalil (gkhalil) wrote :

Closing; issue is not reproducible

Changed in starlingx:
status: Incomplete → Invalid