VM remains on original host after live migration issued

Bug #1830915 reported by Peng Peng
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| StarlingX | Fix Released | Medium | zhipeng liu | |

Bug Description

Brief Description
-----------------
Two minutes after a live migration is issued, the VM is still on the original host.

Severity
--------
Major

Steps to Reproduce
------------------

TC-name:

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name: WP-1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-28_17-05-57

Last Pass
---------
2019-05-24_17-39-51

Timestamp/Logs
--------------
[2019-05-29 10:34:51,432] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show eecdb099-1e59-435c-91ed-16119c00851e'
[2019-05-29 10:34:53,933] 387 DEBUG MainThread ssh.expect :: Output:
+-------------------------------------+-----------------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | controller-0 |

[2019-05-29 10:34:54,037] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration eecdb099-1e59-435c-91ed-16119c00851e'

[2019-05-29 10:36:33,581] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show eecdb099-1e59-435c-91ed-16119c00851e'
[2019-05-29 10:36:35,701] 387 DEBUG MainThread ssh.expect :: Output:
+-------------------------------------+-----------------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | controller-0 |
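
The before/after host check shown in the log above can be sketched in Python. This is only a minimal illustration, not the test framework's actual code: `parse_field` and `host_changed` are hypothetical helper names, and the input is assumed to be the two-column table printed by `openstack server show`, as captured above.

```python
def parse_field(table_text, field):
    """Return the value of `field` from a two-column CLI table, or None."""
    for line in table_text.splitlines():
        line = line.strip()
        # Skip border rows such as "+-----+-----+" and any non-table text.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 2 and cells[0] == field:
            return cells[1]
    return None


def host_changed(before, after, field="OS-EXT-SRV-ATTR:host"):
    """True only if both outputs report a host and the two hosts differ."""
    old_host = parse_field(before, field)
    new_host = parse_field(after, field)
    return old_host is not None and new_host is not None and old_host != new_host
```

For the two outputs in this report, both calls to `parse_field` would yield `controller-0`, so `host_changed` would report that the VM did not move.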

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil)
summary: - live_migrate failed by VM host not switched
+ VM remains on original host after live migration issued
Changed in starlingx:
status: New → Incomplete
Ghada Khalil (gkhalil)
tags: added: stx.distro.openstack
Bruce Jones (brucej) wrote :

Check to see if there is a rollback log in the Nova conductor - which would be present in a failure case if a migration could not complete.

Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → yong hu (yhu6)
Changed in starlingx:
assignee: yong hu (yhu6) → Lin Shuicheng (shuicheng)
Ghada Khalil (gkhalil) wrote :

Marking as release gating; live migration is a key operation for starlingx

tags: added: stx.2.0
Changed in starlingx:
status: Incomplete → Confirmed
Lin Shuicheng (shuicheng) wrote :

From the nova-compute log, it seems the live migration is skipped because the VM is not in a resize state.
I am not sure whether this is expected behavior or an OpenStack issue. It needs further checking.

From containers/nova-compute-controller-0-a762cb46-4bmff_openstack_nova-compute-7d2d8853a1194fa39de830c24b17f2a08d4a80ef6ebc03ec7b3288633a1d7dc2.log:
{"log":"2019-05-29 10:35:30.014 910810 WARNING nova.compute.resource_tracker [req-f79327eb-229e-4aa0-ab8c-1b600ae7b661 - - - - -] [instance: eecdb099-1e59-435c-91ed-16119c00851e] Instance not resizing, skipping migration.\n","stream":"stdout","time":"2019-05-29T10:35:30.014616414Z"}
{"log":"2019-05-29 10:35:30.014 910810 WARNING nova.compute.resource_tracker [req-f79327eb-229e-4aa0-ab8c-1b600ae7b661 - - - - -] [instance: eecdb099-1e59-435c-91ed-16119c00851e] Instance not resizing, skipping migration.\n","stream":"stdout","time":"2019-05-29T10:35:30.014775046Z"}
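
The container log lines above are JSON objects with `log`, `stream`, and `time` keys, so they can be filtered by instance ID with a few lines of Python. This is a rough sketch under that assumption; `iter_records` and `find_messages` are hypothetical helper names, not part of nova or StarlingX.

```python
import json


def iter_records(lines):
    """Yield parsed JSON records that carry a "log" key; skip anything else."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except ValueError:
            # Not a JSON line (for example a truncated or plain-text line).
            continue
        if isinstance(record, dict) and "log" in record:
            yield record


def find_messages(lines, instance_id, needle):
    """Return (time, message) pairs mentioning both instance_id and needle."""
    hits = []
    for record in iter_records(lines):
        message = record["log"]
        if instance_id in message and needle in message:
            hits.append((record.get("time"), message.rstrip("\n")))
    return hits
```

Calling `find_messages(open(path), "eecdb099-1e59-435c-91ed-16119c00851e", "skipping migration")` on the nova-compute container log should return the two warnings quoted above. The same helper could be pointed at a nova-conductor log with a different `needle`; the exact rollback message text is not shown in this report, so any such search string is an assumption.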

Peng Peng (ppeng) wrote :

The issue was reproduced again on:
Lab: WP_1_2
Load: 20190623T233000Z

New log attached

Cindy Xie (xxie1)
Changed in starlingx:
assignee: Lin Shuicheng (shuicheng) → zhipeng liu (zhipengs)
yong hu (yhu6) wrote :

@peng, can you paste the steps used to create this VM?
@Shuicheng and Zhipeng, we should look into which kinds of VMs can be live migrated and which cannot.

Peng Peng (ppeng) wrote :

[2019-05-29 10:34:14,306] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --boot-volume 7c394ef9-1eff-4a8c-a665-ec785c35ae53 --key-name keypair-tenant2 --flavor 2f5d584c-da23-4cb7-9d39-e74c0c956057 --nic net-id=ee12a870-e702-4ecc-89bf-a35b009f2bee --nic net-id=0ad66478-db8e-4579-ac89-d5d944f6194f tenant2-tis-centos-guest-8 --poll'
[2019-05-29 10:34:33,606] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+-------------------------------------------------+
| Property | Value |
+--------------------------------------+-------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | e38zaiYB9Nvz |
| config_drive | |
| created | 2019-05-29T10:34:17Z |
| description | - |
| flavor:disk | 9 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {"hw:mem_page_size": "2048"} |
| flavor:original_name | live-mig |
| flavor:ram | 1024 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | |
| id | eecdb099-1e59-435c-91ed-16119c00851e |
| image | Attempt to boot from volume - no image supplied |
| key_name | keypair-tenant2 |
| locked | False ...


Peng Peng (ppeng) wrote :
zhipeng liu (zhipengs) wrote :

Hi Peng,

In the ALL_NODES_20190529.144549.tar logs, the pre_live_migration call on controller-1 failed with a timeout:
{"log":"2019-05-29 10:36:00.485 910810 ERROR nova.compute.manager [-] [instance: eecdb099-1e59-435c-91ed-16119c00851e] Pre live migration failed at controller-1: RemoteError: Remote error: ClientException Gateway Time-out (HTTP 504)\n","stream":"stdout","time":"2019-05-29T10:36:00.487664706Z"}
{"log":"[u'Traceback (most recent call last):\\n', u' File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/rpc/server.py\", line 166.
However, this archive contains no logs from controller-1.

In ALL_NODES_20190624.130012.tar, I saw the following log:
Unable to establish connection to http://placement-api.openstack.svc.cluster.local:8778
There was a placement issue that was fixed after 0624.

So please use the latest green daily build to reproduce it and provide new logs.
If possible, please enable nova debug logging!

Thanks!
Zhipeng
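
For triaging archives like the ones mentioned above, the destination host and HTTP status can be pulled out of the "Pre live migration failed" message with a regular expression. This is a sketch only: the pattern is inferred from the single error line quoted in this comment and may not cover other nova versions or message variants, and `parse_pre_live_migration_failure` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one error line quoted in this report (assumption).
PRE_LM_FAILED = re.compile(
    r"\[instance: (?P<instance>[0-9a-f-]{36})\] "
    r"Pre live migration failed at (?P<dest>[^:\s]+): (?P<reason>.*)"
)
HTTP_STATUS = re.compile(r"\(HTTP (\d{3})\)")


def parse_pre_live_migration_failure(message):
    """Return instance, destination, reason and HTTP status, or None."""
    match = PRE_LM_FAILED.search(message)
    if match is None:
        return None
    reason = match.group("reason")
    status = HTTP_STATUS.search(reason)
    return {
        "instance": match.group("instance"),
        "dest": match.group("dest"),
        "reason": reason,
        # The status is optional; not every failure reason carries one.
        "http_status": int(status.group(1)) if status else None,
    }
```

Applied to the error above, this would report destination `controller-1` with HTTP status 504, which points at an API call timing out on the destination side rather than at the instance itself.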

yong hu (yhu6) wrote :

@peng, please monitor this issue on GREEN builds after 0624.

Ghada Khalil (gkhalil) wrote :

See https://bugs.launchpad.net/starlingx/+bug/1837759 which is reporting the same live migration issue consistently.

Ghada Khalil (gkhalil) wrote :

Based on Zhipeng's comment in 1837759
https://bugs.launchpad.net/starlingx/+bug/1837759/comments/12
this bug doesn't seem to be a duplicate of 1837759 given 1837759 was introduced around the middle of July 2019 and this bug is reported from a load built in May 2019.

Removing the duplicate link.

Based on Zhipeng's comment above:
https://bugs.launchpad.net/starlingx/+bug/1830915/comments/9
this should be marked as Fix Released with a pointer to the placement gerrit review if possible.

Gerry Kopec (gerry-kopec) wrote :

I agree with Ghada's comment 12. I suspect this is the same issue as https://bugs.launchpad.net/starlingx/+bug/1829062, so it should already be fixed.

Ghada Khalil (gkhalil) wrote :

Marking as Fix Released; the issue is addressed by the following placement commits:
https://review.opendev.org/#/c/662371/
https://review.opendev.org/#/c/662614/

Changed in starlingx:
status: Confirmed → Fix Released
Peng Peng (ppeng) wrote :

The issue has not been reproduced recently.

tags: removed: stx.retestneeded