live_migrate failed on a VM in an anti_affinity server group

Bug #1824167 reported by Peng Peng
Affects                    Status     Importance  Assigned to  Milestone
OpenStack Compute (nova)   New        Undecided   Unassigned   -
StarlingX                  Confirmed  High        yong hu      -

Bug Description

Brief Description
-----------------
Create a server group with the policy set to anti_affinity; boot 2 VMs and make sure both VMs are in this server group; VM_1 live-migrates successfully, but live migration of VM_2 fails.

Severity
--------
Major

Steps to Reproduce
------------------
As per the brief description above (a hedged command sketch follows below).
....
TC-name: testcases/functional/nova/test_server_group.py::test_server_group_boot_vms[anti_affinity-2]
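
A minimal command sketch of the flow above (hedged: the image, flavor and network names are placeholders, and the server group UUID comes from the create step; the live-migration and show commands match those in the logs below):

$ openstack server group create --policy anti-affinity grp_anti
$ openstack server create --image <image> --flavor <flavor> --network <network> \
    --hint group=<server_group_uuid> vm_1
$ openstack server create --image <image> --flavor <flavor> --network <network> \
    --hint group=<server_group_uuid> vm_2
# confirm the two VMs landed on different hosts
$ openstack server show <vm_uuid> | grep OS-EXT-SRV-ATTR:host
# live-migrate each VM and re-check its host
$ nova live-migration <vm_uuid>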

Expected Behavior
------------------
VM_2 live_migrate to different host

Actual Behavior
----------------
VM_2 stays at same host

Reproducibility
---------------
Reproducible
100%

System Configuration
--------------------
Multi-node system (3 compute nodes)

Lab-name: WCP_99-103

Branch/Pull Time/Commit
-----------------------
stx master as of 20190410T013000Z

Last Pass
---------
20190408T233001Z

Timestamp/Logs
--------------
[2019-04-10 11:41:13,777] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne show dd2b7c56-0f3c-4a65-b8b1-31f36f63fb2f'
[2019-04-10 11:41:15,229] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-1 |

[2019-04-10 11:41:15,333] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration dd2b7c56-0f3c-4a65-b8b1-31f36f63fb2f'

[2019-04-10 11:42:47,654] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne show dd2b7c56-0f3c-4a65-b8b1-31f36f63fb2f'
[2019-04-10 11:42:48,964] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-1 |

Test Activity
-------------
Regression Testing

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; live migration failure - 100% reproducible as per the reporter.

Note that the last pass would have been with the nova master docker image. This execution would have likely used the nova stein docker image. Maybe there's a clue there.

tags: added: stx.2.0 stx.distro.openstack
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Bruce Jones (brucej)
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Bruce Jones (brucej) wrote :

Expecting this to be resolved by the backport of Artom's changes from Train.

Changed in starlingx:
assignee: Bruce Jones (brucej) → Gerry Kopec (gerry-kopec)
Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

I think this is separate from Artom's changes. We need to retest/investigate this on a more recent build.

Revision history for this message
Frank Miller (sensfan22) wrote :

Bruce, can you try to find another person to prime this LP, as Gerry is tied up finishing his SB and other stx.2.0 gating LPs?

Changed in starlingx:
assignee: Gerry Kopec (gerry-kopec) → Bruce Jones (brucej)
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

@ppeng, please provide a bit more info:
1. What was your deployment like? For example, 2+2+2 (compute nodes)?
2. Server group details: the output of "openstack server group list" and "openstack server group show <server_group_uuid>" (example commands below).
3. Before the live migrations, where were VM_1 and VM_2 running, respectively?
4. After the live migrations, when you say "successfully live_migrate VM_1, but failed to live_migrate VM_2", where was VM_1 located? Was it migrated to a different host?
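
For reference, a hedged example of the commands that would collect this information (the UUIDs are placeholders):

$ openstack server group list
$ openstack server group show <server_group_uuid>
# before and after each live migration:
$ openstack server show <vm_uuid> | grep -E 'OS-EXT-SRV-ATTR:host|status'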

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

The system is 2+3. Please check the attached test execution log for all the information requested.

Revision history for this message
yong hu (yhu6) wrote :

I deployed a 2+3 StarlingX system and created 2 VMs by following the operations captured in "test_execution_log.txt". Findings:

1. Initially, the 2 VMs were scheduled to different compute nodes out of the 3 (none of which was occupied to start with), with the server group using anti-affinity as its policy:
controller-0:~$ openstack server show 60f5707f-2047-4d31-b852-b687de4b80f9 | grep host
| OS-EXT-SRV-ATTR:host | compute-0 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-0 |
| hostId | 2ef50640a2d5bde91dc5f9c480fec273f05d35cd7d4b617753833dfc |
controller-0:~$ openstack server show 10b25605-93d0-4a5b-bd34-2a6c32585d22 | grep host
| OS-EXT-SRV-ATTR:host | compute-1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1 |

2. With live migration, there were 2 different results:
2.1) When live-migrating VM#1 60f5707f-2047-4d31-b852-b687de4b80f9, waiting for more than 1 minute, and then live-migrating VM#2 10b25605-93d0-4a5b-bd34-2a6c32585d22, the two VMs ended up on 2 different compute nodes among the 3.

2.2) When live-migrating these 2 VMs without enough of an interval between them, both were very likely to be scheduled to the 3rd compute node, which previously had no VMs.

So the conclusion is that the "anti-affinity" policy in the server group does work, but "nova-scheduler" takes time to refresh the status of the "nova-compute" nodes. If the next scheduling request (triggered by either a live migration or a new VM creation) arrives too soon, i.e. before the status refresh is done, the scheduler simply picks one of the other nodes (other than the current node), so it is possible to end up with 2 VMs on the same compute node (the 3rd node).

Moreover, with more than 3 compute nodes, the chance of such a "collision" when live-migrating 2 VMs would be smaller, because at any time there are at least 2 zero-loaded nodes.

In short: the "anti-affinity" functionality is correct, but "nova-scheduler" performance in terms of status updates is somewhat poor.
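
Based on the above, a hedged workaround sketch from the test side: serialize the two live migrations and wait until the first VM is ACTIVE on a different host before requesting the second. The UUIDs are the ones from this test; the fields queried are the standard server show fields already used above.

VM1=60f5707f-2047-4d31-b852-b687de4b80f9
VM2=10b25605-93d0-4a5b-bd34-2a6c32585d22
SRC=$(openstack server show $VM1 -f value -c OS-EXT-SRV-ATTR:host)
nova live-migration $VM1
# poll until VM1 is ACTIVE on a different host, then migrate VM2
while true; do
    HOST=$(openstack server show $VM1 -f value -c OS-EXT-SRV-ATTR:host)
    STATUS=$(openstack server show $VM1 -f value -c status)
    [ "$HOST" != "$SRC" ] && [ "$STATUS" = "ACTIVE" ] && break
    sleep 10
done
nova live-migration $VM2

This only works around the stale host state seen by the scheduler; it does not address the underlying refresh delay.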

yong hu (yhu6)
Changed in starlingx:
status: Incomplete → Confirmed
Peng Peng (ppeng)
tags: removed: stx.retestneeded
Yang Liu (yliu12)
tags: added: stx.retestneeded