live_migrate failed on a VM in an anti_affinity server group

Bug #1824167 reported by Peng Peng
Affects                    Status     Importance  Assigned to  Milestone
OpenStack Compute (nova)   New        Undecided   Unassigned   -
StarlingX                  Confirmed  High        yong hu      -

Bug Description

Brief Description
-----------------
Create a server group with the policy set to anti_affinity; boot 2 VMs and make sure both VMs are in this server group; VM_1 live-migrates successfully, but live migration of VM_2 fails.

Severity
--------
Major

Steps to Reproduce
------------------
As per the brief description above (a hedged command sketch follows below).
....
TC-name: testcases/functional/nova/test_server_group.py::test_server_group_boot_vms[anti_affinity-2]
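
A minimal command sketch of the flow above (hedged: the image, flavor and network names are placeholders, and the server group UUID comes from the create step; the live-migration and show commands match those in the logs below):

$ openstack server group create --policy anti-affinity grp_anti
$ openstack server create --image <image> --flavor <flavor> --network <network> \
    --hint group=<server_group_uuid> vm_1
$ openstack server create --image <image> --flavor <flavor> --network <network> \
    --hint group=<server_group_uuid> vm_2
# confirm the two VMs landed on different hosts
$ openstack server show <vm_uuid> | grep OS-EXT-SRV-ATTR:host
# live-migrate each VM and re-check its host
$ nova live-migration <vm_uuid>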

Expected Behavior
------------------
VM_2 live_migrate to different host

Actual Behavior
----------------
VM_2 stays at same host

Reproducibility
---------------
Reproducible
100%

System Configuration
--------------------
Multi-node system (3 compute nodes)

Lab-name: WCP_99-103

Branch/Pull Time/Commit
-----------------------
stx master as of 20190410T013000Z

Last Pass
---------
20190408T233001Z

Timestamp/Logs
--------------
[2019-04-10 11:41:13,777] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne show dd2b7c56-0f3c-4a65-b8b1-31f36f63fb2f'
[2019-04-10 11:41:15,229] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-1 |

[2019-04-10 11:41:15,333] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration dd2b7c56-0f3c-4a65-b8b1-31f36f63fb2f'

[2019-04-10 11:42:47,654] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne show dd2b7c56-0f3c-4a65-b8b1-31f36f63fb2f'
[2019-04-10 11:42:48,964] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-1 |

Test Activity
-------------
Regression Testing

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; live migration failure - 100% reproducible as per the reporter.

Note that the last pass would have been with the nova master docker image. This execution would have likely used the nova stein docker image. Maybe there's a clue there.

tags: added: stx.2.0 stx.distro.openstack
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Bruce Jones (brucej)
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Bruce Jones (brucej) wrote :

Expecting this to be resolved by the backport of Artom's changes from Train.

Changed in starlingx:
assignee: Bruce Jones (brucej) → Gerry Kopec (gerry-kopec)
Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

I think this is separate from Artom's changes. We need to retest/investigate this on a more recent build.

Revision history for this message
Frank Miller (sensfan22) wrote :

Bruce, can you try to find another person to prime this LP, as Gerry is tied up finishing his SB and other stx.2.0 gating LPs?

Changed in starlingx:
assignee: Gerry Kopec (gerry-kopec) → Bruce Jones (brucej)
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

@ppeng, please provide a bit more info:
1. What was your deployment like? For example, 2+2+2 (compute nodes)?
2. Server group details: the output of "openstack server group list" and "openstack server group show <server_group_uuid>" (example commands below).
3. Before the live migrations, where were VM_1 and VM_2 running, respectively?
4. After the live migrations, when you say "successfully live_migrate VM_1, but failed to live_migrate VM_2", where was VM_1 located? Was it migrated to a different host?
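
For reference, a hedged example of the commands that would collect this information (the UUIDs are placeholders):

$ openstack server group list
$ openstack server group show <server_group_uuid>
# before and after each live migration:
$ openstack server show <vm_uuid> | grep -E 'OS-EXT-SRV-ATTR:host|status'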

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

The system is 2+3. Please check the attached test execution log for all the information requested.

Revision history for this message
yong hu (yhu6) wrote :

I deployed a 2+3 StarlingX system and created 2 VMs by following the operations captured in "test_execution_log.txt". Findings:

1. Initially, the 2 VMs were scheduled to different compute nodes out of the 3 (none of which was occupied to start with), with the server group using anti-affinity as its policy:
controller-0:~$ openstack server show 60f5707f-2047-4d31-b852-b687de4b80f9 | grep host
| OS-EXT-SRV-ATTR:host | compute-0 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-0 |
| hostId | 2ef50640a2d5bde91dc5f9c480fec273f05d35cd7d4b617753833dfc |
controller-0:~$ openstack server show 10b25605-93d0-4a5b-bd34-2a6c32585d22 | grep host
| OS-EXT-SRV-ATTR:host | compute-1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1 |

2. With live migration, there were 2 different results:
2.1) When live-migrating VM#1 60f5707f-2047-4d31-b852-b687de4b80f9, waiting for more than 1 minute, and then live-migrating VM#2 10b25605-93d0-4a5b-bd34-2a6c32585d22, the two VMs ended up on 2 different compute nodes among the 3.

2.2) When live-migrating these 2 VMs without enough of an interval between them, both were very likely to be scheduled to the 3rd compute node, which previously had no VMs.

So the conclusion is that the "anti-affinity" policy in the server group does work, but "nova-scheduler" takes time to refresh the status of the "nova-compute" nodes. If the next scheduling request (triggered by either a live migration or a new VM creation) arrives too soon, i.e. before the status refresh is done, the scheduler simply picks one of the other nodes (other than the current node), so it is possible to end up with 2 VMs on the same compute node (the 3rd node).

Moreover, with more than 3 compute nodes, the chance of such a "collision" when live-migrating 2 VMs would be smaller, because at any time there are at least 2 zero-loaded nodes.

In short: the "anti-affinity" functionality is correct, but "nova-scheduler" performance in terms of status updates is somewhat poor.
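
Based on the above, a hedged workaround sketch from the test side: serialize the two live migrations and wait until the first VM is ACTIVE on a different host before requesting the second. The UUIDs are the ones from this test; the fields queried are the standard server show fields already used above.

VM1=60f5707f-2047-4d31-b852-b687de4b80f9
VM2=10b25605-93d0-4a5b-bd34-2a6c32585d22
SRC=$(openstack server show $VM1 -f value -c OS-EXT-SRV-ATTR:host)
nova live-migration $VM1
# poll until VM1 is ACTIVE on a different host, then migrate VM2
while true; do
    HOST=$(openstack server show $VM1 -f value -c OS-EXT-SRV-ATTR:host)
    STATUS=$(openstack server show $VM1 -f value -c status)
    [ "$HOST" != "$SRC" ] && [ "$STATUS" = "ACTIVE" ] && break
    sleep 10
done
nova live-migration $VM2

This only works around the stale host state seen by the scheduler; it does not address the underlying refresh delay.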

yong hu (yhu6)
Changed in starlingx:
status: Incomplete → Confirmed
Peng Peng (ppeng)
tags: removed: stx.retestneeded
Yang Liu (yliu12)
tags: added: stx.retestneeded