test_parallel_evacuate_with_server_group intermittently fails

Bug #1763181 reported by Matt Riedemann
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
Unassigned

Bug Description

http://logs.openstack.org/54/560454/1/gate/nova-tox-functional-py35/40cf8c5/job-output.txt.gz#_2018-04-11_18_49_13_614017

2018-04-11 18:49:13.618886 | ubuntu-xenial | Captured traceback:
2018-04-11 18:49:13.618960 | ubuntu-xenial | ~~~~~~~~~~~~~~~~~~~
2018-04-11 18:49:13.619063 | ubuntu-xenial | b'Traceback (most recent call last):'
2018-04-11 18:49:13.619319 | ubuntu-xenial | b' File "/home/zuul/src/git.openstack.org/openstack/nova/nova/tests/functional/regressions/test_bug_1735407.py", line 158, in test_parallel_evacuate_with_server_group'
2018-04-11 18:49:13.619446 | ubuntu-xenial | b" server2['OS-EXT-SRV-ATTR:host'])"
2018-04-11 18:49:13.619949 | ubuntu-xenial | b' File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional-py35/lib/python3.5/site-packages/unittest2/case.py", line 845, in assertNotEqual'
2018-04-11 18:49:13.620066 | ubuntu-xenial | b' raise self.failureException(msg)'
2018-04-11 18:49:13.620172 | ubuntu-xenial | b"AssertionError: 'host3' == 'host3'"
2018-04-11 18:49:13.620237 | ubuntu-xenial | b''

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22AssertionError%3A%20'host3'%20%3D%3D%20'host3'%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

10 hits in 7 days, check and gate, all failures.

Matt Riedemann (mriedem)
tags: added: gate-failure
Revision history for this message
Chris Friesen (cbf123) wrote :

My suspicion is that when we call _validate_instance_group_policy() which calls group.get_hosts(), that we will only return the original hosts where they're evacuating *from*, not the host they're both evacuating *to*, because we don't set instance.host to the new host until the evacuation is basically done.

There's a second thing to worry about when we fix this, which is that we don't want both instances to detect a collision and fail...ideally we'd like one of them to succeed. The way we've done this internally is for each instance to compare itself against other instances in the same group that either 1) have id numbers smaller than the current instance so they were created earlier, or 2) have id numbers larger than the current instance but that have a vm_state that isn't BUILDING and a task_state that is None so that they are "established" on this host and won't be moving.

Revision history for this message
Chris Friesen (cbf123) wrote :

I think for the evacuate case (as well as migration/resize and live-migration) we'd want to do something like:

compare_instances = [i for i in instance_group.members if (i.id < instance_id or (i.id > instance_id and (instance has passed late-state group policy validation)))]

hostlist = list(set([i.host for i in compare_instances if i.host]))

migration_hostlist = (set of source and dest hosts for all instances in compare_instances that are currently migrating/live-migrating/evacuating/resizing)

Then combine the two lists of hosts and use that for checking against in the late-stage validation.

Revision history for this message
Matthew Booth (mbooth-9) wrote :

There's definitely an obvious race here which I think Chris is describing in comment 1? I'm going to describe it again in my own words, just to be sure:

In _validate_instance_group_policy, whilst holding a lock on the name of the group we check the members of the group against InstanceList.get_uuids_by_host()[1]. However, at the time of executing this method, instance.host still refers to the source, so the current instance will not be returned in that list. instance.host is updated much later outside the critical section. The sequence of the race is therefore:

Instance A Instance B

instance.host = source instance.host = source
_validate_instance_group_policy() _validate_instance_group_policy()
  SUCCESS -> Neither A nor B on this host SUCCESS -> Neither A nor B on this host
... rebuild ... ... rebuild ...
instance.host = dest instance.host = dest

Note that to use a lock here we would need to update instance.host *in the same critical section as the test*. There may be other ways to achieve this, but this is what needs to happen in order to use a regular lock.

[1] Note that this critical section doesn't modify any data and this lock isn't used anywhere else which does modify data, so this lock serves no purpose. If we don't modify the code to make use of the critical section we should remove this lock: it achieves nothing and worse it seems to have confused folks.

Chris, I read your instance id comparison thing a few times but I couldn't parse it. I don't understand how this would work. Is there some StarlingX code to look at?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/605436

Changed in nova:
assignee: nobody → Matthew Booth (mbooth-9)
status: Confirmed → In Progress
Revision history for this message
Chris Friesen (cbf123) wrote :

The rationale for the check against "older" instances is to try to avoid the case where two anti-affinity instances land on the same node, they *both* detect problems, and they *both* reschedule, potentially landing on the same node again. Ideally we want only one of them to reschedule and the other to continue.

The starlingx server group validation stuff is here:
https://github.com/starlingx-staging/stx-nova/blob/master/nova/compute/manager.py#L1408-L1436

The check against "older" instances is here:
https://github.com/starlingx-staging/stx-nova/blob/master/nova/objects/instance_group.py#L550-L572

Note that the check is called get_older_member_hosts() because it used to just check for instances that were created before the instance in question, but now it also looks at "newer" instances which are "further ahead" in being created. (To account for out-of-order processing.)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/607620

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/607620
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=14a3b1318cf84983e1c029ffa1d3a7b2d0223779
Submitter: Zuul
Branch: master

commit 14a3b1318cf84983e1c029ffa1d3a7b2d0223779
Author: Matt Riedemann <email address hidden>
Date: Wed Oct 3 11:42:15 2018 -0400

    Skip test_parallel_evacuate_with_server_group until fixed

    This test has a high failure rate so let's skip it until
    the bug is fixed:

    http://status.openstack.org/elastic-recheck/#1763181

    Change-Id: Idcd2e7118c8d985d6d2cc4b12c08388406229bbf
    Related-Bug: #1763181

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/608417

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/649963

Changed in nova:
assignee: Matthew Booth (mbooth-9) → Boxiang Zhu (bxzhu-5355)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/686402

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/686402
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c072993e0953e9bd45ce4265414b8cbefe1ca7a1
Submitter: Zuul
Branch: stable/rocky

commit c072993e0953e9bd45ce4265414b8cbefe1ca7a1
Author: Matt Riedemann <email address hidden>
Date: Wed Oct 3 11:42:15 2018 -0400

    Skip test_parallel_evacuate_with_server_group until fixed

    This test has a high failure rate so let's skip it until
    the bug is fixed:

    http://status.openstack.org/elastic-recheck/#1763181

    Change-Id: Idcd2e7118c8d985d6d2cc4b12c08388406229bbf
    Related-Bug: #1763181
    (cherry picked from commit 14a3b1318cf84983e1c029ffa1d3a7b2d0223779)

tags: added: in-stable-rocky
Lee Yarwood (lyarwood)
Changed in nova:
status: In Progress → Triaged
assignee: Boxiang Zhu (bxzhu-5355) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Boxiang Zhu <zhu.boxiang@99cloud.net>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/649963

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers