OpenStack Compute (nova)

test_parallel_evacuate_with_server_group intermittently fails

Bug #1763181 reported by Matt Riedemann on 2018-04-11

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Triaged	Medium	Unassigned

Bug Description

http://logs.openstack.org/54/560454/1/gate/nova-tox-functional-py35/40cf8c5/job-output.txt.gz#_2018-04-11_18_49_13_614017

2018-04-11 18:49:13.618886 | ubuntu-xenial | Captured traceback:
2018-04-11 18:49:13.618960 | ubuntu-xenial | ~~~~~~~~~~~~~~~~~~~
2018-04-11 18:49:13.619063 | ubuntu-xenial | b'Traceback (most recent call last):'
2018-04-11 18:49:13.619319 | ubuntu-xenial | b' File "/home/zuul/src/git.openstack.org/openstack/nova/nova/tests/functional/regressions/test_bug_1735407.py", line 158, in test_parallel_evacuate_with_server_group'
2018-04-11 18:49:13.619446 | ubuntu-xenial | b" server2['OS-EXT-SRV-ATTR:host'])"
2018-04-11 18:49:13.619949 | ubuntu-xenial | b' File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional-py35/lib/python3.5/site-packages/unittest2/case.py", line 845, in assertNotEqual'
2018-04-11 18:49:13.620066 | ubuntu-xenial | b' raise self.failureException(msg)'
2018-04-11 18:49:13.620172 | ubuntu-xenial | b"AssertionError: 'host3' == 'host3'"
2018-04-11 18:49:13.620237 | ubuntu-xenial | b''

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22AssertionError%3A%20'host3'%20%3D%3D%20'host3'%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

10 hits in 7 days, check and gate, all failures.

Tags:

Matt Riedemann (mriedem) on 2018-04-12

tags:

added: gate-failure

Revision history for this message

Chris Friesen (cbf123) wrote on 2018-04-12:

My suspicion is that when we call _validate_instance_group_policy() which calls group.get_hosts(), that we will only return the original hosts where they're evacuating *from*, not the host they're both evacuating *to*, because we don't set instance.host to the new host until the evacuation is basically done.

There's a second thing to worry about when we fix this, which is that we don't want both instances to detect a collision and fail...ideally we'd like one of them to succeed. The way we've done this internally is for each instance to compare itself against other instances in the same group that either 1) have id numbers smaller than the current instance so they were created earlier, or 2) have id numbers larger than the current instance but that have a vm_state that isn't BUILDING and a task_state that is None so that they are "established" on this host and won't be moving.

Revision history for this message

Chris Friesen (cbf123) wrote on 2018-04-12:

I think for the evacuate case (as well as migration/resize and live-migration) we'd want to do something like:

compare_instances = [i for i in instance_group.members if (i.id < instance_id or (i.id > instance_id and (instance has passed late-state group policy validation)))]

hostlist = list(set([i.host for i in compare_instances if i.host]))

migration_hostlist = (set of source and dest hosts for all instances in compare_instances that are currently migrating/live-migrating/evacuating/resizing)

Then combine the two lists of hosts and use that for checking against in the late-stage validation.

Revision history for this message

Matthew Booth (mbooth-9) wrote on 2018-09-26:

There's definitely an obvious race here which I think Chris is describing in comment 1? I'm going to describe it again in my own words, just to be sure:

In _validate_instance_group_policy, whilst holding a lock on the name of the group we check the members of the group against InstanceList.get_uuids_by_host()[1]. However, at the time of executing this method, instance.host still refers to the source, so the current instance will not be returned in that list. instance.host is updated much later outside the critical section. The sequence of the race is therefore:

Instance A Instance B

instance.host = source instance.host = source
_validate_instance_group_policy() _validate_instance_group_policy()
SUCCESS -> Neither A nor B on this host SUCCESS -> Neither A nor B on this host
... rebuild ... ... rebuild ...
instance.host = dest instance.host = dest

Note that to use a lock here we would need to update instance.host *in the same critical section as the test*. There may be other ways to achieve this, but this is what needs to happen in order to use a regular lock.

[1] Note that this critical section doesn't modify any data and this lock isn't used anywhere else which does modify data, so this lock serves no purpose. If we don't modify the code to make use of the critical section we should remove this lock: it achieves nothing and worse it seems to have confused folks.

Chris, I read your instance id comparison thing a few times but I couldn't parse it. I don't understand how this would work. Is there some StarlingX code to look at?

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-09-26: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/605436

Changed in nova:
assignee:	nobody → Matthew Booth (mbooth-9)
status:	Confirmed → In Progress

Revision history for this message

Chris Friesen (cbf123) wrote on 2018-09-26:

The rationale for the check against "older" instances is to try to avoid the case where two anti-affinity instances land on the same node, they *both* detect problems, and they *both* reschedule, potentially landing on the same node again. Ideally we want only one of them to reschedule and the other to continue.

The starlingx server group validation stuff is here:
https://github.com/starlingx-staging/stx-nova/blob/master/nova/compute/manager.py#L1408-L1436

The check against "older" instances is here:
https://github.com/starlingx-staging/stx-nova/blob/master/nova/objects/instance_group.py#L550-L572

Note that the check is called get_older_member_hosts() because it used to just check for instances that were created before the instance in question, but now it also looks at "newer" instances which are "further ahead" in being created. (To account for out-of-order processing.)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-03: Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/607620

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-04: Related fix merged to nova (master)

Reviewed: https://review.openstack.org/607620
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=14a3b1318cf84983e1c029ffa1d3a7b2d0223779
Submitter: Zuul
Branch: master

commit 14a3b1318cf84983e1c029ffa1d3a7b2d0223779
Author: Matt Riedemann <email address hidden>
Date: Wed Oct 3 11:42:15 2018 -0400

Skip test_parallel_evacuate_with_server_group until fixed

This test has a high failure rate so let's skip it until
the bug is fixed:

http://status.openstack.org/elastic-recheck/#1763181

Change-Id: Idcd2e7118c8d985d6d2cc4b12c08388406229bbf
Related-Bug: #1763181

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-06: Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/608417

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-04: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/649963

Changed in nova:
assignee:	Matthew Booth (mbooth-9) → Boxiang Zhu (bxzhu-5355)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-10-03: Related fix proposed to nova (stable/rocky)

#10

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/686402

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-10-23: Related fix merged to nova (stable/rocky)

#11

Reviewed: https://review.opendev.org/686402
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c072993e0953e9bd45ce4265414b8cbefe1ca7a1
Submitter: Zuul
Branch: stable/rocky

commit c072993e0953e9bd45ce4265414b8cbefe1ca7a1
Author: Matt Riedemann <email address hidden>
Date: Wed Oct 3 11:42:15 2018 -0400

Skip test_parallel_evacuate_with_server_group until fixed

This test has a high failure rate so let's skip it until
the bug is fixed:

http://status.openstack.org/elastic-recheck/#1763181

    Change-Id: Idcd2e7118c8d985d6d2cc4b12c08388406229bbf
    Related-Bug: #1763181
    (cherry picked from commit 14a3b1318cf84983e1c029ffa1d3a7b2d0223779)

tags:

added: in-stable-rocky

Lee Yarwood (lyarwood) on 2021-06-10

Changed in nova:
status:	In Progress → Triaged
assignee:	Boxiang Zhu (bxzhu-5355) → nobody

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-11: Change abandoned on nova (master)

#12

Change abandoned by "Boxiang Zhu <zhu.boxiang@99cloud.net>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/649963

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

Bug #1768473

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.