error in the way we poll for instance states in test (and maybe in regular operation) lead to false positive timeouts

Bug #1651091 reported by Amrith Kumar
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack DBaaS (Trove)
In Progress
Undecided
Unassigned

Bug Description

Consider the errors reported in

http://logs.openstack.org/99/412299/1/check/gate-trove-scenario-dsvm-pxc-single-ubuntu-xenial/6851539/console.html#_2016-12-19_07_40_12_094217

The gist of it is that two instances are being polled for state transition to BUILD and then ACTIVE.

Waiting for states (['BUILD', 'ACTIVE']) for instances: [u'8beac1c9-a6b6-4d7b-9d55-b26f6ca6c1dc', u'ff4a317d-59e7-4d6f-946a-65b17c526282']

What ends up happening is this:

2016-12-19 07:40:12.165091 | 2016-12-19 07:40:12.164 | [REPORT] Waiting for states (['BUILD', 'ACTIVE']) for instances: [u'8beac1c9-a6b6-4d7b-9d55-b26f6ca6c1dc', u'ff4a317d-59e7-4d6f-946a-65b17c526282']
2016-12-19 07:40:12.167477 | 2016-12-19 07:40:12.167 | [REPORT] Waiting for states (['BUILD', 'ACTIVE']) for instance: 8beac1c9-a6b6-4d7b-9d55-b26f6ca6c1dc
2016-12-19 07:40:12.185347 | 2016-12-19 07:40:12.184 | [REPORT] Polling instance '8beac1c9-a6b6-4d7b-9d55-b26f6ca6c1dc' for state 'BUILD', was 'ACTIVE'.
2016-12-19 07:40:12.187654 | 2016-12-19 07:40:12.187 | [REPORT] Instance state was not 'BUILD', moving to the next expected state.
2016-12-19 07:40:12.209527 | 2016-12-19 07:40:12.209 | [REPORT] Polling instance '8beac1c9-a6b6-4d7b-9d55-b26f6ca6c1dc' for state 'ACTIVE', was 'ACTIVE'.
2016-12-19 07:40:12.212055 | 2016-12-19 07:40:12.211 | [REPORT] Waiting for states (['BUILD', 'ACTIVE']) for instance: ff4a317d-59e7-4d6f-946a-65b17c526282
2016-12-19 07:40:12.231413 | 2016-12-19 07:40:12.230 | [REPORT] Polling instance 'ff4a317d-59e7-4d6f-946a-65b17c526282' for state 'BUILD', was 'BUILD'.
2016-12-19 07:40:12.247751 | 2016-12-19 07:40:12.247 | [REPORT] Polling instance '8beac1c9-a6b6-4d7b-9d55-b26f6ca6c1dc' for state 'ACTIVE', was 'ACTIVE'.
2016-12-19 07:40:12.263342 | 2016-12-19 07:40:12.262 | [REPORT] Polling instance 'ff4a317d-59e7-4d6f-946a-65b17c526282' for state 'BUILD', was 'ACTIVE'.
2016-12-19 07:40:12.266257 | 2016-12-19 07:40:12.265 | [REPORT] Instance '8beac1c9-a6b6-4d7b-9d55-b26f6ca6c1dc' has gone 'ACTIVE' in 2.2s.
2016-12-19 07:40:12.284736 | 2016-12-19 07:40:12.283 | [REPORT] Polling instance 'ff4a317d-59e7-4d6f-946a-65b17c526282' for state 'BUILD', was 'ACTIVE'.
2016-12-19 07:40:12.305090 | 2016-12-19 07:40:12.304 | [REPORT] Polling instance 'ff4a317d-59e7-4d6f-946a-65b17c526282' for state 'BUILD', was 'ACTIVE'.
2016-12-19 07:40:12.324840 | 2016-12-19 07:40:12.324 | [REPORT] Polling instance 'ff4a317d-59e7-4d6f-946a-65b17c526282' for state 'BUILD', was 'ACTIVE'.
[...]
2016-12-19 07:40:14.828223 | 2016-12-19 07:40:14.827 | [REPORT] Status of instance 'ff4a317d-59e7-4d6f-946a-65b17c526282' did not change to 'BUILD' after 1201.9s.

Revision history for this message
Amrith Kumar (amrith) wrote :

The race here is as follows

   477 def _assert_instance_states(self, instance_id, expected_states,
   478 fast_fail_status=None,
   479 require_all_states=False):
   480 """Keep polling for the expected instance states until the instance
   481 acquires either the last or fast-fail state.
   482
   483 If the instance state does not match the state expected at the time of
   484 polling (and 'require_all_states' is not set) the code assumes the
   485 instance had already acquired before and moves to the next expected
   486 state.
   487 """
   488
   489 self.report.log("Waiting for states (%s) for instance: %s" %
   490 (expected_states, instance_id))
   491
   492 if fast_fail_status is None:
   493 fast_fail_status = ['ERROR', 'FAILED']
   494 found = False
   495 for status in expected_states:
   496 if require_all_states or found or self._has_status(
   497 instance_id, status, fast_fail_status=fast_fail_status):
   498 found = True
   499 start_time = timer.time()
   500 try:
   501 poll_until(lambda: self._has_status(
   502 instance_id, status,
   503 fast_fail_status=fast_fail_status),
   504 sleep_time=self.def_sleep_time,
   505 time_out=self.def_timeout)
   506 self.report.log("Instance '%s' has gone '%s' in %s." %
   507 (instance_id, status,
   508 self._time_since(start_time)))
   509 except exception.PollTimeOut:
   510 self.report.log(
   511 "Status of instance '%s' did not change to '%s' "
   512 "after %s."
   513 % (instance_id, status, self._time_since(start_time)))
   514 return False
   515 else:
   516 self.report.log(
   517 "Instance state was not '%s', moving to the next expected "
   518 "state." % status)
   519
   520 return found

The code at 496 finds the instance to have the right state and goes into the try at 501 but the instance has changed state by that point and the test at 501 now times out.

Revision history for this message
Amrith Kumar (amrith) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/412489

Changed in trove:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on trove (master)

Change abandoned by amrith (<email address hidden>) on branch: master
Review: https://review.openstack.org/412489
Reason: "abandoned for inactivity, over 3 weeks old, has a negative review comment and is not currently marked as a WIP or WF-1. please restore the change if you wish to work on it. (project:openstack/trove-specs OR project:openstack/trove OR project:openstack/trove-dashboard OR project:openstack/python-troveclient OR project:openstack/trove-integration) status:open age:3w label:Code-Review=-1 (NOT label:Workflow<0)"

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by amrith (<email address hidden>) on branch: master
Review: https://review.openstack.org/412489

Revision history for this message
Amrith Kumar (amrith) wrote :
Amrith Kumar (amrith)
Changed in trove:
assignee: Amrith Kumar (amrith) → nobody
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.