Tempest tests "test_promote_out_of_sync_share_replica" and "test_resync_share_replica" are concurrency-prone

Bug #1631314 reported by Valeriy Ponomaryov on 2016-10-07
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Manila
Medium
Douglas Viroel

Bug Description

Recently, in CI, was observed error in one of replication tests [1].

Where we do following:
1) Create share
2) Create replica
3) Reset state of replica to "out_of_sync"
4) Wait for updated "out_of_sync" status
5) Try to promote replica

And there is possible concurrency between steps 3 and 4 when we can get periodic replica update that occurred in case of observed error [1].

Proof:

Call of API comes first [2], then arrives periodic update [3] and, finally, tempest test tries to get updated status in [4] and sees that it has "in_sync" status as was [1].

[1] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/tempest_2/testr_results.html.gz

[2] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/screen-m-api.txt.gz#_2016-10-07_05_55_42_938

[3] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/screen-m-shr.txt.gz#_2016-10-07_05_55_43_046

[4] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/screen-m-api.txt.gz#_2016-10-07_05_55_43_175

summary: - Concurrency of periodic replica update call and manual replica status
- update
+ Tempest test "test_promote_out_of_sync_share_replica" is concurrency-
+ prone
description: updated
description: updated
description: updated
tags: added: replication tempest
Changed in manila:
assignee: nobody → Goutham Pacha Ravi (gouthamr)
milestone: none → ocata-1
status: New → Confirmed
Goutham Pacha Ravi (gouthamr) wrote :

Okay, after discussing with Valeriy, the approach to get this right is to:

1) Yank out the tests that attempt to reset state in order to test a particular scenario (reset replica_state to out_of_sync and attempt promoting as user, and as admin; reset replica_state to error and attempt to promote; reset replica_state to out_of_sync and attempt re-syncing); or:

2) Allow for such tests to have more meaning by introducing an API to enable/disable periodic replica update on demand, per share.

We can discuss these solutions, but can I put up a patch to sanely turn off these tests for the dummy driver?

Aashaka Shah (aashah) wrote :

Is this bug still present? If so, I would like to contribute a patch for it. The log files in the references are no longer online. Are they still up somewhere else?

Goutham Pacha Ravi (gouthamr) wrote :

@Aashaka Shah, Yes. this bug is still present. However, the prevalence hasn't been documented of late.

The behavior here, as Valeriy noted is because there's an update thread in the share manager service that polls for the health of replicas and updates their health status. This thread is not controlled via the API.

When an administrator requests the state to be explicitly changed to 'out_of_sync' (for testing purposes) via the reset-replica-state API, the state change is directly made on the database, however, the update thread may run and change the state back to 'in_sync' (because that's what it truly is). The test currently waits for state 'out_of_sync' and times out.

We need a way for this test to pass deterministically and catch real regressions. We can go with the methods I described in my previous response, or, somehow make the state change persist these sort of asynchronous updates from the update thread.

I'd be glad to help you fix this bug, or provide any details you might need.

Changed in manila:
assignee: Goutham Pacha Ravi (gouthamr) → nobody
milestone: ocata-1 → none
importance: High → Medium
Tom Barron (tpb) on 2018-06-19
tags: added: races
Tom Barron (tpb) wrote :

We hit this one today in dummy back end gate test for https://review.openstack.org/#/c/637626/ - manila_tempest_tests.tests.api.test_replication_negative.ReplicationNegativeTest.test_promote_out_of_sync_share_replica[api_with_backend,negative -

raceback (most recent call last):
  File "/opt/stack/new/manila-tempest-plugin/manila_tempest_tests/tests/api/test_replication_negative.py", line 128, in test_promote_out_of_sync_share_replica
    status_attr='replica_state')
  File "/opt/stack/new/manila-tempest-plugin/manila_tempest_tests/services/share/v2/json/shares_client.py", line 1575, in wait_for_share_replica_status
    raise exceptions.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: The replica_state of Replica 1201662c-16f2-40bb-8398-564481afd20d failed to reach out_of_sync status within the required time (180s). Current replica_state: in_sync.

summary: - Tempest test "test_promote_out_of_sync_share_replica" is concurrency-
- prone
+ Tempest tests "test_promote_out_of_sync_share_replica" and
+ "test_resync_share_replica" are concurrency-prone
Jason Grosso (jgrosso) wrote :

do we have a milestone when this issue issue will be addressed?

Jason Grosso (jgrosso) on 2019-08-15
Changed in manila:
assignee: nobody → Douglas Viroel (dviroel)
Jason Grosso (jgrosso) wrote :

Hey Douglas if you get a chance can you update when you think the milestone for this maybe?

Douglas Viroel (dviroel) wrote :

Hi Jason, will depend on the solution that will be adopted here. I can updated after defining all changes that will be needed.

Goutham, I think that the best solution so far is to create a new replica-periodic-update enable/disable command on share replica API. This operation will affect the desired share only.
In the periodic replica update routine we just need to check if it is enabled before issuing the driver for updating the replica state.
Now, do you think that this state (enable/disable replica update) will need to be stored in DB? Or can be reset to default(enabled) if the share-manager restarts?

Goutham Pacha Ravi (gouthamr) wrote :

Douglas,

> Goutham, I think that the best solution so far is to create a new replica-periodic-update enable/disable command on share replica API. This operation will affect the desired share only.
> In the periodic replica update routine we just need to check if it is enabled before issuing the driver for updating the replica state.
> Now, do you think that this state (enable/disable replica update) will need to be stored in DB? Or can be reset to default(enabled) if the share-manager restarts?

Yes - the toggle-able option to enable/disable needs to be persisted in the database; otherwise the behavior may be non-deterministic: 1) Service HA might interfere with the updates - one share manager might continue to update the replica/s, while another would have such updates disabled 2) Replicas are tenant resources, administrators own the service plane - so it would be hard to justify a ephemeral disable operation since the ones controlling the replica might be unaware of service restarts.

This might need some restructuring of the tests, so that enable/disable actually works as intended. I wouldn't shoot to get this bug fix in Train though - we can try to get the fix up and then evaluate.

At present, I am okay with disabling the tests while we ship the train release.

Vida Haririan (vhariria) wrote :

Feedback from Description:Upstream Manila IRC - Meeting started Thu Feb 6 15:01:04 2020 UTC

This case can occur for a minute duration if users query the manila API while database transitions are happening - the bug proposes two ways to mitigate this, we need to find an owner to investigate dviroel: any status on the first one?
gouthamr not yet, the proposed solution will take extra effort

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers