Tempest tests "test_promote_out_of_sync_share_replica" and "test_resync_share_replica" are concurrency-prone

Bug #1631314 reported by Valeriy Ponomaryov
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Shared File Systems Service (Manila)
Fix Released
Medium
Fábio Oliveira

Bug Description

Recently, in CI, was observed error in one of replication tests [1].

Where we do following:
1) Create share
2) Create replica
3) Reset state of replica to "out_of_sync"
4) Wait for updated "out_of_sync" status
5) Try to promote replica

And there is possible concurrency between steps 3 and 4 when we can get periodic replica update that occurred in case of observed error [1].

Proof:

Call of API comes first [2], then arrives periodic update [3] and, finally, tempest test tries to get updated status in [4] and sees that it has "in_sync" status as was [1].

[1] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/tempest_2/testr_results.html.gz

[2] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/screen-m-api.txt.gz#_2016-10-07_05_55_42_938

[3] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/screen-m-shr.txt.gz#_2016-10-07_05_55_43_046

[4] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/screen-m-api.txt.gz#_2016-10-07_05_55_43_175

summary: - Concurrency of periodic replica update call and manual replica status
- update
+ Tempest test "test_promote_out_of_sync_share_replica" is concurrency-
+ prone
description: updated
description: updated
description: updated
tags: added: replication tempest
Revision history for this message
Valeriy Ponomaryov (vponomaryov) wrote : Re: Tempest test "test_promote_out_of_sync_share_replica" is concurrency-prone
Changed in manila:
importance: Undecided → High
Changed in manila:
assignee: nobody → Goutham Pacha Ravi (gouthamr)
milestone: none → ocata-1
status: New → Confirmed
Revision history for this message
Goutham Pacha Ravi (gouthamr) wrote :

Okay, after discussing with Valeriy, the approach to get this right is to:

1) Yank out the tests that attempt to reset state in order to test a particular scenario (reset replica_state to out_of_sync and attempt promoting as user, and as admin; reset replica_state to error and attempt to promote; reset replica_state to out_of_sync and attempt re-syncing); or:

2) Allow for such tests to have more meaning by introducing an API to enable/disable periodic replica update on demand, per share.

We can discuss these solutions, but can I put up a patch to sanely turn off these tests for the dummy driver?

Revision history for this message
Aashaka Shah (aashah) wrote :

Is this bug still present? If so, I would like to contribute a patch for it. The log files in the references are no longer online. Are they still up somewhere else?

Revision history for this message
Goutham Pacha Ravi (gouthamr) wrote :

@Aashaka Shah, Yes. this bug is still present. However, the prevalence hasn't been documented of late.

The behavior here, as Valeriy noted is because there's an update thread in the share manager service that polls for the health of replicas and updates their health status. This thread is not controlled via the API.

When an administrator requests the state to be explicitly changed to 'out_of_sync' (for testing purposes) via the reset-replica-state API, the state change is directly made on the database, however, the update thread may run and change the state back to 'in_sync' (because that's what it truly is). The test currently waits for state 'out_of_sync' and times out.

We need a way for this test to pass deterministically and catch real regressions. We can go with the methods I described in my previous response, or, somehow make the state change persist these sort of asynchronous updates from the update thread.

I'd be glad to help you fix this bug, or provide any details you might need.

Changed in manila:
assignee: Goutham Pacha Ravi (gouthamr) → nobody
milestone: ocata-1 → none
importance: High → Medium
Tom Barron (tpb)
tags: added: races
Revision history for this message
Tom Barron (tpb) wrote :

We hit this one today in dummy back end gate test for https://review.openstack.org/#/c/637626/ - manila_tempest_tests.tests.api.test_replication_negative.ReplicationNegativeTest.test_promote_out_of_sync_share_replica[api_with_backend,negative -

raceback (most recent call last):
  File "/opt/stack/new/manila-tempest-plugin/manila_tempest_tests/tests/api/test_replication_negative.py", line 128, in test_promote_out_of_sync_share_replica
    status_attr='replica_state')
  File "/opt/stack/new/manila-tempest-plugin/manila_tempest_tests/services/share/v2/json/shares_client.py", line 1575, in wait_for_share_replica_status
    raise exceptions.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: The replica_state of Replica 1201662c-16f2-40bb-8398-564481afd20d failed to reach out_of_sync status within the required time (180s). Current replica_state: in_sync.

summary: - Tempest test "test_promote_out_of_sync_share_replica" is concurrency-
- prone
+ Tempest tests "test_promote_out_of_sync_share_replica" and
+ "test_resync_share_replica" are concurrency-prone
Revision history for this message
Jason Grosso (jgrosso) wrote :

do we have a milestone when this issue issue will be addressed?

Jason Grosso (jgrosso)
Changed in manila:
assignee: nobody → Douglas Viroel (dviroel)
Revision history for this message
Jason Grosso (jgrosso) wrote :

Hey Douglas if you get a chance can you update when you think the milestone for this maybe?

Revision history for this message
Douglas Viroel (dviroel) wrote :

Hi Jason, will depend on the solution that will be adopted here. I can updated after defining all changes that will be needed.

Goutham, I think that the best solution so far is to create a new replica-periodic-update enable/disable command on share replica API. This operation will affect the desired share only.
In the periodic replica update routine we just need to check if it is enabled before issuing the driver for updating the replica state.
Now, do you think that this state (enable/disable replica update) will need to be stored in DB? Or can be reset to default(enabled) if the share-manager restarts?

Revision history for this message
Goutham Pacha Ravi (gouthamr) wrote :

Douglas,

> Goutham, I think that the best solution so far is to create a new replica-periodic-update enable/disable command on share replica API. This operation will affect the desired share only.
> In the periodic replica update routine we just need to check if it is enabled before issuing the driver for updating the replica state.
> Now, do you think that this state (enable/disable replica update) will need to be stored in DB? Or can be reset to default(enabled) if the share-manager restarts?

Yes - the toggle-able option to enable/disable needs to be persisted in the database; otherwise the behavior may be non-deterministic: 1) Service HA might interfere with the updates - one share manager might continue to update the replica/s, while another would have such updates disabled 2) Replicas are tenant resources, administrators own the service plane - so it would be hard to justify a ephemeral disable operation since the ones controlling the replica might be unaware of service restarts.

This might need some restructuring of the tests, so that enable/disable actually works as intended. I wouldn't shoot to get this bug fix in Train though - we can try to get the fix up and then evaluate.

At present, I am okay with disabling the tests while we ship the train release.

Revision history for this message
Vida Haririan (vhariria) wrote :

Feedback from Description:Upstream Manila IRC - Meeting started Thu Feb 6 15:01:04 2020 UTC

This case can occur for a minute duration if users query the manila API while database transitions are happening - the bug proposes two ways to mitigate this, we need to find an owner to investigate dviroel: any status on the first one?
gouthamr not yet, the proposed solution will take extra effort

Changed in manila:
assignee: Douglas Viroel (dviroel) → nobody
Changed in manila:
assignee: nobody → Fábio Oliveira (fabiooliveira1)
Vida Haririan (vhariria)
Changed in manila:
milestone: none → yoga-1
Revision history for this message
Vida Haririan (vhariria) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to manila-tempest-plugin (master)
Changed in manila:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to manila-tempest-plugin (master)

Reviewed: https://review.opendev.org/c/openstack/manila-tempest-plugin/+/814202
Committed: https://opendev.org/openstack/manila-tempest-plugin/commit/8866127d8b9bc7c2811c9025be8acf2228adb0f8
Submitter: "Zuul (22348)"
Branch: master

commit 8866127d8b9bc7c2811c9025be8acf2228adb0f8
Author: Fabio Oliveira <email address hidden>
Date: Fri Oct 15 10:53:51 2021 -0300

    Add unstable tag on some replication tests

    The tempest tests `test_promote_out_of_sync_share_replica` and
    `test_resync_share_replica` are concurrency-prone.

    Theses tests are trying to determine something that shouldn't really
    be a problem and may require to over-engineer the APIs to solve it.

    Closes-Bug: #1631314
    Change-Id: I4519147f2c1e7f4ea060d83b1a34d0a85dca9687

Changed in manila:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/manila-tempest-plugin 1.7.0

This issue was fixed in the openstack/manila-tempest-plugin 1.7.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.