OpenStack Shared File Systems Service (Manila)

Tempest tests "test_promote_out_of_sync_share_replica" and "test_resync_share_replica" are concurrency-prone

Bug #1631314 reported by Valeriy Ponomaryov on 2016-10-07

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Shared File Systems Service (Manila)	Fix Released	Medium	Fábio Oliveira	OpenStack Shared File Systems Service (Manila) yoga-1

Bug Description

Recently, in CI, was observed error in one of replication tests [1].

Where we do following:
1) Create share
2) Create replica
3) Reset state of replica to "out_of_sync"
4) Wait for updated "out_of_sync" status
5) Try to promote replica

And there is possible concurrency between steps 3 and 4 when we can get periodic replica update that occurred in case of observed error [1].

Proof:

Call of API comes first [2], then arrives periodic update [3] and, finally, tempest test tries to get updated status in [4] and sees that it has "in_sync" status as was [1].

[1] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/tempest_2/testr_results.html.gz

[2] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/screen-m-api.txt.gz#_2016-10-07_05_55_42_938

[3] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/screen-m-shr.txt.gz#_2016-10-07_05_55_43_046

[4] http://logs.openstack.org/19/383119/2/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/9386587/logs/screen-m-api.txt.gz#_2016-10-07_05_55_43_175

See original description

Tags:

Valeriy Ponomaryov (vponomaryov) on 2016-10-07

summary:	- Concurrency of periodic replica update call and manual replica status - update + Tempest test "test_promote_out_of_sync_share_replica" is concurrency- + prone
description:	updated
description:	updated
description:	updated
tags:	added: replication tempest

Revision history for this message

Valeriy Ponomaryov (vponomaryov) wrote on 2016-10-11: Re: Tempest test "test_promote_out_of_sync_share_replica" is concurrency-prone

One more occurrence:

http://logs.openstack.org/06/384606/1/check/gate-manila-tempest-minimal-dsvm-dummy-ubuntu-xenial-nv/1d08dc3/logs/tempest_2/testr_results.html.gz

Changed in manila:
importance:	Undecided → High

Goutham Pacha Ravi (gouthamr) on 2016-10-11

Changed in manila:
assignee:	nobody → Goutham Pacha Ravi (gouthamr)
milestone:	none → ocata-1
status:	New → Confirmed

Revision history for this message

Goutham Pacha Ravi (gouthamr) wrote on 2016-10-12:

Okay, after discussing with Valeriy, the approach to get this right is to:

1) Yank out the tests that attempt to reset state in order to test a particular scenario (reset replica_state to out_of_sync and attempt promoting as user, and as admin; reset replica_state to error and attempt to promote; reset replica_state to out_of_sync and attempt re-syncing); or:

2) Allow for such tests to have more meaning by introducing an API to enable/disable periodic replica update on demand, per share.

We can discuss these solutions, but can I put up a patch to sanely turn off these tests for the dummy driver?

Revision history for this message

Aashaka Shah (aashah) wrote on 2018-01-30:

Is this bug still present? If so, I would like to contribute a patch for it. The log files in the references are no longer online. Are they still up somewhere else?

Revision history for this message

Goutham Pacha Ravi (gouthamr) wrote on 2018-03-16:

@Aashaka Shah, Yes. this bug is still present. However, the prevalence hasn't been documented of late.

The behavior here, as Valeriy noted is because there's an update thread in the share manager service that polls for the health of replicas and updates their health status. This thread is not controlled via the API.

When an administrator requests the state to be explicitly changed to 'out_of_sync' (for testing purposes) via the reset-replica-state API, the state change is directly made on the database, however, the update thread may run and change the state back to 'in_sync' (because that's what it truly is). The test currently waits for state 'out_of_sync' and times out.

We need a way for this test to pass deterministically and catch real regressions. We can go with the methods I described in my previous response, or, somehow make the state change persist these sort of asynchronous updates from the update thread.

I'd be glad to help you fix this bug, or provide any details you might need.

Changed in manila:
assignee:	Goutham Pacha Ravi (gouthamr) → nobody
milestone:	ocata-1 → none
importance:	High → Medium

Tom Barron (tpb) on 2018-06-19

tags:

added: races

Revision history for this message

Tom Barron (tpb) wrote on 2019-02-19:

We hit this one today in dummy back end gate test for https://review.openstack.org/#/c/637626/ - manila_tempest_tests.tests.api.test_replication_negative.ReplicationNegativeTest.test_promote_out_of_sync_share_replica[api_with_backend,negative -

raceback (most recent call last):
  File "/opt/stack/new/manila-tempest-plugin/manila_tempest_tests/tests/api/test_replication_negative.py", line 128, in test_promote_out_of_sync_share_replica
    status_attr='replica_state')
  File "/opt/stack/new/manila-tempest-plugin/manila_tempest_tests/services/share/v2/json/shares_client.py", line 1575, in wait_for_share_replica_status
    raise exceptions.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: The replica_state of Replica 1201662c-16f2-40bb-8398-564481afd20d failed to reach out_of_sync status within the required time (180s). Current replica_state: in_sync.

Goutham Pacha Ravi (gouthamr) on 2019-03-12

summary:

- Tempest test "test_promote_out_of_sync_share_replica" is concurrency-
- prone
+ Tempest tests "test_promote_out_of_sync_share_replica" and
+ "test_resync_share_replica" are concurrency-prone

Revision history for this message

Jason Grosso (jgrosso) wrote on 2019-07-15:

do we have a milestone when this issue issue will be addressed?

Jason Grosso (jgrosso) on 2019-08-15

Changed in manila:
assignee:	nobody → Douglas Viroel (dviroel)

Revision history for this message

Jason Grosso (jgrosso) wrote on 2019-08-15:

Hey Douglas if you get a chance can you update when you think the milestone for this maybe?

Revision history for this message

Douglas Viroel (dviroel) wrote on 2019-08-29:

Hi Jason, will depend on the solution that will be adopted here. I can updated after defining all changes that will be needed.

Goutham, I think that the best solution so far is to create a new replica-periodic-update enable/disable command on share replica API. This operation will affect the desired share only.
In the periodic replica update routine we just need to check if it is enabled before issuing the driver for updating the replica state.
Now, do you think that this state (enable/disable replica update) will need to be stored in DB? Or can be reset to default(enabled) if the share-manager restarts?

Revision history for this message

Goutham Pacha Ravi (gouthamr) wrote on 2019-09-13:

Douglas,

> Goutham, I think that the best solution so far is to create a new replica-periodic-update enable/disable command on share replica API. This operation will affect the desired share only.
> In the periodic replica update routine we just need to check if it is enabled before issuing the driver for updating the replica state.
> Now, do you think that this state (enable/disable replica update) will need to be stored in DB? Or can be reset to default(enabled) if the share-manager restarts?

Yes - the toggle-able option to enable/disable needs to be persisted in the database; otherwise the behavior may be non-deterministic: 1) Service HA might interfere with the updates - one share manager might continue to update the replica/s, while another would have such updates disabled 2) Replicas are tenant resources, administrators own the service plane - so it would be hard to justify a ephemeral disable operation since the ones controlling the replica might be unaware of service restarts.

This might need some restructuring of the tests, so that enable/disable actually works as intended. I wouldn't shoot to get this bug fix in Train though - we can try to get the fix up and then evaluate.

At present, I am okay with disabling the tests while we ship the train release.

Revision history for this message

Vida Haririan (vhariria) wrote on 2020-02-06:

#10

Feedback from Description:Upstream Manila IRC - Meeting started Thu Feb 6 15:01:04 2020 UTC

This case can occur for a minute duration if users query the manila API while database transitions are happening - the bug proposes two ways to mitigate this, we need to find an owner to investigate dviroel: any status on the first one?
gouthamr not yet, the proposed solution will take extra effort

Goutham Pacha Ravi (gouthamr) on 2021-10-14

Changed in manila:
assignee:	Douglas Viroel (dviroel) → nobody

Fábio Oliveira (fabiooliveira1) on 2021-10-14

Changed in manila:
assignee:	nobody → Fábio Oliveira (fabiooliveira1)

Vida Haririan (vhariria) on 2021-10-14

Changed in manila:
milestone:	none → yoga-1

Revision history for this message

Vida Haririan (vhariria) wrote on 2021-10-14:

#11

Discussed this bug in detail, see comments here: https://meetings.opendev.org/meetings/manila/2021/manila.2021-10-14-15.01.log.html

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-10-15: Fix proposed to manila-tempest-plugin (master)

#12

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/manila-tempest-plugin/+/814202

Changed in manila:
status:	Confirmed → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-23: Fix merged to manila-tempest-plugin (master)

#13

Reviewed: https://review.opendev.org/c/openstack/manila-tempest-plugin/+/814202
Committed: https://opendev.org/openstack/manila-tempest-plugin/commit/8866127d8b9bc7c2811c9025be8acf2228adb0f8
Submitter: "Zuul (22348)"
Branch: master

commit 8866127d8b9bc7c2811c9025be8acf2228adb0f8
Author: Fabio Oliveira <email address hidden>
Date: Fri Oct 15 10:53:51 2021 -0300

Add unstable tag on some replication tests

The tempest tests `test_promote_out_of_sync_share_replica` and
`test_resync_share_replica` are concurrency-prone.

Theses tests are trying to determine something that shouldn't really
be a problem and may require to over-engineer the APIs to solve it.

Closes-Bug: #1631314
Change-Id: I4519147f2c1e7f4ea060d83b1a34d0a85dca9687

Changed in manila:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-03-18: Fix included in openstack/manila-tempest-plugin 1.7.0

#14

This issue was fixed in the openstack/manila-tempest-plugin 1.7.0 release.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.