FAIL: test_two_regions_any_zones_three_replicas (zaza.openstack.charm_tests.swift.tests.SwiftGlobalReplicationTests)

Bug #1882247 reported by Alex Kavanagh
This bug affects 1 person
Affects: OpenStack Swift Proxy Charm
Status: In Progress
Importance: Undecided
Assigned to: Drew Freiberger
Milestone: (none)

Bug Description

The zaza test fails with:

2020-06-04 22:16:24 [INFO] FAIL: test_two_regions_any_zones_three_replicas (zaza.openstack.charm_tests.swift.tests.SwiftGlobalReplicationTests)
2020-06-04 22:16:24 [INFO] Create an object with three replicas across two regions.
2020-06-04 22:16:24 [INFO] ----------------------------------------------------------------------
2020-06-04 22:16:24 [INFO] Traceback (most recent call last):
2020-06-04 22:16:24 [INFO] File "/tmp/tmp.JXiNcmQ92q/func/lib/python3.5/site-packages/zaza/openstack/charm_tests/swift/tests.py", line 231, in test_two_regions_any_zones_three_replicas
2020-06-04 22:16:24 [INFO] 3)
2020-06-04 22:16:24 [INFO] AssertionError: 2 != 3

review: https://review.opendev.org/#/c/732676/
https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_full/openstack/charm-swift-proxy/732676/1/5942/index.html

This might be an unstable test, as it has passed previously.

Drew Freiberger (afreiberger) wrote :

This has passed both through OSCI and in manual testing. I will keep re-running it until it fails often enough to inspect live. Agreed, it seems a brittle test.

The test as written writes an object to region1 and then, once that write is acked, immediately grabs the object header from region2 to determine the number of copies. Unanswered questions are:

1. How does write_affinity work when both zones are listed in the configuration?
2. Does each region report only the number of copies local to that region, or the number of copies across the two regions?
3. I believe Swift has a mechanism that allows the third copy to be made asynchronously if the object server meant to hold it refuses service at write time, which may be the brittleness that we're experiencing. If so, this should be investigated and handled as an allowable exception in the testing code.

Setting write_affinity to include both zones seems inconsistent with the advice of the Global Cluster documentation:

https://docs.openstack.org/swift/pike/overview_global_cluster.html#write-affinity

I'm wondering if using both zones in write_affinity is creating a situation in swift where writes to all 6 zones are not happening at once, and there's a race between requesting the header of the object and the async replication. There should be 3 copies of the data across the two zones immediately, but all 6 copies are not necessarily synchronous with our current config.

I'll keep re-running to try to get this to fail for further inspection, but this issue appears to be in line with Swift's eventual-consistency model, in which writes can be acked after only one write has been completed. If we want the test to better cover the third, asynchronous copy, we could check the swift-storage servers in both regions for the oldest replication completion time (swift-recon -r from each region's proxy), and test the number of replicas only once the time since that completion is less than the time since the object write.
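The wait described above could be sketched roughly as follows. This is only an illustration, not the zaza implementation: `oldest_completion` is a hypothetical callable that would return the epoch timestamp of the oldest completed replication pass across both regions (for example, parsed from `swift-recon -r` output), and the timeout and interval defaults are arbitrary.

```python
import time


def wait_until_replicated(oldest_completion, write_time, timeout=300.0,
                          interval=5.0, clock=time.time, sleep=time.sleep):
    """Wait until the oldest replication pass finished after write_time.

    oldest_completion is a hypothetical callable (an assumption, not a zaza
    or Swift API) returning the epoch timestamp of the oldest completed
    replication pass across both regions.

    Returns True once that timestamp is later than write_time, or False if
    timeout seconds elapse first.
    """
    deadline = clock() + timeout
    while True:
        # A completion time later than the write means the time since the
        # last pass is shorter than the time since the object write.
        if oldest_completion() > write_time:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would record the time just before the upload, call this helper, and assert on the replica count only if it returns True.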

Drew Freiberger (afreiberger) wrote :

Even more interestingly, looking at the two regions in my surviving models that have not been torn down yet, I'm finding that there are only two copies of the objects, one in each datacenter, even though the test passed the count of 3. I'm wondering if object_head only queries the container info about the written object and does not actually probe the object servers to confirm that the object is accessible on that number of servers.

I'm also seeing pending locks on containers in these environments that have had no writes for over 12 hours, which makes me concerned that these tests are not fully verifying what they were assumed to test.
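One way to confirm copies on the object servers themselves, rather than trusting a single HEAD through the proxy, might look like the sketch below. Both arguments are assumptions made for illustration: `nodes` would be the candidate storage nodes for the object (for instance, the primaries listed by `swift-get-nodes`), and `head_object` is a hypothetical callable that sends a HEAD directly to one object server and raises if the object is missing or the server is unreachable.

```python
def count_reachable_replicas(nodes, head_object):
    """Count the storage nodes that answer a direct HEAD for the object.

    nodes: iterable of candidate storage nodes for the object.
    head_object: hypothetical callable (an assumption for this sketch) that
    issues a HEAD to a single object server and raises an exception when
    the object is absent or the server cannot be reached.
    """
    found = 0
    for node in nodes:
        try:
            head_object(node)
        except Exception:
            # Treat any failure as "no usable copy on this node".
            continue
        found += 1
    return found
```

The test assertion would then compare this count with the expected number of replicas instead of a value derived from one response.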

Drew Freiberger (afreiberger) wrote :

The two copies were due to the lack of ordering of the tests: the three-replica test ran first, then the two-replica test. I've now added ordering to the test updates so that the two-replica test runs first, then the three-replica test, which will let us troubleshoot better if this happens again. I'm also adding a check that the swift-storage units in the second region have settled with the updated replica count before the object is written. It is possible that this condition caused the written object to land on one of the region2 servers and then be replicated back off because that server had not yet adopted the replicas=3 ring.
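The ordering problem follows from how Python's unittest loader sorts test methods alphabetically by name, so a name containing "three" sorts before one containing "two". A minimal sketch of forcing the order with numeric prefixes is shown here; the class and method names are hypothetical and are not the names used in zaza-openstack-tests.

```python
import unittest


class UnorderedExample(unittest.TestCase):
    """Hypothetical tests whose names sort three-replica first."""

    def test_two_replicas(self):
        pass

    def test_three_replicas(self):
        pass


class OrderedExample(unittest.TestCase):
    """Hypothetical tests with numeric prefixes to force the run order."""

    def test_901_two_replicas(self):
        pass

    def test_902_three_replicas(self):
        pass


loader = unittest.TestLoader()
# The default loader returns method names in sorted (alphabetical) order.
unordered_names = loader.getTestCaseNames(UnorderedExample)
ordered_names = loader.getTestCaseNames(OrderedExample)
```

With the prefixes in place, the two-replica case is collected before the three-replica case regardless of the words in the rest of the name.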

Drew Freiberger (afreiberger) wrote :

I've submitted an enhancement to the global replication tests to wait for region2 to settle before uploading objects.

https://github.com/openstack-charmers/zaza-openstack-tests/pull/326

Changed in charm-swift-proxy:
assignee: nobody → Drew Freiberger (afreiberger)
status: New → In Progress