periodic master FS 1 fails during tests - tempest.api.object_storage

Bug #1933639 reported by Marios Andreou
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: none

Bug Description

At [1][2][3] the periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master job fails during the tempest run on the following tests:

        * tempest.api.object_storage.test_container_services.ContainerTest
        * tempest.api.object_storage.test_container_services_negative.ContainerNegativeTest
        * tempest.api.object_storage.test_object_services.ObjectTest
        * tempest.api.object_storage.test_object_slo.ObjectSloTest

These failures are common to all three runs, but at [3] tempest.api.object_storage.test_container_sync_middleware and tempest.api.object_storage.test_account_services.AccountTest appear as well.

This is a master promotion blocker.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/a355292/logs/undercloud/var/log/tempest/stestr_results.html.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/be271d7/logs/undercloud/var/log/tempest/stestr_results.html.gz
[3] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/df64a68/logs/undercloud/var/log/tempest/stestr_results.html.gz

Revision history for this message
wes hayutin (weshayutin) wrote :

Please do a little debugging, and add these to the skip list to allow time for further investigation.

Revision history for this message
Marios Andreou (marios-b) wrote :

There are a lot of failing tests here; I'm not sure we want to just skip them. We need someone from storage to check this first.

I am prepping the skip, but I don't think we want to merge it yet.

Revision history for this message
Marios Andreou (marios-b) wrote :

Posted the skip list at https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/798119, but see comment #2 above; I don't think we want to merge it yet.

Revision history for this message
Ronelle Landy (rlandy) wrote :

https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-17/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset035-internal-rhos-17/edee49e/logs/undercloud/var/log/tempest/stestr_results.html.gz

Looks like we are picking up the same set of failures in featureset035 OVB running on rhos-17 - and then some.

https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/798119 currently has a W-1 (workflow -1) from Marios. If we are merging it, we will need to add periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset035-internal-rhos-17 and the wallaby release as well.

Revision history for this message
Marios Andreou (marios-b) wrote :

This continued consistently over the weekend at [1][2][3].

The failing tests vary slightly between runs, but at least in these three the common ones are:

        * tempest.api.object_storage.test_account_services.AccountTest
        * tempest.api.object_storage.test_container_services.ContainerTest
        * tempest.api.object_storage.test_object_services.ObjectTest

The skip patch in its present form [4] won't completely unblock us, as there are more tests that need adding (plus, per rlandy's comment #4, we would need to add the downstream jobs too). I think we really need someone from storage to dig into this. Skipping is only going to mask, or at best lessen the impact of, what seems to be a legitimate problem.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/ac28b2b/logs/undercloud/var/log/tempest/stestr_results.html.gz

[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/c5250fa/logs/undercloud/var/log/tempest/stestr_results.html.gz

[3] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/37e4090/logs/undercloud/var/log/tempest/stestr_results.html.gz

[4] https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/798119/1/roles/validate-tempest/vars/tempest_skip.yml

Revision history for this message
Marios Andreou (marios-b) wrote :
Revision history for this message
Christian Schwede (cschwede) wrote :

/var/log/messages is missing from the job output - any chance of getting it so I can check the Swift services? Or access to a reproducer with failing jobs? Unfortunately the rsyslog config for Swift [3][4] seems to be broken again.

The recent patch [1] should be no problem - it only disables rsync when there is a single replica; however, we do have 3 replicas on the controllers, and the rsync container is running as expected (and even if it weren't, that should not result in errors like the ones seen here).

Looking at [2] there was also at least one successful pass after merging that patch.

[1] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/796807
[2] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master
[3] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/574221/
[4] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/750487/

Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
Marios Andreou (marios-b) wrote :

@Christian: see the links to the journal logs from weshay above ^^^. Per my other comment on Trello just now, I posted a testproject at https://review.rdoproject.org/r/c/testproject/+/34321; please ping me with your public key so you can get access once we hold the node.

Revision history for this message
Christian Schwede (cschwede) wrote :

Thanks Wes & Marios!

I checked all the messages from all Swift services; there is no error that would explain the Tempest failures, and all required services are running. In fact there are no errors at all (except a permission-denied error for /var/cache/swift/object.recon, but that does not affect the API operations and the exception is caught & logged).
From all the log entries, Swift looks like it is doing what it is supposed to do.

Now I'm wondering if this is actually a race condition in Tempest, maybe due to high load/slow IO. I'll continue debugging this.

Revision history for this message
wes hayutin (weshayutin) wrote :

Hey Christian...

I want to summarize the findings:

* This is a consistent error in featureset001 on upstream master.
* It looks like Ronelle found the same failures in OSP-17 on featureset035.

featureset001 and featureset035 are similar; however, featureset035 has SSL and IPv6 enabled.

In the same build where we see this failing on featureset001:

https://trunk.rdoproject.org/api-centos8-master-uc/api/civotes_agg_detail.html?ref_hash=6439b21a91a11b464ad5b2cc147e81cd

fs001 - failing
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/37e4090/logs/undercloud/var/log/tempest/stestr_results.html.gz

fs035 - passing
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/6eb1297/logs/undercloud/var/log/tempest/stestr_results.html.gz

standalone - full-api - passing
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-standalone-full-tempest-api-master/da56b3e/logs/undercloud/var/log/tempest/stestr_results.html.gz

We are going to skip object_storage tests in featureset001 and CIX the issue to give you time to poke at this.

https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/798709/1/roles/validate-tempest/vars/tempest_skip.yml

Since the results are not consistent, it may be tough to determine the root cause. We will also be looking at our configs. Coming to a CIX call to help determine the issue would be helpful.

We are attempting a master promotion w/ object_storage tests running on fs001 / master

Revision history for this message
Marios Andreou (marios-b) wrote :

@Christian thanks for digging in - I'm still trying to get hold of the node running at https://review.rdoproject.org/r/c/testproject/+/34321 for debugging. It hit NODE_FAILURE yesterday (https://bugs.launchpad.net/tripleo/+bug/1931226/comments/7), so I just tried a recheck; I will let you know if/once we get it. Thanks.

As weshay noted in comment #11, we are now skipping tempest.api.object_storage.* (https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/798709) so we can unblock master; however, we need to identify and resolve the underlying issue and then restore those tests ASAP.
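
For reference, here is a minimal sketch of how one could sanity-check which failing tests a blanket pattern like tempest.api.object_storage.* covers. The regex and the list of test IDs below are illustrative only, not the exact skiplist entries:

    import re

    # Illustrative blanket skip, mirroring the tempest.api.object_storage.* entry mentioned above.
    skip_pattern = re.compile(r'^tempest\.api\.object_storage\.')

    # The first three IDs come from the failure lists in this bug; the last one
    # is an unrelated test added only for contrast.
    failing_tests = [
        'tempest.api.object_storage.test_container_services.ContainerTest',
        'tempest.api.object_storage.test_object_services.ObjectTest',
        'tempest.api.object_storage.test_account_services.AccountTest',
        'tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON',
    ]

    for test in failing_tests:
        state = 'skipped' if skip_pattern.match(test) else 'still runs'
        print(f'{state:10} {test}')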

Revision history for this message
Bhagyashri Shewale (bhagyashri-shewale) wrote :
Revision history for this message
Marios Andreou (marios-b) wrote :

Commenting on a request from IRC just now regarding the status.

@Bhagyashris: as noted in the comments above (see comment #12 for the latest), Christian has looked at this and didn't identify an issue from the logs. He is waiting for us to provide a node he can debug on.

I tried to get one with the testproject at https://review.rdoproject.org/r/c/testproject/+/34321 and asked jpena to put a hold on it for us; however, we were hitting all sorts of NODE_FAILURE issues last week (https://bugs.launchpad.net/tripleo/+bug/1931226/comments/7). It looks like the hold didn't take effect on testproject/+/34321.

So can you please post a similar patch (or feel free to re-use 34321) and ask one of the RDO folks (jpena, ykarel, and amoralej usually help us with this) to put a hold on it? The hold only works if the job fails, but the issue here is consistent, so that should not be a problem.

Hope it helps

Revision history for this message
Christian Schwede (cschwede) wrote (last edit ):

Did some further analysis - it looks as if there may be issues with slow disks. The failing tests differ quite a bit between runs, and most of them are related to container DB updates (or missing ones). Failure counts per test:

      6 test_copy_object_in_same_container
      5 test_update_object_metadata_with_x_object_manifest
      4 test_versioned_container
      4 test_update_container_metadata_with_delete_metadata_key
      4 test_list_no_account_metadata
      4 test_list_large_object_metadata
      4 test_create_object_with_x_remove_object_meta
      4 test_create_container
      3 test_update_account_metadata_with_create_and_delete_metadata
      3 test_list_all_container_objects_on_deleted_container
      3 test_copy_object_to_itself
      3 test_copy_object_2d_way
      2 test_delete_non_empty_container
      2 test_delete_large_object
      2 test_copy_object_across_containers
      1 test_web_listing_css
      1 test_upload_valid_object
      1 test_upload_large_object
      1 test_update_object_metadata
      1 test_update_container_metadata_with_delete_metadata
      1 test_update_account_metadata_with_delete_metadata_key
      1 test_retrieve_large_object
      1 test_rebuild_server_with_volume_attached
      1 test_get_object_at_expiry_time
      1 test_get_object_after_expiry_time
      1 test_delete_container
      1 test_create_container_with_remove_metadata_value
      1 test_create_container_with_remove_metadata_key

Still no Swift errors found in the logs. An accessible reproducer would still be very helpful.
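
For anyone repeating this kind of tally, here is a minimal sketch of how such counts could be produced, assuming the failing test IDs from the stestr results have already been extracted into plain-text files, one test ID per line (the file names below are hypothetical):

    from collections import Counter
    from pathlib import Path

    # Hypothetical inputs: one file per failed run, each line a full test ID such as
    # tempest.api.object_storage.test_object_services.ObjectTest.test_copy_object_in_same_container
    run_files = ['run1_failures.txt', 'run2_failures.txt', 'run3_failures.txt']

    counts = Counter()
    for path in run_files:
        for line in Path(path).read_text().splitlines():
            test_id = line.strip()
            if test_id:
                # Tally by the bare test method name, as in the list above.
                counts[test_id.split('.')[-1]] += 1

    for name, count in counts.most_common():
        print(f'{count:7} {name}')
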

Revision history for this message
chandan kumar (chkumar246) wrote :

Hey @Christian, I have updated the testproject https://review.rdoproject.org/r/c/testproject/+/34321 by reverting the skipped object storage tests: https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/799582

I have a hold on the job node so that we can debug it.

Revision history for this message
Christian Schwede (cschwede) wrote (last edit ):

Thanks Chandan & Marios for your help!

What I noticed is that Tempest sometimes runs into issues with eventual consistency, due to slow disk writes on some of the nodes. This usually happens when deleting a container that is not yet empty: Tempest sends object delete requests before deleting the container, but these are not finished on all nodes if some of them are still waiting to write the last update to disk.

This is expected behavior given eventual consistency. The question is why this worked in the past and is now sometimes failing (the last 3 successful runs are from today [1], with no errors).

As mentioned above, a few requests are sometimes quite slow, and these are the ones that eventually fail. In the last failure I looked into, object-server deletes completed within 18 ms on average, but the failed tests happened when some writes took 1.65, 1.19 and 0.55 seconds and eventual consistency kicked in. So something was slowing down disk writes quite a bit.

I'm wondering if anything changed recently in the test environment itself: slower nodes and/or disks, network issues, etc.? A newer Tempest version (I noticed a change that might be an issue, but it doesn't explain all the failed tests [2])?

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/6fa7419/logs/undercloud/var/log/tempest/stestr_results.html.gz
[2] https://review.opendev.org/c/openstack/tempest/+/774428
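
To make the failure mode above concrete, here is a minimal sketch (not Tempest code; the storage URL, token and names are placeholders) of the delete-objects-then-delete-container sequence against the Swift API, with a short retry on 409 Conflict. Under eventual consistency, a container DELETE issued right after the object DELETEs can return 409 ("not empty") until every replica has recorded the object removals, which matches the runs where some writes took over a second:

    import time
    import requests

    # Placeholder values; in a real run these come from Keystone and the service catalog.
    storage_url = 'http://swift.example.com:8080/v1/AUTH_test'
    headers = {'X-Auth-Token': 'PLACEHOLDER_TOKEN'}
    container = 'tempest-demo'
    objects = ['obj1', 'obj2']

    # Delete the objects first, as Tempest does before removing the container.
    for name in objects:
        requests.delete(f'{storage_url}/{container}/{name}', headers=headers)

    # Then delete the container, retrying briefly if the object deletions have
    # not yet been applied on every replica (HTTP 409 Conflict).
    for attempt in range(5):
        resp = requests.delete(f'{storage_url}/{container}', headers=headers)
        if resp.status_code != 409:
            break
        time.sleep(0.5 * 2 ** attempt)  # back off: 0.5s, 1s, 2s, ...

    print('final container DELETE status:', resp.status_code)

Whether a retry like this belongs in the tests, or the environment should simply not be this slow, is exactly the open question here; the sketch only shows where the 409 comes from.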

Revision history for this message
wes hayutin (weshayutin) wrote :
Changed in tripleo:
status: Triaged → Fix Released