periodic master FS 1 fails during tests - tempest.api.object_storage

Bug #1933639 reported by Marios Andreou
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: none

Bug Description

At [1][2][3] the periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master job fails during the tempest run on the following tests:

        * tempest.api.object_storage.test_container_services.ContainerTest
        * tempest.api.object_storage.test_container_services_negative.ContainerNegativeTest
        * tempest.api.object_storage.test_object_services.ObjectTest
        * tempest.api.object_storage.test_object_slo.ObjectSloTest

These failures are common to all three runs, but at [3] tempest.api.object_storage.test_container_sync_middleware and tempest.api.object_storage.test_account_services.AccountTest appear as well.

This is a master promotion blocker.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/a355292/logs/undercloud/var/log/tempest/stestr_results.html.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/be271d7/logs/undercloud/var/log/tempest/stestr_results.html.gz
[3] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/df64a68/logs/undercloud/var/log/tempest/stestr_results.html.gz

Revision history for this message
wes hayutin (weshayutin) wrote :

Please do a little debugging, and add these to the skip list to allow time for further investigation.

Revision history for this message
Marios Andreou (marios-b) wrote :

There are a lot of failing tests here; I'm not sure we want to just skip them. We need someone from storage to check this first.

I am prepping the skip, but I don't think we want to merge it yet.

Revision history for this message
Marios Andreou (marios-b) wrote :

Posted the skip list at https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/798119, but see comment #2 above; I don't think we want to merge it yet.

Revision history for this message
Ronelle Landy (rlandy) wrote :

https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-periodic-integration-rhos-17/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset035-internal-rhos-17/edee49e/logs/undercloud/var/log/tempest/stestr_results.html.gz

Looks like we are picking up the same set of failures in featureset035 OVB running on rhos-17 - and then some.

https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/798119 currently has a W-1 (workflow -1) from Marios. If we are merging it, we will need to add periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset035-internal-rhos-17 and the wallaby release as well.

Revision history for this message
Marios Andreou (marios-b) wrote :

This continued consistently over the weekend at [1][2][3].

The failing tests vary slightly between runs, but at least in these three the common ones are:

        * tempest.api.object_storage.test_account_services.AccountTest
        * tempest.api.object_storage.test_container_services.ContainerTest
        * tempest.api.object_storage.test_object_services.ObjectTest

The skip patch in its present form [4] won't completely unblock us, as there are more tests that need adding (plus, per rlandy's comment #4, we would need to add the downstream jobs too). I think we really need someone from storage to dig into this. Skipping is only going to mask, or at best lessen the impact of, what seems to be a legitimate problem.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/ac28b2b/logs/undercloud/var/log/tempest/stestr_results.html.gz

[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/c5250fa/logs/undercloud/var/log/tempest/stestr_results.html.gz

[3] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/37e4090/logs/undercloud/var/log/tempest/stestr_results.html.gz

[4] https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/798119/1/roles/validate-tempest/vars/tempest_skip.yml

Revision history for this message
Marios Andreou (marios-b) wrote :
Revision history for this message
Christian Schwede (cschwede) wrote :

/var/log/messages is missing from the job output - any chance of getting it so I can check the Swift services? Or access to a reproducer with failing jobs? Unfortunately the rsyslog config for Swift [3][4] seems to be broken again.

The recent patch [1] should be no problem - it only disables rsync when there is a single replica; however, we do have 3 replicas on the controllers, and the rsync container is running as expected (and even if it weren't, that should not result in errors like the ones seen here).

Looking at [2] there was also at least one successful pass after merging that patch.

[1] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/796807
[2] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master
[3] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/574221/
[4] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/750487/

Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
Marios Andreou (marios-b) wrote :

@Christian: see the links to the journal logs from weshay above ^^^. Per my other comment on Trello just now, I posted a testproject at https://review.rdoproject.org/r/c/testproject/+/34321; please ping me with your public key so you can get access once we hold the node.

Revision history for this message
Christian Schwede (cschwede) wrote :

Thanks Wes & Marios!

I checked all the messages from all Swift services; there is no error that would explain the Tempest failures, and all required services are running. In fact there are no errors at all (except a permission-denied error for /var/cache/swift/object.recon, but that does not affect the API operations and the exception is caught & logged).
From all the log entries, Swift looks like it is doing what it is supposed to do.

Now I'm wondering if this is actually a race condition in Tempest, maybe due to high load/slow IO. I'll continue debugging this.

Revision history for this message
wes hayutin (weshayutin) wrote :

Hey Christian...

I want to summarize the findings:

* This is a consistent error in featureset001 on upstream master.
* It looks like Ronelle found the same failures in OSP-17 on featureset035.

featureset001 and featureset035 are similar; however, featureset035 has SSL and IPv6 enabled.

In the same build where we see this failing on featureset001:

https://trunk.rdoproject.org/api-centos8-master-uc/api/civotes_agg_detail.html?ref_hash=6439b21a91a11b464ad5b2cc147e81cd

fs001 - failing
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/37e4090/logs/undercloud/var/log/tempest/stestr_results.html.gz

fs035 - passing
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/6eb1297/logs/undercloud/var/log/tempest/stestr_results.html.gz

standalone - full-api - passing
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-standalone-full-tempest-api-master/da56b3e/logs/undercloud/var/log/tempest/stestr_results.html.gz

We are going to skip object_storage tests in featureset001 and CIX the issue to give you time to poke at this.

https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/798709/1/roles/validate-tempest/vars/tempest_skip.yml

Since the results are not consistent, it may be tough to determine the root cause. We will also be looking at our configs. Coming to a CIX call to help determine the issue would be helpful.

We are attempting a master promotion w/ object_storage tests running on fs001 / master

Revision history for this message
Marios Andreou (marios-b) wrote :

@Christian thanks for digging in - I'm still trying to get hold of the node running at https://review.rdoproject.org/r/c/testproject/+/34321 for debugging. It hit NODE_FAILURE yesterday (https://bugs.launchpad.net/tripleo/+bug/1931226/comments/7), so I just tried a recheck; I will let you know if/once we get it. Thanks.

As weshay noted in comment #11, we are now skipping tempest.api.object_storage.* (https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/798709) so we can unblock master; however, we need to identify and resolve the underlying issue and then restore those tests ASAP.
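
For reference, here is a minimal sketch of how one could sanity-check which failing tests a blanket pattern like tempest.api.object_storage.* covers. The regex and the list of test IDs below are illustrative only, not the exact skiplist entries:

    import re

    # Illustrative blanket skip, mirroring the tempest.api.object_storage.* entry mentioned above.
    skip_pattern = re.compile(r'^tempest\.api\.object_storage\.')

    # The first three IDs come from the failure lists in this bug; the last one
    # is an unrelated test added only for contrast.
    failing_tests = [
        'tempest.api.object_storage.test_container_services.ContainerTest',
        'tempest.api.object_storage.test_object_services.ObjectTest',
        'tempest.api.object_storage.test_account_services.AccountTest',
        'tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON',
    ]

    for test in failing_tests:
        state = 'skipped' if skip_pattern.match(test) else 'still runs'
        print(f'{state:10} {test}')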

Revision history for this message
Bhagyashri Shewale (bhagyashri-shewale) wrote :
Revision history for this message
Marios Andreou (marios-b) wrote :

Commenting on a request from IRC just now regarding the status.

@Bhagyashris: as noted in the comments above (see comment #12 for the latest), Christian has looked at this and didn't identify an issue from the logs. He is waiting for us to provide a node he can debug on.

I tried to get one with the testproject at https://review.rdoproject.org/r/c/testproject/+/34321 and asked jpena to put a hold on it for us; however, we were hitting all sorts of NODE_FAILURE issues last week (https://bugs.launchpad.net/tripleo/+bug/1931226/comments/7). It looks like the hold didn't take effect on testproject/+/34321.

So can you please post a similar patch (or feel free to re-use 34321) and ask one of the RDO folks (jpena, ykarel, and amoralej usually help us with this) to put a hold on it? The hold only works if the job fails, but the issue here is consistent, so that should not be a problem.

Hope it helps

Revision history for this message
Christian Schwede (cschwede) wrote (last edit ):

Did some further analysis - it looks as if there may be issues with slow disks. The failing tests differ quite a bit between runs, and most of them are related to container DB updates (or missing ones). Failure counts per test:

      6 test_copy_object_in_same_container
      5 test_update_object_metadata_with_x_object_manifest
      4 test_versioned_container
      4 test_update_container_metadata_with_delete_metadata_key
      4 test_list_no_account_metadata
      4 test_list_large_object_metadata
      4 test_create_object_with_x_remove_object_meta
      4 test_create_container
      3 test_update_account_metadata_with_create_and_delete_metadata
      3 test_list_all_container_objects_on_deleted_container
      3 test_copy_object_to_itself
      3 test_copy_object_2d_way
      2 test_delete_non_empty_container
      2 test_delete_large_object
      2 test_copy_object_across_containers
      1 test_web_listing_css
      1 test_upload_valid_object
      1 test_upload_large_object
      1 test_update_object_metadata
      1 test_update_container_metadata_with_delete_metadata
      1 test_update_account_metadata_with_delete_metadata_key
      1 test_retrieve_large_object
      1 test_rebuild_server_with_volume_attached
      1 test_get_object_at_expiry_time
      1 test_get_object_after_expiry_time
      1 test_delete_container
      1 test_create_container_with_remove_metadata_value
      1 test_create_container_with_remove_metadata_key

Still no Swift errors found in the logs. An accessible reproducer would still be very helpful.
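
For anyone repeating this kind of tally, here is a minimal sketch of how such counts could be produced, assuming the failing test IDs from the stestr results have already been extracted into plain-text files, one test ID per line (the file names below are hypothetical):

    from collections import Counter
    from pathlib import Path

    # Hypothetical inputs: one file per failed run, each line a full test ID such as
    # tempest.api.object_storage.test_object_services.ObjectTest.test_copy_object_in_same_container
    run_files = ['run1_failures.txt', 'run2_failures.txt', 'run3_failures.txt']

    counts = Counter()
    for path in run_files:
        for line in Path(path).read_text().splitlines():
            test_id = line.strip()
            if test_id:
                # Tally by the bare test method name, as in the list above.
                counts[test_id.split('.')[-1]] += 1

    for name, count in counts.most_common():
        print(f'{count:7} {name}')
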

Revision history for this message
chandan kumar (chkumar246) wrote :

Hey @Christian, I have updated the testproject https://review.rdoproject.org/r/c/testproject/+/34321 by reverting the skipped object storage tests: https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/799582

I have a hold on the job node so that we can debug it.

Revision history for this message
Christian Schwede (cschwede) wrote (last edit ):

Thanks Chandan & Marios for your help!

What I noticed is that Tempest sometimes runs into issues with eventual consistency, due to slow disk writes on some of the nodes. This usually happens when deleting a container that is not yet empty: Tempest sends object delete requests before deleting the container, but these are not finished on all nodes if some of them are still waiting to write the last update to disk.

This is expected behavior given eventual consistency. The question is why this worked in the past and is now sometimes failing (the last 3 successful runs are from today [1], with no errors).

As mentioned above, a few requests are sometimes quite slow, and these are the ones that eventually fail. In the last failure I looked into, object-server deletes completed within 18 ms on average, but the failed tests happened when some writes took 1.65, 1.19 and 0.55 seconds and eventual consistency kicked in. So something was slowing down disk writes quite a bit.

I'm wondering if anything changed recently in the test environment itself: slower nodes and/or disks, network issues, etc.? A newer Tempest version (I noticed a change that might be an issue, but it doesn't explain all the failed tests [2])?

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/6fa7419/logs/undercloud/var/log/tempest/stestr_results.html.gz
[2] https://review.opendev.org/c/openstack/tempest/+/774428
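
To make the failure mode above concrete, here is a minimal sketch (not Tempest code; the storage URL, token and names are placeholders) of the delete-objects-then-delete-container sequence against the Swift API, with a short retry on 409 Conflict. Under eventual consistency, a container DELETE issued right after the object DELETEs can return 409 ("not empty") until every replica has recorded the object removals, which matches the runs where some writes took over a second:

    import time
    import requests

    # Placeholder values; in a real run these come from Keystone and the service catalog.
    storage_url = 'http://swift.example.com:8080/v1/AUTH_test'
    headers = {'X-Auth-Token': 'PLACEHOLDER_TOKEN'}
    container = 'tempest-demo'
    objects = ['obj1', 'obj2']

    # Delete the objects first, as Tempest does before removing the container.
    for name in objects:
        requests.delete(f'{storage_url}/{container}/{name}', headers=headers)

    # Then delete the container, retrying briefly if the object deletions have
    # not yet been applied on every replica (HTTP 409 Conflict).
    for attempt in range(5):
        resp = requests.delete(f'{storage_url}/{container}', headers=headers)
        if resp.status_code != 409:
            break
        time.sleep(0.5 * 2 ** attempt)  # back off: 0.5s, 1s, 2s, ...

    print('final container DELETE status:', resp.status_code)

Whether a retry like this belongs in the tests, or the environment should simply not be this slow, is exactly the open question here; the sketch only shows where the 409 comes from.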

Revision history for this message
wes hayutin (weshayutin) wrote :
Changed in tripleo:
status: Triaged → Fix Released