Tempest tests failed with the error: 401 Unauthorized error

Bug #1621798 reported by Sofiia Andriichenko
This bug affects 3 people
Affects: Mirantis OpenStack (status tracked in 10.0.x)
  10.0.x: Confirmed, Importance: Medium, Assigned to: Unassigned
  9.x: Confirmed, Importance: Medium, Assigned to: Unassigned

Bug Description

A lot of Tempest tests (Ironic, Nova) fail on CI 9.1 snapshot #237.
The failures cannot be reproduced manually.

Failed tests:
Tempest (Ceph,SSL,DVR,Sahara,Ceilometer)- 279 Failed
Tempest (Ceph,DVR,Sahara,Ceilometer) - 74 Failed
Tempest (SSL,LVM,Cinder,DVR,Sahara,Ceilometer,Ironic) - 144 Failed

Configurations:
Tempest-9.0_LVM_Cinder_DVR_Sahara_Ceilometer_Ironic_SSL
Tempest-9.0_Ceph_DVR_Sahara_Ceilometer_Ironic
Tempest-9.0_Ceph_DVR_Sahara_Ceilometer_Ironic_SSL
Tempest-9.0_Detached_RabbitMQ

Jenkins Tempest:
http://cz7776.bud.mirantis.net:8080/jenkins/view/Tempest_9.%D0%A5/

For Tempest-9.0_LVM_Cinder_DVR_Sahara_Ceilometer_Ironic_SSL:
151 failures, report with failed tests http://cz7776.bud.mirantis.net:8080/jenkins/view/Tempest_9.%D0%A5/job/9.x_Tempest_LVM_SSL/lastSuccessfulBuild/artifact/report.xml

summary: - A lot of tempest tests are failed on CI
+ More than20% of tempest tests triggers to an Error on CI
Changed in mos:
status: New → Confirmed
importance: Undecided → High
summary: - More than20% of tempest tests triggers to an Error on CI
+ More than15% of tempest tests triggers to an Error on CI
description: updated
description: updated
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote : Re: More than15% of tempest tests triggers to an Error on CI

What is the root cause of the issue?

Revision history for this message
Alexander Gubanov (ogubanov) wrote :

All failed tests are related to Keystone (the same traceback for all tests):

...
raise exceptions.Unauthorized(resp_body, resp=resp)
tempest.lib.exceptions.Unauthorized: Unauthorized
Details: {u'message': u'The request you have made requires authentication.', u'code': 401, u'title': u'Unauthorized'}

For example: https://mirantis.testrail.com/index.php?/tests/view/15365287&group_by=cases:section_id&group_order=asc&group_id=49557
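
For context, that body is Keystone's generic response to any request it considers unauthenticated; the same message can be seen by calling a v3 endpoint without a token. A minimal illustration with the requests library (the endpoint URL is a placeholder, not taken from this environment):

import requests

# Placeholder Keystone public endpoint; substitute the real one.
KEYSTONE_URL = "http://192.0.2.10:5000/v3"

# No X-Auth-Token header is sent, so Keystone rejects the request with 401.
resp = requests.get(KEYSTONE_URL + "/projects")
print(resp.status_code)  # 401
print(resp.json())       # {'error': {'code': 401, 'message': 'The request you have made requires authentication.', 'title': 'Unauthorized'}}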

Revision history for this message
Alexander Gubanov (ogubanov) wrote :
Changed in mos:
assignee: Alexander Gubanov (ogubanov) → MOS QA Team (mos-qa)
Changed in mos:
importance: High → Critical
assignee: MOS QA Team (mos-qa) → MOS Keystone (mos-keystone)
tags: added: area-keystone
Revision history for this message
Alexander Makarov (amakarov) wrote :

A 401 response means "not authorized"; it is not necessarily "an error in Keystone".

Revision history for this message
Alexander Makarov (amakarov) wrote :

No errors in the Keystone logs. I suggest checking the haproxy and Apache logs.

Changed in mos:
assignee: MOS Keystone (mos-keystone) → MOS QA Team (mos-qa)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The QA team is working on debugging the issue; it looks like the root cause is in the QA scripts / Rally or the Tempest configuration.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

We can see the following on the test environment when tests failed:

root@node-2:~# haproxy-status.sh | grep DOWN
object-storage node-1 Status: DOWN/L4TOUT Sessions: 0 Rate: 0
object-storage node-3 Status: DOWN/L4TOUT Sessions: 0 Rate: 0

root@node-3:~# ceph status
    cluster 3306b9f1-7582-4079-b1e8-18de068de956
     health HEALTH_WARN
            too many PGs per OSD (352 > max 300)
     monmap e3: 3 mons at {node-1=10.109.7.3:6789/0,node-2=10.109.7.2:6789/0,node-3=10.109.7.1:6789/0}
            election epoch 18, quorum 0,1,2 node-3,node-2,node-1
     osdmap e246: 4 osds: 4 up, 4 in
      pgmap v4960: 704 pgs, 10 pools, 172 MB data, 98 objects
            8709 MB used, 188 GB / 197 GB avail
                 704 active+clean

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The issue did not reproduce on my custom environment with 3 controllers, 2 compute nodes, Ceph, and Neutron VLAN. I used a virtual lab with 4 GB for each VM.

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Boris Bobrov (bbobrov) wrote :

We are still investigating it from the Keystone side together with Oleksii Butenko.

Revision history for this message
Boris Bobrov (bbobrov) wrote :

Right now I am checking whether this bug happens due to cache issues by enabling and disabling the cache.

Revision history for this message
Boris Bobrov (bbobrov) wrote :

There is a fix for this upstream: https://review.openstack.org/#/c/369618/ . I am now waiting for reviews from other people, and if their reviews are positive I will pull it into the downstream branches.

I am also marking it as Fix Committed in 10.0 because it was merged in upstream Newton.

Changed in mos:
assignee: MOS QA Team (mos-qa) → Boris Bobrov (bbobrov)
Revision history for this message
Boris Bobrov (bbobrov) wrote :

Please disregard the comment ^, it was supposed to go to https://bugs.launchpad.net/mos/+bug/1582220 :(

no longer affects: mos/10.0.x
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

We have changed the debugging strategy; the root cause of the issue is unknown at the moment.

I will update the status in the next 6 hours.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The update:

We have manually deployed 4 different OpenStack configurations (the same configurations as we used before on CI) and manually started the Tempest tests on them. One configuration failed the OSTF check [2]; the 3 other configurations were deployed correctly and the Tempest test suite passed successfully for them (results in TestRail for MOS 9.1 snapshot #266: [3][4][5]).

Short Summary:
One of the default OpenStack configurations failed the OSTF tests [2], 2 other default configurations passed the Tempest test suite (overall pass rate 97%), and the execution of Tempest on the remaining configuration is still in progress.

We need to investigate the failed deployment [2] and then restart the Tempest tests for this configuration to make sure that we have a high pass rate for all required OpenStack configurations. We also need to wait until Tempest has been executed on all default configurations to make sure we don't have any other issues.

I suggest we do not declare HCF until issue [2] has been investigated and fixed, to make sure we don't have any important regressions.

Note: the QA team is working on debugging and fixing the Tempest CI jobs to avoid any false-negative failures in the future. We will run the Tempest test suite and upload the results to TestRail manually every day until the CI jobs are fixed.

[1] https://bugs.launchpad.net/mos/9.x/+bug/1621798
[2] https://bugs.launchpad.net/fuel/+bug/1623934
[3] https://mirantis.testrail.com/index.php?/runs/view/20755&group_by=cases:section_id&group_order=asc (97% pass rate)
[4] https://mirantis.testrail.com/index.php?/runs/view/20754&group_by=cases:section_id&group_order=asc (99% pass rate)
[5] https://mirantis.testrail.com/index.php?/runs/view/20756&group_by=cases:section_id&group_order=asc (94% pass rate)

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Today we have manually executed Tempest on 4 different OpenStack configurations and the results are good [1][2][3][4] (overall pass rate is 98.5%). The CI jobs are still broken for the moment and we are working on fixes.

We also applied several fixes to Tempest and the QA scripts to fix the issue on CI:
1) We removed the MongoDB database from the controller nodes.
2) We added default configurations with 1 controller to check whether the issue reproduces on an environment with a single controller.
3) We added a custom change to Tempest [5] to get detailed information about the errors with a 401 status code (the root cause of the Tempest results degradation). We expect that this change will help us understand what is going on with Keystone during the tests.
4) We removed parallel execution of Tempest tests from the CI jobs and now run all tests in a single thread to avoid any race condition issues.
5) We added new servers to run different deployments / tests with different fixes in parallel.

Today we also merged a fix to Keystone [6] which addresses another Tempest-related issue [7], and we expect that it can also help with the low Tempest pass rate.

Short Summary:
We still have an issue with Tempest on CI. The root cause is still unknown, but the issue that was fixed today in Keystone can probably affect it; we are waiting for the nightly jobs to check this.
We can't reproduce it manually, which is why we think it is not a blocker for the product: the OpenStack components work fine, and we just need to finish debugging the issue to make sure that we don't have any hard-to-reproduce critical defects in the product.

We will continue to run the Tempest test suite manually, even on weekends, until we fix the issue with the CI jobs.

[1] https://mirantis.testrail.com/index.php?/runs/view/20833&group_by=cases:section_id&group_order=asc (pass rate: 99%)
[2] https://mirantis.testrail.com/index.php?/runs/view/20835&group_by=cases:section_id&group_order=asc (pass rate: 99%)
[3] https://mirantis.testrail.com/index.php?/runs/view/20870&group_by=cases:section_id&group_order=asc (pass rate: 99%)
[4] https://mirantis.testrail.com/index.php?/runs/view/20874&group_by=cases:section_id&group_order=asc (pass rate: 98%)
[5] https://review.openstack.org/#/c/371418/
[6] https://review.fuel-infra.org/#/c/26523/
[7] https://bugs.launchpad.net/mos/+bug/1582220

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

I've reproduced the issue by manually executing the following script:

http://paste.openstack.org/show/580294/

On a lab with 1 OpenStack controller node everything works fine (at least 1000 iterations, 20 parallel threads).
On a lab with 3 OpenStack controller nodes we see the following:

http://paste.openstack.org/show/580293/
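
The pastes above have since expired. For context, the reproduction amounted to requesting Keystone tokens from many parallel threads and counting how many requests unexpectedly come back with 401. A minimal sketch of that approach, assuming keystoneauth1 and placeholder endpoint/credentials (not the original script):

import concurrent.futures

from keystoneauth1 import exceptions as ks_exc
from keystoneauth1 import session
from keystoneauth1.identity import v3

AUTH_URL = "https://192.0.2.10:5000/v3"  # placeholder public endpoint
ITERATIONS = 1000
THREADS = 20

def issue_token(_):
    """Authenticate once and return an error message on an unexpected 401."""
    auth = v3.Password(auth_url=AUTH_URL,
                       username="admin", password="admin",
                       project_name="admin",
                       user_domain_name="Default",
                       project_domain_name="Default")
    sess = session.Session(auth=auth, verify=False)
    try:
        sess.get_token()  # POST /v3/auth/tokens
        return None
    except ks_exc.Unauthorized as exc:
        return str(exc)

with concurrent.futures.ThreadPoolExecutor(max_workers=THREADS) as pool:
    failures = [err for err in pool.map(issue_token, range(ITERATIONS)) if err]

print("401 failures: %d of %d" % (len(failures), ITERATIONS))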

Revision history for this message
Alexander Makarov (amakarov) wrote :

Keystone response time statistics would be handy.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

We have filed a separate issue about the Keystone configuration:

https://bugs.launchpad.net/mos/+bug/1625077

and we think fixing it can help to resolve the issue described in this bug report.

We are going to check the idea on CI.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

We have also filed an issue about the Keystone logs:

https://bugs.launchpad.net/keystone/+bug/1625120

The root cause is in the security requirements: 401 / 403 error codes are not logged in the Keystone logs, and as a result it is not possible to understand why we sometimes get a 401 error.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

We have found the following error in the Keystone logs:

/var/log/keystone/keystone-admin.log:2016-09-20 15:40:32.339 31603 DEBUG keystone.common.sql.core [req-18c19603-75af-43ac-8ae9-341e4835a6f4 1398ffb9caf94d6f8a048ba75125bccd a528d87953d146bbb697e3cef74c138a - default default] Conflict project: (_mysql_exceptions.IntegrityError) (1062, "Duplicate entry 'default' for key 'PRIMARY'") [SQL: u'INSERT INTO project (id, name, domain_id, description, enabled, extra, parent_id, is_domain) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'] [parameters: ('default', 'Default', '<<keystone.domain.root>>', 'Domain created automatically to support V2.0 operations.', 1, '{}', None, 1)] wrapper /usr/lib/python2.7/dist-packages/keystone/common/sql/core.py:411
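
For context, a "Duplicate entry ... for key 'PRIMARY'" conflict is the usual symptom of two concurrent requests both observing that a row is missing and both trying to insert it; whichever INSERT arrives second hits the primary-key constraint. A tiny standalone illustration of the pattern (plain sqlite3, purely illustrative and not Keystone code):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE project (id TEXT PRIMARY KEY, name TEXT)")

def default_project_missing():
    return db.execute("SELECT id FROM project WHERE id = 'default'").fetchone() is None

# Simulate two concurrent requests: both check before either one inserts.
request_a_sees_missing = default_project_missing()  # True
request_b_sees_missing = default_project_missing()  # True

# Request A wins the race.
db.execute("INSERT INTO project (id, name) VALUES ('default', 'Default')")
try:
    # Request B loses and hits the primary-key constraint, analogous to the
    # "Duplicate entry 'default' for key 'PRIMARY'" error in the log above.
    db.execute("INSERT INTO project (id, name) VALUES ('default', 'Default')")
except sqlite3.IntegrityError as exc:
    print("request B:", exc)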

summary: - More than15% of tempest tests triggers to an Error on CI
+ More than15% of tempest tests triggers to an Error on CI: 401
+ Unauthorized error
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote : Re: More than15% of tempest tests triggers to an Error on CI: 401 Unauthorized error

The Keystone team has also proposed a patch which can probably fix the issue:

https://review.fuel-infra.org/#/c/26664/

We are going to test the patch on the Tempest CI.

Revision history for this message
Alexander Makarov (amakarov) wrote :

401/403 verbose responses downstream patch: https://review.fuel-infra.org/#/c/26689/

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The fix for the race condition is under review for Mitaka:
https://review.fuel-infra.org/#/c/26664/

(This fix is already merged to the master branch.)

summary: - More than15% of tempest tests triggers to an Error on CI: 401
+ More than 15% of tempest tests triggers to an Error on CI: 401
Unauthorized error
Revision history for this message
Oleksiy Butenko (obutenko) wrote : Re: More than 15% of tempest tests triggers to an Error on CI: 401 Unauthorized error

Verified on 9.1 snapshot #299.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The priority is High: the issue does not reproduce in 100% of cases, and a manual run of the Tempest tests works fine.

summary: - More than 15% of tempest tests triggers to an Error on CI: 401
- Unauthorized error
+ Tempest tests failed with the error: 401 Unauthorized error
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The issue reproduced again on MOS 9.1 snapshot #347 (after RC3):

https://mirantis.testrail.com/index.php?/runs/view/22608&group_by=cases:section_id&group_order=asc&group_id=49558

401 Unauthorized.

Revision history for this message
Alexander Makarov (amakarov) wrote :

I cannot merge https://review.fuel-infra.org/#/c/27042/ due to HCF, so please apply it manually. It should give some additional information on what's happening.

Revision history for this message
Alexander Makarov (amakarov) wrote :

Sorry, wrong link ^
Here is the correct one: https://review.fuel-infra.org/#/c/26689/

Revision history for this message
Yury Tregubov (ytregubov) wrote :

The same problem is seen in the mos-integration tests for the Fuel LDAP plugin.

Test steps are the following:
        1. Check that LDAP plugin is installed
        2. Login as admin/admin, domain: default
        3. Create a new project
        4. Set role 'admin' to 'user01' in domain openldap1 and project
        5. Relogin as user01/1111, domain: openldap1
        6. Create a network and subnetwork for the new project
        7. Create an instance
        8. Repeat steps 2-7 for domain AD2

The problem occurs on step 5 (a sketch of that step follows this comment).

The actual place in the test code where the fault was caught is https://github.com/Mirantis/mos-integration-tests/blob/master/mos_tests/keystone/test_keystone_ldap.py#L893

An example of the test failure is available here: https://mirantis.testrail.com/index.php?/tests/view/19035866
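
For reference, step 5 boils down to re-authenticating against Keystone as the LDAP-backed user, which is where the 401 appears. A minimal sketch of that step with keystoneauth1; the auth URL and project name are placeholders, not the actual test code:

from keystoneauth1 import exceptions as ks_exc
from keystoneauth1 import session
from keystoneauth1.identity import v3

# Re-login as user01/1111 in the LDAP-backed domain openldap1 (step 5).
auth = v3.Password(auth_url="https://192.0.2.10:5000/v3",  # placeholder
                   username="user01", password="1111",
                   user_domain_name="openldap1",
                   project_name="ldap_project",            # placeholder
                   project_domain_name="openldap1")
sess = session.Session(auth=auth, verify=False)

try:
    token = sess.get_token()
    print("token issued:", token[:16], "...")
except ks_exc.Unauthorized:
    # The intermittent failure observed by the LDAP plugin tests.
    print("401 Unauthorized on re-login")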

Revision history for this message
Boris Bobrov (bbobrov) wrote :

Yury, does the problem happen every time? If yes, it is unrelated to this bug report and you probably need to open a new one.

Revision history for this message
Boris Bobrov (bbobrov) wrote :

I found out that I did not comment on this bug.

Here is what debugging showed.

`RolesNegativeTestJSON:test_delete_role_by_unauthorized_user` fetches a token;
`RolesNegativeTestJSON:test_delete_role_request_without_token` deletes this token;
`RolesNegativeTestJSON:test_list_roles_request_without_token` tries to delete this token again.

Three different tests try to operate with the same token. The debug output shows that this token came from Keystone only once, so Tempest for some reason attempts to reuse the token multiple times (a sketch of this sequence follows this comment).

http://paste.openstack.org/show/584887/ -- here is this part. It is very
verbose, expect to spend some time there. Token starts with
gAAAAABX98nmFrAEeMZlcpefA9jsL.

The bug report should also be marked as not affecting production.
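
To make that sequence concrete, the same behaviour can be reproduced directly against the Keystone v3 API: issue a token once, revoke it, then try to reuse it; any request carrying the revoked token gets a 401. A minimal sketch with the requests library (endpoint and credentials are placeholders):

import requests

KEYSTONE = "https://192.0.2.10:5000/v3"  # placeholder public endpoint
AUTH_BODY = {"auth": {
    "identity": {"methods": ["password"],
                 "password": {"user": {"name": "admin",
                                       "domain": {"name": "Default"},
                                       "password": "admin"}}},
    "scope": {"project": {"name": "admin",
                          "domain": {"name": "Default"}}}}}

# 1. Fetch a token once (what the first negative test does).
resp = requests.post(KEYSTONE + "/auth/tokens", json=AUTH_BODY, verify=False)
token = resp.headers["X-Subject-Token"]

# 2. Revoke the token (what the second negative test does).
requests.delete(KEYSTONE + "/auth/tokens",
                headers={"X-Auth-Token": token, "X-Subject-Token": token},
                verify=False)

# 3. Reuse the revoked token (what the third test effectively does):
#    Keystone now rejects it with 401 Unauthorized.
reuse = requests.get(KEYSTONE + "/projects",
                     headers={"X-Auth-Token": token},
                     verify=False)
print(reuse.status_code)  # 401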

Revision history for this message
Boris Bobrov (bbobrov) wrote :

I want to disable caching in Tempest and see if it fixes the issue, and if not, insert more debug statements. Unfortunately, I am not familiar with our test running suite. That's why I am moving the bug report to MOS QA: I need their help to continue debugging.

MOS QA, please ping me when you have time.

Revision history for this message
Maksym Shalamov (mshalamov) wrote :

Hello Boris,

We will provide all the required help from our side.

I have scheduled a meeting where we will discuss the steps required to identify the root cause of the issue.

Best regards,

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

ETA for RCA: Dec 23.

Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :

This bug reproduces roughly once every six months.
Lately we have not been able to reproduce it.
When it reproduces again, we (the QA team) will collect logs and provide an environment for debugging.

Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

The bug reproduced on 9.2 snapshot #664.
Today I can provide an environment for debugging.
Snapshot: https://drive.google.com/a/mirantis.com/file/d/0BxPLDs6wcpbDdDhZS1ZiR3Y4QW8/view?usp=sharing

Revision history for this message
Boris Bobrov (bbobrov) wrote :

The issue was not reproducible on the environment described above.

I am not sure why MOS Keystone was assigned, because we figured out that the issue is in Tempest. I am moving the bug report back to QA.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Since the dev team can't reproduce the issue, the bug is marked as Incomplete until it is reproduced again and the QA team identifies the exact steps to reproduce it.

Revision history for this message
Oleksiy Butenko (obutenko) wrote :

Reproduced again on MOS 9.2 snapshot #744.
If you need access to the environment and the steps to reproduce, contact me.

tags: added: blocker-for-qa
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/keystone (mcp/newton)

Fix proposed to branch: mcp/newton
Change author: Lance Bragstad <email address hidden>
Review: https://review.fuel-infra.org/33600

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/keystone (11.0/ocata)

Fix proposed to branch: 11.0/ocata
Change author: Lance Bragstad <email address hidden>
Review: https://review.fuel-infra.org/34091

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/keystone (mcp/ocata)

Fix proposed to branch: mcp/ocata
Change author: Lance Bragstad <email address hidden>
Review: https://review.fuel-infra.org/34796

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/keystone (mcp/ocata)

Change abandoned by Roman Podoliaka <email address hidden> on branch: mcp/ocata
Review: https://review.fuel-infra.org/34796
Reason: in favor of Iaee0ec8c7acd512b9d93096ce8306a2952061c7a included in upstream stable/ocata that we sync

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/keystone (mcp/newton)

Change abandoned by Roman Podoliaka <email address hidden> on branch: mcp/newton
Review: https://review.fuel-infra.org/33600
Reason: in favor of Iaee0ec8c7acd512b9d93096ce8306a2952061c7a that is included to upstream stable/newton that we sync

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/keystone (11.0/ocata)

Change abandoned by Roman Podoliaka <email address hidden> on branch: 11.0/ocata
Review: https://review.fuel-infra.org/34091
Reason: we don't use 11.0/ocata anymore - mcp/ocata is the correct branch name
