Bug #1621798 “Tempest tests failed with the error: 401 Unauthori...” : Series 9.x : Bugs : Mirantis OpenStack

Sergii Turivnyi (sturivnyi) on 2016-09-09

summary:	- A lot of tempest tests are failed on CI + More than20% of tempest tests triggers to an Error on CI
Changed in mos:
status:	New → Confirmed
importance:	Undecided → High
summary:	- More than20% of tempest tests triggers to an Error on CI + More than15% of tempest tests triggers to an Error on CI

Sergii Turivnyi (sturivnyi) on 2016-09-09

description:	updated
description:	updated

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-09: Re: More than15% of tempest tests triggers to an Error on CI

#1

What is the root of the issue?

Revision history for this message

Alexander Gubanov (ogubanov) wrote on 2016-09-09:

#2

All failed tests related to keystone (the same traceback for all tests):

...
raise exceptions.Unauthorized(resp_body, resp=resp)
tempest.lib.exceptions.Unauthorized: Unauthorized
Details: {u'message': u'The request you have made requires authentication.', u'code': 401, u'title': u'Unauthorized'}

For example: https://mirantis.testrail.com/index.php?/tests/view/15365287&group_by=cases:section_id&group_order=asc&group_id=49557

Revision history for this message

Alexander Gubanov (ogubanov) wrote on 2016-09-09:

#3

Here is diagnostic snapshot http://srv97-bud.infra.mirantis.net/snapshots/fuel-snapshot-2016-09-09_13-34-46.tar for further debugging.

Changed in mos:
assignee:	Alexander Gubanov (ogubanov) → MOS QA Team (mos-qa)

Sergii Turivnyi (sturivnyi) on 2016-09-09

Changed in mos:
importance:	High → Critical
assignee:	MOS QA Team (mos-qa) → MOS Keystone (mos-keystone)

Vitaly Sedelnik (vsedelnik) on 2016-09-09

tags:

added: area-keystone

Revision history for this message

Alexander Makarov (amakarov) wrote on 2016-09-09:

#4

401 response is "not authorized" - not necessarily "an error in keystone".

Revision history for this message

Alexander Makarov (amakarov) wrote on 2016-09-09:

#5

No errors in keystone logs. I suggest checking haproxy and apache logs.

Changed in mos:
assignee:	MOS Keystone (mos-keystone) → MOS QA Team (mos-qa)

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-12:

#6

QA team is working on debugging the issue, it looks like the root of the issue in QA scripts / Rally or Tempest configuration.

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-14:

#7

We can see the following on the test environment when tests failed:

root@node-2:~# haproxy-status.sh | grep DOWN
object-storage node-1 Status: DOWN/L4TOUT Sessions: 0 Rate: 0
object-storage node-3 Status: DOWN/L4TOUT Sessions: 0 Rate: 0

root@node-3:~# ceph status
    cluster 3306b9f1-7582-4079-b1e8-18de068de956
     health HEALTH_WARN
            too many PGs per OSD (352 > max 300)
     monmap e3: 3 mons at {node-1=10.109.7.3:6789/0,node-2=10.109.7.2:6789/0,node-3=10.109.7.1:6789/0}
            election epoch 18, quorum 0,1,2 node-3,node-2,node-1
     osdmap e246: 4 osds: 4 up, 4 in
      pgmap v4960: 704 pgs, 10 pools, 172 MB data, 98 objects
            8709 MB used, 188 GB / 197 GB avail
                 704 active+clean

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-14:

#9

tempest-report.html Edit (993.8 KiB, text/html)

The issue didn't reproduce on my custom environment with 3 controllers, 2 compute nodes, Ceph, Neutron VLAN. I used virtual lab with 4Gb for each VM.

Changed in mos:
status:	Confirmed → Incomplete

Revision history for this message

Boris Bobrov (bbobrov) wrote on 2016-09-14:

#10

We are still investigating it from keystone side with Oleksii Butenko

Revision history for this message

Boris Bobrov (bbobrov) wrote on 2016-09-14:

#11

Right now i am checking whether this bug happens due to cache issues by enabling and disabling cache.

Revision history for this message

Boris Bobrov (bbobrov) wrote on 2016-09-14:

#12

There is a fix for this in upstream: https://review.openstack.org/#/c/369618/ . I am now waiting for reviews from other people, and if their reviews will be positive i will pull it to downstream branches.

Also I am marking it as fix committed in 10.0 because it was merged in upstream newton.

Timur Nurlygayanov (tnurlygayanov) on 2016-09-14

Changed in mos:
assignee:	MOS QA Team (mos-qa) → Boris Bobrov (bbobrov)

Revision history for this message

Boris Bobrov (bbobrov) wrote on 2016-09-14:

#13

Please disregard the comment ^, it was supposed to go to https://bugs.launchpad.net/mos/+bug/1582220 :(

no longer affects:

mos/10.0.x

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-15:

#14

We changed the strategy of the debugging, the root of the issue is unknown for the moment.

I will update the status in next 6 hours.

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-15:

#15

The update:

We have manually deployed 4 different OpenStack configurations (the same configurations as we used before on CI) and manually started Tempest tests on these configurations. One configuration failed on OSTF check [2], 3 other configurations were deployed correctly and Tempest tests suite successfully passed for these configurations (results in TestRail for MOS 9.1 snapshot #266: [3][4][5]).

Short Summary:
One of the default OpenStack configurations failed OSTF tests [2], 2 other default configurations successfully passed Tempest test suite (summary pass rate 97%), the execution of Tempest on the one configuration is still in progress.

We need to investigate the failed deployment [2] and then restart Tempest tests for this configuration to make sure that we have the high pass rate for all required OpenStack configurations. We also need to wait when we execute Tempest on all default configuration to make sure we don't have any other issues.

I suggest we do not make HCF until the issue [2] will be investigated and fixed to make sure we don't have any important regression issues.

Note: QA team is working on the debugging and fixing of Tempest CI jobs to avoid any false-negative failures in the future. We will run Tempest test suite and upload results to TestRail manually every day while the CI jobs won't be fixed.

[1] https://bugs.launchpad.net/mos/9.x/+bug/1621798
[2] https://bugs.launchpad.net/fuel/+bug/1623934
[3] https://mirantis.testrail.com/index.php?/runs/view/20755&group_by=cases:section_id&group_order=asc (97% pass rate)
[4] https://mirantis.testrail.com/index.php?/runs/view/20754&group_by=cases:section_id&group_order=asc (99% pass rate)
[5] https://mirantis.testrail.com/index.php?/runs/view/20756&group_by=cases:section_id&group_order=asc (94% pass rate)

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-16:

#16

Today we have manually executed Tempest on 4 different OpenStack configurations, the results are good [1][2][3][4] (summary pass rate is 98.5%). CI jobs are still broken for the moment and we are working on the fixes.

We also applied several fixes in Tempest and QA scripts to fix the issue on CI:
1) We removed mongo database from the controller nodes
2) We added default configurations with 1 controller to check if the issue reproduced on environment with one controller
3) We added the custom change in Tempest [5] to get the detailed information about the errors with 401 status code (the root of the issue with Tempest results degradation). We expect that this change will help to understand what is going on with keystone during the tests.
4) We removed parallel execution of Tempest tests from CI jobs and now we run all tests in one thread to avoid any raise condition issues.
5) We added new servers to run different deployments / tests with different fixes in parallel.

Today we also merged the fix to Keystone [6] which fixes another Tempest-related issue [7] and we expect that the fix for the issue can also help to fix low Tempest pass rate.

Short Summary:
We still have an issue with Tempest on CI, the root of the issue is unknown yet but the issue which was fixed today in Keystone can probably affect it, we are waiting for the nightly jobs to check it.
We can't reproduce it manually, this is why we think it is not a blocker issue for the product, OpenStack components work fine and we just need to finally debug the issue to make sure that we don't have any hard-reproducible critical defects in the product.

We will continue to run Tempest test suite manually even on the weekends until we fix the issue with CI jobs.

[1] https://mirantis.testrail.com/index.php?/runs/view/20833&group_by=cases:section_id&group_order=asc (pass rate: 99%)
[2] https://mirantis.testrail.com/index.php?/runs/view/20835&group_by=cases:section_id&group_order=asc (pass rate: 99%)
[3] https://mirantis.testrail.com/index.php?/runs/view/20870&group_by=cases:section_id&group_order=asc (pass rate: 99%)
[4] https://mirantis.testrail.com/index.php?/runs/view/20874&group_by=cases:section_id&group_order=asc (pass rate: 98%)
[5] https://review.openstack.org/#/c/371418/
[6] https://review.fuel-infra.org/#/c/26523/
[7] https://bugs.launchpad.net/mos/+bug/1582220

Today we have manually executed Tempest on 4 different OpenStack configurations, the results are good [1][2][3][4] (summary pass rate is 98.5%). CI jobs are still broken for the moment and we are working on the fixes.

We also applied several fixes in Tempest and QA scripts to fix the issue on CI:
1) We removed mongo database from the controller nodes
2) We added default configurations with 1 controller to check if the issue reproduced on environment with one controller
3) We added the custom change in Tempest [5] to get the detailed information about the errors with 401 status code (the root of the issue with Tempest results degradation). We expect that this change will help to understand what is going on with keystone during the tests.
4) We removed parallel execution of Tempest tests from CI jobs and now we run all tests in one thread to avoid any raise condition issues.
5) We added new servers to run different deployments / tests with different fixes in parallel.

Today we also merged the fix to Keystone [6] which fixes another Tempest-related issue [7] and we expect that the fix for the issue can also help to fix low Tempest pass rate.

Short Summary:
We still have an issue with Tempest on CI, the root of the issue is unknown yet but the issue which was fixed today in Keystone can probably affect it, we are waiting for the nightly jobs to check it.
We can't reproduce it manually, this is why we think it is not a blocker issue for the product, OpenStack components work fine and we just need to finally debug the issue to make sure that we don't have any hard-reproducible critical defects in the product.

We will continue to run Tempest test suite manually even on the weekends until we fix the issue with CI jobs.

[1] https://mirantis.testrail.com/index.php?/runs/view/20833&group_by=cases:section_id&group_order=asc (pass rate: 99%)
[2] https://mirantis.testrail.com/index.php?/runs/view/20835&group_by=cases:section_id&group_order=asc (pass rate: 99%)
[3] https://mirantis.testrail.com/index.php?/runs/view/20870&group_by=cases:section_id&group_order=asc (pass rate: 99%)
[4] https://mirantis.testrail.com/index.php?/runs/view/20874&group_by=cases:section_id&group_order=asc (pass rate: 98%)
[5] https://review.openstack.org/#/c/371418/
[6] https://review.fuel-infra.org/#/c/26523/
[7] https://bugs.launchpad.net/mos/+bug/1582220

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-16:

#17

I've reproduced the issue by manual execution of the following script:

http://paste.openstack.org/show/580294/

On lab with 1 OpenStack controller node everything works fine (at least 1000 iterations, 20 parallel threads)
On lab with 3 OpenStack controller nodes we can the following:

http://paste.openstack.org/show/580293/

Revision history for this message

Alexander Makarov (amakarov) wrote on 2016-09-19:

#18

Keystone response time statistics would be handy.

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-19:

#19

We described the separate issue with Keystone configuration:

https://bugs.launchpad.net/mos/+bug/1625077

and we think it can help to fix the issue which is described in this bug report.

We are going to check the idea on CI.

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-19:

#20

We have also described the issue Keystone logs:

https://bugs.launchpad.net/keystone/+bug/1625120

The root of the issue in the security requirements, in the result we don't see any logs about 401 / 403 errors code in Keystone logs and as a result it is not possible to understand why sometimes we got 401 error.

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-20:

#21

We have found the following error in keystone logs:

/var/log/keystone/keystone-admin.log:2016-09-20 15:40:32.339 31603 DEBUG keystone.common.sql.core [req-18c19603-75af-43ac-8ae9-341e4835a6f4 1398ffb9caf94d6f8a048ba75125bccd a528d87953d146bbb697e3cef74c138a - default default] Conflict project: (_mysql_exceptions.IntegrityError) (1062, "Duplicate entry 'default' for key 'PRIMARY'") [SQL: u'INSERT INTO project (id, name, domain_id, description, enabled, extra, parent_id, is_domain) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'] [parameters: ('default', 'Default', '<<keystone.domain.root>>', 'Domain created automatically to support V2.0 operations.', 1, '{}', None, 1)] wrapper /usr/lib/python2.7/dist-packages/keystone/common/sql/core.py:411

summary:

- More than15% of tempest tests triggers to an Error on CI
+ More than15% of tempest tests triggers to an Error on CI: 401
+ Unauthorized error

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-21: Re: More than15% of tempest tests triggers to an Error on CI: 401 Unauthorized error

#22

Keystone team have also proposed the patch which can probably fix the issue:

https://review.fuel-infra.org/#/c/26664/

We are going to test the patch on Tempest CI

Revision history for this message

Alexander Makarov (amakarov) wrote on 2016-09-21:

#23

401/403 verbose responses downstream patch: https://review.fuel-infra.org/#/c/26689/

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-21:

#24

The fix for race conditions is on review for Mitaka:
https://review.fuel-infra.org/#/c/26664/

(this fix is already merged to master branch.)

Timur Nurlygayanov (tnurlygayanov) on 2016-09-22

summary:

- More than15% of tempest tests triggers to an Error on CI: 401
+ More than 15% of tempest tests triggers to an Error on CI: 401
Unauthorized error

Revision history for this message

Oleksiy Butenko (obutenko) wrote on 2016-09-23: Re: More than 15% of tempest tests triggers to an Error on CI: 401 Unauthorized error

#25

verified on 9.1 snapshot #299

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-10-05:

#27

The priority is High, the issue reproduced not in 100% of cases and manual run of Tempest tests works fine.

Timur Nurlygayanov (tnurlygayanov) on 2016-10-05

summary:

- More than 15% of tempest tests triggers to an Error on CI: 401
- Unauthorized error
+ Tempest tests failed with the error: 401 Unauthorized error

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-10-05:

#28

The issue reproduced again on MOS 9.1 snapshot #347 (after RC3):

https://mirantis.testrail.com/index.php?/runs/view/22608&group_by=cases:section_id&group_order=asc&group_id=49558

401 Unauthorized.

Revision history for this message

Alexander Makarov (amakarov) wrote on 2016-10-06:

#29

I cannot merge https://review.fuel-infra.org/#/c/27042/ due to HCF, so please apply it manually. It should give some additional information on what's happening.

Revision history for this message

Alexander Makarov (amakarov) wrote on 2016-10-06:

#30

Sorry, wring link ^
Here is the correct one: https://review.fuel-infra.org/#/c/26689/

Revision history for this message

Yury Tregubov (ytregubov) wrote on 2016-10-14:

#31

The same problem is seen on mos-integration tests for FUEL LDAP plugin.

Test steps are the following:
        1. Check that LDAP plugin is installed
        2. Login as admin/admin, domain: default
        3. Create a new project
        4. Set role 'admin' to 'user01' in domain openldap1 and project
        5. Relogin as user01/1111, domain: openldap1
        6. Create a network and subnetwork for the new project
        7. Create an instance
        8. Repeat steps 2-7 for domain AD2

The problem occurs on step 5.

Actual place in test code where the fault was couch is https://github.com/Mirantis/mos-integration-tests/blob/master/mos_tests/keystone/test_keystone_ldap.py#L893

Example of the test fail is available here https://mirantis.testrail.com/index.php?/tests/view/19035866

Revision history for this message

Boris Bobrov (bbobrov) wrote on 2016-10-14:

#32

Yury, does the problem happen every time? If yes, it is unrelated to this bugreport and you probably need to open a new one.

Revision history for this message

Boris Bobrov (bbobrov) wrote on 2016-10-14:

#33

I found out that i did not comment on this bug.

Here is what debugging showed.

`RolesNegativeTestJSON:test_delete_role_by_unauthorized_user` fetches a token;
`RolesNegativeTestJSON:test_delete_role_request_without_token` deletes this token;
`RolesNegativeTestJSON:test_list_roles_request_without_token` tries to delete this token again.

3 different tests try to operate with the same token. Debug output shows that this token came from keystone only once. So, tempest for some reason attempts to reuse the token multiple times.

http://paste.openstack.org/show/584887/ -- here is this part. It is very
verbose, expect to spend some time there. Token starts with
gAAAAABX98nmFrAEeMZlcpefA9jsL.

The bugreport should also be marked as not affecting production.

Revision history for this message

Boris Bobrov (bbobrov) wrote on 2016-10-19:

#34

I want to disable caching in tempest and see if it fixes the issue, and if not, insert more debug statements. Unfortunately, i am not familiar with our test running suite. That's why i am moving the bugreport to MOS QA because i need their help to continue debugging.

MOS QA, please ping me when you have time.

Revision history for this message

Maksym Shalamov (mshalamov) wrote on 2016-10-20:

#35

Hello Boris,

We will provide to you all required help from our side.

I scheduled the meeting where we will discuss required steps which help to recognize root cause of the issue.

Best regards,

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-12-20:

#36

ETA for RCA: Dec 23.

Revision history for this message

Sergii Turivnyi (sturivnyi) wrote on 2016-12-20:

#37

This bug is reproducing once per six month.
Lately we can't reproduce it.
When it will reproduced we (QA team) will collect logs and provide environment for debug

Revision history for this message

Sofiia Andriichenko (sandriichenko) wrote on 2016-12-22:

#38

Bug reproduced 9.2 snapshot #664
Today, I can provide environment for debug
snapshot: https://drive.google.com/a/mirantis.com/file/d/0BxPLDs6wcpbDdDhZS1ZiR3Y4QW8/view?usp=sharing

Revision history for this message

Boris Bobrov (bbobrov) wrote on 2016-12-27:

#39

The issue was not reproducible on the env described above.

I am not sure why MOS Keystone was assigned, because we figured out that the issue is in tempest. I am moving the bugreport back to QA.

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2017-01-09:

#40

So dev team can't reproduce the issue, the bug marked as Incomplete until it wasn't reproduced again and QA team will identify the exact steps to reproduce.

Revision history for this message

Oleksiy Butenko (obutenko) wrote on 2017-01-11:

#41

if you need access to env and steps to reproduce - contact me
reproduced again on MOS 9.2 snapshot #744

Tatyana Borukhovich (tatyanaborukhovich) on 2017-01-11

tags:

added: blocker-for-qa

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2017-04-20: Fix proposed to openstack/keystone (mcp/newton)

#42

Fix proposed to branch: mcp/newton
Change author: Lance Bragstad <email address hidden>
Review: https://review.fuel-infra.org/33600

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2017-04-20: Fix proposed to openstack/keystone (11.0/ocata)

#43

Fix proposed to branch: 11.0/ocata
Change author: Lance Bragstad <email address hidden>
Review: https://review.fuel-infra.org/34091

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2017-04-24: Fix proposed to openstack/keystone (mcp/ocata)

#44

Fix proposed to branch: mcp/ocata
Change author: Lance Bragstad <email address hidden>
Review: https://review.fuel-infra.org/34796

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2017-04-26: Change abandoned on openstack/keystone (mcp/ocata)

#45

Change abandoned by Roman Podoliaka <email address hidden> on branch: mcp/ocata
Review: https://review.fuel-infra.org/34796
Reason: in favor of Iaee0ec8c7acd512b9d93096ce8306a2952061c7a included in upstream stable/ocata that we sync

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2017-04-26: Change abandoned on openstack/keystone (mcp/newton)

#46

Change abandoned by Roman Podoliaka <email address hidden> on branch: mcp/newton
Review: https://review.fuel-infra.org/33600
Reason: in favor of Iaee0ec8c7acd512b9d93096ce8306a2952061c7a that is included to upstream stable/newton that we sync

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2017-04-26: Change abandoned on openstack/keystone (11.0/ocata)

#47

Change abandoned by Roman Podoliaka <email address hidden> on branch: 11.0/ocata
Review: https://review.fuel-infra.org/34091
Reason: we don't use 11.0/ocata anymore - mcp/ocata is the correct branch name

	Status	Importance	Assigned to	Milestone
Mirantis OpenStack	Status tracked in 10.0.x
10.0.x	Confirmed	Medium	Unassigned	Mirantis OpenStack 10.0
9.x	Confirmed	Medium	Unassigned	Mirantis OpenStack 9.x-updates

Mirantis OpenStack

Tempest tests failed with the error: 401 Unauthorized error

Bug Description

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches