Image creation over glance v1 failed with permission denied error

Bug #1443913 reported by Tatyanka
This bug affects 2 people
Affects              Status         Importance   Assigned to         Milestone
Fuel for OpenStack   Fix Committed  High         Oleksiy Molchanov
7.0.x                Invalid        High         MOS Keystone

Bug Description

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2-6.1"
  api: "1.0"
  build_number: "310"
  build_id: "2015-04-13_22-54-31"
  nailgun_sha: "d22c074dec091e5ddd8ea3003c37665058303cd5"
  python-fuelclient_sha: "9208ff4a08dcb674ce2df132399a5aa3ddfac21c"
  astute_sha: "d96a80b63198a578b2c159edbd76048819039eb0"
  fuellib_sha: "8b80657e9ceed8d59c2dff1c11e1481c7e69380e"
  ostf_sha: "c2a76a60ec4ebbd78e508216c2e12787bf25e423"
  fuelmain_sha: "335d3ed09ed79bd37e1f7a90442c4831c8845582"
[root@nailgun ostf]#

Scenario:
OS: Ubuntu, nova-network
Destroy two controllers and check that the pacemaker status is correct.
Pre-condition: HA env with 3 controllers and 2 computes was deployed; a snapshot was created and reverted.
Steps:
1. Destroy first controller
2. Check pacemaker status
3. Run OSTF
4. Revert environment
5. Destroy second controller
6. Check pacemaker status
7. Run OSTF

Actual result:
The create/update/delete image over Glance v1 test failed with a 500 server error in the logs.
In glance:
2015-04-14 04:07:47.848 32397 ERROR glance.image_cache [588c4f5b-decd-4f4a-a7f9-a659ee580014 e000a8d82b2446de91e8469509372579 34cb7c36ed4c4b93bab71228feca6628 - - -] Exception encountered while tee'ing image '8a5b4e28-289b-471f-a53c-b7ea4dbc2bfd' into cache: [Errno 13] Permission denied: '/var/lib/glance/image-cache/incomplete/8a5b4e28-289b-471f-a53c-b7ea4dbc2bfd'. Continuing with response.

In swift:
http://paste.openstack.org/show/203840/

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Alex Schultz (alex-schultz)
Revision history for this message
Alex Schultz (alex-schultz) wrote :

I'm looking into this issue and I'm not able to replicate the 500 error, but I am able to replicate some other issues with 401 errors and token issues. When you said you reverted the environment, do you mean that you restored the HA cluster so that all 3 servers were up and functional? I'm trying to get an exact set of steps to recreate the 500 from glance. Is this the scenario in which you witnessed the 500?

1) 3 nodes up (node-1 primary)
2) node-1 is killed (virsh destroy node-1)
3) node-2 takes over, verified via crm status
4) OSTF tests run OK
5) turn node-1 back on, wait for it to join the cluster
6) node-2 (now the primary) is killed (virsh destroy node-2)
7) node-3 takes over, verified via crm status
8) OSTF tests report 500 for glance v1 actions

What I'm seeing is 401 errors in OSTF after the failover occurs, until some timeout passes and either a new token is generated or a cache somewhere expires.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Alex, yes, the scenario is exactly the same as you describe. As for the 401 issue from keystone, I absolutely agree that we need to wait a little bit for a new token, and I did wait for it; after that I ran into the issue with the OSTF glance test over v1. I will look at CI tomorrow and may be able to provide an environment with this issue if it happens again.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

The issue is reproduced again on Ubuntu here: http://jenkins-product.srt.mirantis.net:8080/view/6.1_swarm/job/6.1.system_test.centos.thread_5/99/testReport/junit/%28root%29/ha_nova_destroy_controllers/ha_nova_destroy_controllers/ (I looked at CentOS and there is no issue with permission denied errors), so it seems the issue is related to failover on Ubuntu.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

I was having a terrible time reproducing this issue outside of Jenkins. I was finally able to reproduce it on the Jenkins box itself. It appears to be related to token issues after a failover event. I was able to find a bug [0] filed against glance that I believe is what is actually occurring. This issue also appears to happen on both CentOS and Ubuntu. Once I got onto a system where this issue had happened, I was able to see that the image is stuck in a queued state [1]. The glance log snippet shows a NotAuthorized error [2].

[0] https://bugs.launchpad.net/glance/+bug/1371121
[1] http://paste.openstack.org/show/205023/
[2] http://paste.openstack.org/show/205021/

summary: - Image creation over glance v1 failed with permission denided error
+ Image creation over glance v1 failed with permission denied error
Revision history for this message
Alex Schultz (alex-schultz) wrote :

I've assigned this bug to the mos-glance team, as this particular issue seems to be in glance itself when it gets a 401 during the v1 image upload call. See http://paste.openstack.org/show/205021/ for the stack trace that is causing the 500.

Changed in fuel:
assignee: Alex Schultz (alex-schultz) → MOS Glance (mos-glance)
ruhe (ruhe)
Changed in fuel:
assignee: MOS Glance (mos-glance) → Inessa Vasilevskaya (ivasilevskaya)
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

Alex, the problem with the 500 error code on the Glance side, as I see it, is that the NotAuthenticated exception is not handled here: https://github.com/openstack/glance/blob/master/glance/common/client.py#L64

Either wrapping the exception and returning 400 BadRequest here, or adding a NotAuthenticated exception handler in https://github.com/openstack/glance/blob/master/glance/api/v1/images.py#L803, will fix the 500 error code.

I'm going to look into the matter and produce a fix and a covering test tomorrow.
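
A minimal sketch of the shape of that change, assuming exception.NotAuthenticated escapes from the registry call on the v1 upload path; the helper name and wiring are illustrative, not the actual glance controller code:

    # Illustrative sketch only, not the actual glance v1 controller code.
    # Assumes glance.common.exception.NotAuthenticated can bubble up from the
    # registry call made while the image moves from 'queued' to 'saving'.
    import webob.exc

    from glance.common import exception


    def upload_with_auth_handling(controller, req, image_meta, image_data):
        try:
            # 'upload_and_activate' is a hypothetical stand-in for the real
            # v1 upload path that talks to the registry.
            return controller.upload_and_activate(req, image_meta, image_data)
        except exception.NotAuthenticated as e:
            # Turn the internal auth failure into a 400, as suggested above,
            # instead of letting it escape and surface to the client as a 500.
            raise webob.exc.HTTPBadRequest(explanation=str(e),
                                           request=req,
                                           content_type='text/plain')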

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :
Revision history for this message
Mike Fedosin (mfedosin) wrote :

I think it's better to add the exception handler in the place where the concrete call is made: https://github.com/openstack/glance/blob/stable/juno/glance/api/v1/images.py#L618-619

Also, if we want to handle exception.NotAuthenticated, we have to remember that it may be raised in two cases:

1. If our token expired, then we got another one here https://github.com/openstack/glance/blob/stable/juno/glance/common/client.py#L70 and after that the new one expired too. In that case, using exception.NotAuthenticated is correct.

2. If we had wrong credentials, we got an UNAUTHORIZED response from the registry server https://github.com/openstack/glance/blob/stable/juno/glance/common/client.py#L515, then we attempted to authenticate with the wrong credentials again but got the same response. As far as I can see, catching exception.NotAuthenticated here is not good, because we were authenticated but not authorized.

So, can you attach the glance registry log here too? I suppose it would help.
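
To make the two cases concrete, a hypothetical illustration of the retry-once behaviour described above (this is not the real glance/common/client.py code; send_request and authenticate are stand-in callables):

    # Hypothetical stand-in for the registry client behaviour: a 401 triggers
    # one re-authentication attempt, and only a second 401 is reported as
    # NotAuthenticated, which covers case 1 (token expired twice) as well as
    # case 2 (credentials were wrong both times).
    import httplib

    from glance.common import exception


    def request_with_reauth(send_request, authenticate):
        authenticate()
        resp = send_request()
        if resp.status == httplib.UNAUTHORIZED:
            authenticate()  # fetch a brand-new token and retry once
            resp = send_request()
            if resp.status == httplib.UNAUTHORIZED:
                raise exception.NotAuthenticated()
        return resp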

Revision history for this message
Alex Schultz (alex-schultz) wrote :

You should be able to take a look at the registry logs from the snapshot attached to the bug. Unfortunately, from what I saw there wasn't anything out of the ordinary in the registry logs, and the only error was in the API log. Since this is a test around controller failures and we're using memcache as our token store, I was wondering if this was caused by either a memcache connection error from a stale connection or an inconsistency between the token validation that occurs at the start of the v1 call and the internal calls that are made to process it. As far as I know, the token isn't technically expired, it's just missing because of the controller failure scenario, and the credentials are fine, because you can wait a bit and the call will work just fine.
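
"Memcache as our token store" refers to keystone keeping token data in memcached on the controllers; an illustrative keystone.conf excerpt of that kind of Juno-era setup (the server addresses are made up and the exact MOS 6.1 values may differ):

    [token]
    # Memcache-backed token persistence: tokens live only in the memcached
    # instances listed below, so a destroyed controller takes its share of
    # the stored tokens with it.
    driver = keystone.token.persistence.backends.memcache.Token

    [memcache]
    servers = 192.168.0.3:11211,192.168.0.4:11211,192.168.0.5:11211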

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Here is a snippet of the registry log entries for the 500, taken from the attached snapshot: http://paste.openstack.org/show/209188/

ruhe (ruhe)
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Inessa Vasilevskaya <email address hidden>
Review: https://review.fuel-infra.org/6303

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

Currently I can't reproduce the original issue, as HA is not functioning in the latest MOS build due to corosync issues (see https://bugs.launchpad.net/fuel/+bug/1443800).

Revision history for this message
Mike Fedosin (mfedosin) wrote :

I just investigated this issue a little bit. Based on the attached logs, the sequence was:
0. The Glance API got a request with a valid token for the creation of a new image.
1. An image record with status 'queued' was added to the DB via the registry server. Here the authorization with the token passed and the new image instance appeared.
https://github.com/openstack/glance/blob/stable/juno/glance/api/v1/images.py#L547
2. Then there was a check with a second request to the registry, and the authorization succeeded too.
https://github.com/openstack/glance/blob/stable/juno/glance/api/v1/images.py#L774
3. But on the next operation, when we wanted to update the image status from 'queued' to 'saving' to begin uploading, we got an authorization error with a 404 response code from Keystone.
https://github.com/openstack/glance/blob/stable/juno/glance/api/v1/images.py#L618

The problem is that the token could not have expired: these three operations were performed in less than a second, just before uploading the file.
From the attached logs I found the following:

2015-04-14T04:01:48.246451+00:00 debug: 2015-04-14 04:01:48.246 16570 DEBUG urllib3.connectionpool [-] "GET /v3/auth/tokens HTTP/1.1" 404 115 _make_request /usr/lib/python2.7/dist-packages/urllib3/connectionpool.py:430
2015-04-14T04:01:48.247693+00:00 debug: 2015-04-14 04:01:48.247 16570 DEBUG keystoneclient.session [-] RESP: _http_log_response /usr/lib/python2.7/dist-packages/keystoneclient/session.py:182
2015-04-14T04:01:48.247693+00:00 debug: 2015-04-14 04:01:48.247 16570 DEBUG keystoneclient.session [-] Request returned failure status: 404 request /usr/lib/python2.7/dist-packages/keystoneclient/session.py:345
2015-04-14T04:01:48.249078+00:00 warning: 2015-04-14 04:01:48.248 16570 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2015-04-14T04:01:48.249078+00:00 warning: 2015-04-14 04:01:48.248 16570 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Could not find token: d90482e1854a46019815f322e94bbe75", "code": 404, "title": "Not Found"}}
2015-04-14T04:01:48.249078+00:00 debug: 2015-04-14 04:01:48.248 16570 DEBUG keystonemiddleware.auth_token [-] Token validation failure. _validate_token /usr/lib/python2.7/dist-packages/keystonemiddleware/auth_token.py:889
2015-04-14 04:01:48.248 16570 TRACE keystonemiddleware.auth_token Traceback (most recent call last):
2015-04-14 04:01:48.248 16570 TRACE keystonemiddleware.auth_token File "/usr/lib/python2.7/dist-packages/keystonemiddleware/auth_token.py", line 880, in _validate_token
2015-04-14 04:01:48.248 16570 TRACE keystonemiddleware.auth_token expires = _get_token_expiration(data)
2015-04-14 04:01:48.248 16570 TRACE keystonemiddleware.auth_token File "/usr/lib/python2.7/dist-packages/keystonemiddleware/auth_token.py", line 436, in _get_token_expiration
2015-04-14 04:01:48.248 16570 TRACE keystonemiddleware.auth_token raise InvalidToken('Token authorization failed')
2015-04-14 04:01:48.248 16570 TRACE keystonemiddleware.auth_token InvalidToken: Token authorization failed
2015-04-14 04:01:48.248 16570 TRACE keystonemiddleware.auth_token
2015-04-14T04:01:48.250276+00:00 debug: 2015-04-14 04:01:48.250 165...


Revision history for this message
Mike Fedosin (mfedosin) wrote :

Okay, the problem is that the valid token is cached in keystone_middleware, but originally it was stored in the memcached server on the destroyed controller. At some point the cached token expires and keystone_middleware tries to check authentication against the local Keystone server, but that obviously fails, because the token instance was stored on the switched-off controller.

This situation cannot be fixed now, but MOS 7.0 will have Fernet tokens, which aren't stored on a server (see https://mirantis.jira.com/browse/PROD-232). That's why any Keystone server will be able to authenticate a token.

The workaround for now is to wait 5 minutes (the cached token expiration time) after removing a controller before the next run of the tests.

So the status of this bug is Won't Fix for 6.1 and Invalid for 7.0+.
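
As an illustration of that workaround on the test side, a small polling helper of the kind a harness could use; get_token is a hypothetical callable that fetches a fresh token, and only python-requests is assumed:

    # Illustrative only: after destroying a controller, retry a cheap
    # authenticated glance call until it stops failing (worst case roughly
    # the 300 s token cache lifetime) before kicking off OSTF.
    import time

    import requests


    def wait_for_glance_auth(glance_url, get_token, timeout=300, interval=15):
        deadline = time.time() + timeout
        while time.time() < deadline:
            resp = requests.get(glance_url + '/v1/images',
                                headers={'X-Auth-Token': get_token()})
            if resp.status_code not in (401, 500):
                return
            time.sleep(interval)
        raise RuntimeError('glance still rejecting tokens after %ss' % timeout)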

Revision history for this message
ruhe (ruhe) wrote :

The reasons for changing the ticket are listed in the latest message from Mike.

Changed in fuel:
status: In Progress → Won't Fix
tags: added: release-notes
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Guys, it seems this problem happens very often. From time to time after deployment some request fails with a 401 from keystone. Also, after a destructive action (like killing all haproxy processes on a controller) we receive 401 from keystone for 3-6 minutes. To me this situation does not look like medium importance (especially given that we did not have such behaviour previously in 6.0).

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Confirm that this problem has appeared in many tests recently.
For example, after rebooting all nodes, keystone became functional only after 5-10 minutes.
Logs are attached.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
status: Won't Fix → Confirmed
assignee: Inessa Vasilevskaya (ivasilevskaya) → MOS Keystone (mos-keystone)
Revision history for this message
Mike Scherbakov (mihgen) wrote :

This can't be Invalid for 7.0 before you are certain that there is no such issue. It is supposed to be fixed in Kilo, but we need to check this.

For 6.1, please see if we can do any workarounds. This has pretty serious implications for our HA story.

Revision history for this message
Alexander Makarov (amakarov) wrote :

I remember we discussed this bug with Mike Fedosin, and the reason indeed appeared to be in KSM token caching. The easiest workaround I see for now is to disable that caching. There is a risk, however, of a serious performance impact.
The correct way is to request a new token on the client.

I'll prepare a patch to disable KSM caching, as I don't want to introduce a quick-and-dirty solution in the main repository. If this option wins, we can always adopt it in our distribution somehow, but it has really slim chances to land in the community.

Revision history for this message
Alexander Makarov (amakarov) wrote :

It seems there is no need for a patch:

https://review.fuel-infra.org/gitweb?p=openstack/keystonemiddleware.git;a=blob;f=keystonemiddleware/auth_token.py;h=ba2f1d2ab559f65828dcbb971aaed98f9439917b;hb=refs/heads/openstack-ci/fuel-6.1/2014.2#l299

Just set the 'token_cache_time' configuration option to -1 and token caching will be disabled.
(@mfedosin, didn't we find this option the last time we investigated this bug?)
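
In the glance config that means the keystone_authtoken middleware section, for example in glance-api.conf (a minimal excerpt; the rest of the section is assumed unchanged):

    [keystone_authtoken]
    # -1 disables the middleware's in-process cache of validated tokens, so a
    # token whose backing data disappeared with a failed controller is
    # re-validated against keystone instead of being trusted until the
    # default 300 s cache entry expires.
    token_cache_time = -1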

Changed in fuel:
assignee: MOS Keystone (mos-keystone) → Tatyanka (tatyana-leontovich)
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Here are the logs from the env where we got 401 during the OSTF run.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Olesia Tsvigun (otsvigun) wrote :

The issue was also reproduced on an env with vCenter support. Logs are attached.

Changed in fuel:
assignee: Tatyanka (tatyana-leontovich) → nobody
assignee: nobody → MOS Keystone (mos-keystone)
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Alex, it seems we already have caching disabled, so there seems to be no reason to set -1, since global caching is disabled here: http://paste.openstack.org/show/232034/ (if I understand the options you've mentioned above correctly).

Revision history for this message
Alexander Makarov (amakarov) wrote :

Looks like you've found the wrong place: indeed, you don't need to change the keystone config at all; it is the glance config that holds the middleware setup.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Tested with token_cache_time = -1 for glance (deploy env and run OSTF 100 times; deploy env, power off the primary controller, and run OSTF in a loop (30-50-100)). The issue was not reproduced, so it seems we should add this option to the services' *.conf files.

Revision history for this message
Alexander Makarov (amakarov) wrote :

Won't Fix in 7.0 due to Fernet tokens.

Changed in fuel:
assignee: MOS Keystone (mos-keystone) → Fuel Library Team (fuel-library)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Oleksiy Molchanov (omolchanov)
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Invalid is a more proper state for 7.0, as Fernet tokens are actually going to fix the issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/185409

Changed in fuel:
status: Confirmed → In Progress
Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Sergii Golovatiuk (sgolovatiuk)
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Oleksiy Molchanov (omolchanov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/185409
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=47491e08d6a74741d42d40311e5e1f5ad1bc4006
Submitter: Jenkins
Branch: master

commit 47491e08d6a74741d42d40311e5e1f5ad1bc4006
Author: Oleksiy Molchanov <email address hidden>
Date: Mon May 25 18:26:59 2015 +0300

    Disable token caching for glance-api

    Change-Id: I7509aa35bdc05a0fb386b6001deec1f407913099
    granular-ci: openstack_haproxy_controller
    Closes-Bug: 1443913

Changed in fuel:
status: In Progress → Fix Committed
tags: added: release-notes-done
removed: release-notes
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Inessa Vasilevskaya <email address hidden>
Review: https://review.fuel-infra.org/8114

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/glance (openstack-ci/fuel-7.0/2015.1.0)

Change abandoned by Mike Fedosin <email address hidden> on branch: openstack-ci/fuel-7.0/2015.1.0
Review: https://review.fuel-infra.org/8114

Roman Rufanov (rrufanov)
tags: added: customer-found support