Deployment in scale test case failed with '409 Conflict' error on flavor creation

Bug #1449584 reported by Tatyanka
Affects              Status         Importance  Assigned to                 Milestone
Fuel for OpenStack   Fix Committed  Medium      Bogdan Dobrelya
  6.1.x              Fix Committed  Medium      Fuel Library (Deprecated)
Mirantis OpenStack   Won't Fix      Medium      Alexander Makarov
  6.0.x              Won't Fix      Critical    Alexander Makarov
  7.0.x              Won't Fix      Medium      Alexander Makarov

Bug Description

http://jenkins-product.srt.mirantis.net:8080/job/6.1.system_test.centos.thread_4/107/testReport/%28root%29/ha_flat_scalability/ha_flat_scalability/?
[root@nailgun nginx]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "357"
  build_id: "2015-04-27_22-54-38"
  nailgun_sha: "5e52637d9944c2f4170012560d15ecf89a691af6"
  python-fuelclient_sha: "8cd6cf575d3c101dee1032abb6877dfa8487e077"
  astute_sha: "c1793f982fda7e3fc7b937ccaa613c649be6a144"
  fuel-library_sha: "0e5b82d24853304befb22145ac4aaf3545d295e1"
  fuel-ostf_sha: "b38602c841deaa03ddffc95c02f319360462cbe3"
  fuelmain_sha: "1ec588d364b9b97f124f6d602dbcc4aa13327218"

Scenario ("Check HA mode on scalability"):
    1. Create cluster
    2. Add 1 controller node
    3. Deploy the cluster
    4. Add 2 controller nodes
    5. Deploy changes
    6. Run network verification
    7. Add 2 controller nodes
    8. Deploy changes
    9. Run network verification
    10. Run OSTF

On CentOS the test fails with a deployment error at step 7: when we add controllers, puppet starts to redeploy the primary controller, checks whether the micro flavor exists and fails with a 401 auth error, then tries to create a new micro flavor and fails with a 409 Conflict error.

http://paste.openstack.org/show/210248/

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
status: New → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I guess the solution is to mask the flavor creation command exit code with /bin/true

Changed in fuel:
status: Confirmed → Triaged
tags: added: low-hanging-fruit
Ryan Moe (rmoe)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Ryan Moe (rmoe)
Revision history for this message
Ryan Moe (rmoe) wrote :

The issue is with the "unless" guard on the flavor creation statement. When it attempts to list all flavors to look for m1.micro, it fails with a 401 error. That causes the flavor-create command to run when it shouldn't. There is an error message in the nova logs:

 Identity response: {"error": {"message": "Could not find token: bccaec7e76974e7d8b2e54565f1df2f4", "code": 404, "title": "Not Found"}}
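
For illustration, here is roughly how the guard-then-create sequence goes wrong (a hedged Python sketch; the real logic lives in fuel-library's Puppet manifests, and the exact flavor-create arguments are assumptions):

    import subprocess

    # Sketch of the failure mode: the "unless" guard fails for the wrong
    # reason (a 401 from keystone rather than a missing flavor), so the
    # create command runs against an existing flavor and hits 409 Conflict.
    def ensure_micro_flavor():
        guard = subprocess.run("nova flavor-list | grep -q m1.micro",
                               shell=True)
        # A 401 from the list command also yields a nonzero exit code,
        # indistinguishable here from "flavor not found".
        if guard.returncode != 0:
            subprocess.run(
                "nova flavor-create --is-public true m1.micro auto 64 0 1",
                shell=True, check=True)  # raises on the 409 Conflict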

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There is an issue with the memcache_pool backend "invalidating" cached tokens after new nodes have been added to the memcached pool. The "invalidation" is nothing more than keystone looking for existing tokens on the recently added memcached nodes and failing because there are no such tokens there - and it cannot yet fail over, because the hash function is being rebuilt for the new set of nodes.

The solution is to fix the memcache pool backend appropriately, but this cannot be done for 6.1. The workaround is to issue token-get right before performing any CLI operations and pass this token as an argument.

The documentation should also be updated to point out this scale issue with the memcache_pool keystone backend and to recommend always re-getting new tokens after new controller nodes have been added.
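
To illustrate the effect (a simplified sketch; python-memcached's real server-selection hash differs, and the node addresses are made up):

    import zlib

    def server_for(key, servers):
        # python-memcached derives the target server from a hash of the
        # key; with a modulo scheme, changing the server count silently
        # remaps a share of existing keys to different nodes.
        return servers[zlib.crc32(key.encode()) % len(servers)]

    before = ["ctrl-1:11211", "ctrl-2:11211", "ctrl-3:11211"]
    after = before + ["ctrl-4:11211", "ctrl-5:11211"]

    token = "bccaec7e76974e7d8b2e54565f1df2f4"
    # After scale-up, keystone may look the token up on a different node
    # than the one it was stored on, see a miss, and return a 401.
    print(server_for(token, before), "->", server_for(token, after))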

Changed in mos:
milestone: none → 7.0
importance: Undecided → High
assignee: nobody → MOS Keystone (mos-keystone)
tags: added: release-notes
removed: low-hanging-fruit
Revision history for this message
Boris Bobrov (bbobrov) wrote :

> The solution is to fix the memcache pool backend appropriately

No. It is not memcache_pool's logic, it is python-memcached's logic. We need to either fix it there or use SQL. Or wait for Fernet tokens.

Revision history for this message
Alexander Makarov (amakarov) wrote :

To make this case work we would need something like Cassandra as a token backend, since it stores locations for keys, and that would be overkill. Such a proposal was made in the community and was rejected.
The problem will go away as soon as we adopt non-persistent Fernet tokens.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Folks

This is a critical issue for production deployments - if you add a controller, you do not expect an interruption of service. We need to fix the key-storing algorithm, and we cannot wait for Fernet tokens.

Changed in mos:
status: New → Confirmed
importance: High → Critical
Changed in mos:
milestone: 7.0 → 6.1
Revision history for this message
Boris Bobrov (bbobrov) wrote :

> This is a critical issue for production deployments - if you add a controller, you do not expect an interruption of service. We need to fix the key-storing algorithm, and we cannot wait for Fernet tokens.

The way you work with keystone tokens is wrong. You should not assume any token validity period. The code should assume that the token can become invalid at any time and be ready to fetch a new one. There are many events that can lead to invalidation of a token: node failure, token or user credentials compromise, user role changes, project changes, etc. Adding a controller node is just one of these events.

There is no way to "fix the key storing algorithm" without rewriting python-memcached. Even if we exert ourselves and rewrite python-memcached, I don't think it's OK to include such a huge change during code freeze.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

> The way you work with keystone tokens is wrong. You should not assume
> any token validity period. The code should assume that the token can
> become invalid at any time and be ready to fetch a new one.

I agree here. That is why I suggested modifying the way Fuel works with tokens and updating the documentation so that users would know what to change as well.

Revision history for this message
Alexander Makarov (amakarov) wrote :

> We need to fix the key-storing algorithm, and we cannot wait for Fernet tokens.
Disagree: this is not the root cause. If you want to fix lost tokens, you need to make the client request a token again whenever a 401 response is received for a token that is supposed to be valid.
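
In client terms that pattern looks roughly like this (a hedged sketch; do_request and get_token are hypothetical callables standing in for the real client plumbing):

    def call_with_reauth(do_request, get_token, retries=3):
        # Treat a 401 on a supposedly valid token as "token lost":
        # fetch a fresh token and retry instead of failing outright.
        token = get_token()
        for _ in range(retries):
            status, body = do_request(token)
            if status != 401:
                return status, body
            token = get_token()
        return status, body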

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

This problem will disappear automatically in Kilo, where Fernet tokens are already implemented. The memcache dependency has already been removed in Kilo: https://github.com/openstack/keystone/blob/master/requirements.txt

Revision history for this message
Boris Bobrov (bbobrov) wrote :

Sergii, it was removed for another reason, not because of tokens.

Ryan Moe (rmoe)
Changed in fuel:
assignee: Ryan Moe (rmoe) → Fuel Library Team (fuel-library)
Revision history for this message
Andrew Woodward (xarses) wrote :

Guys, the program creating the failed token request here is the python-novaclient CLI, using password-based auth, from

https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/osnailyfacter/modular/openstack-controller/openstack-controller.pp#L297-309

Revision history for this message
Andrew Woodward (xarses) wrote :

We can reproduce this with the following (you may need to change the IP to match your controllers' mgmt addr); it should occur within 10 iterations:

while true; do echo 'flush_all 2' | nc 192.168.0.4 11211; nova --debug flavor-list || break ; done

Revision history for this message
Andrew Woodward (xarses) wrote :

Paste of the client failure http://paste.openstack.org/show/214577/
Paste of the keystone-api (in debug) http://paste.openstack.org/show/214580/
Paste of the nova-api http://paste.openstack.org/show/214579/

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Andrew, confirmed - I can reproduce it every time, on about every 3rd iteration, even without adding new controllers:

while true; do echo 'flush_all 2' | nc 192.168.0.4 11211; nova flavor-list || break ; done

http://paste.openstack.org/

This looks like a major issue with the aforementioned keystone backend.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Sorry, it looks like the reproduction is floating, not stable :)

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here is a script you can try to check both the token-based and the creds-based cases: http://pastebin.com/NnMW0c39
If you pass the 'creds' argument, it requests nova flavor-list with the nova user and password credentials; otherwise it does so with a new token for each request.
The script will silently exit with 1 if it gets an "Unauthorized" error.

Note that I can reproduce the issue even when new tokens are requested for each flavor-list.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Alexander

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I posted another bug to split the discussion: https://bugs.launchpad.net/fuel/+bug/1451515 . This original bug can be worked around with puppet kludges, but obviously it should be fixed in the mainline code.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The backend should not return a 401 unless it has asked all of the configured memcached instances. This now looks like a programming error.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Given comments above, I can suggest nothing but this kludge https://bugs.launchpad.net/fuel/+bug/1449584/comments/2

Revision history for this message
Dmitry Savenkov (dsavenkov) wrote :

Alexander, we are waiting for your suggestions/feedback.

Changed in mos:
assignee: MOS Keystone (mos-keystone) → Alexander Makarov (amakarov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/180105
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=d333d4d7fcbfe012f8621bf7515beb0f0c7d99ba
Submitter: Jenkins
Branch: master

commit d333d4d7fcbfe012f8621bf7515beb0f0c7d99ba
Author: Bogdan Dobrelya <email address hidden>
Date: Tue May 5 12:23:49 2015 +0200

    Use retries for flavor-create

    W/o this fix, the keystone token backend may
    sporadically fail to find a token on any nova CLI command.
    It may even fail when a new token is passed
    as an argument for each CLI command.

    For now, the solution cannot be implemented on the puppet side;
    there is only the workaround, which is to retry nova CLI commands.

    Related-bug: #1449584

    Change-Id: I0f4895fe01c6811b89699677a46cbcecc6f04930
    Signed-off-by: Bogdan Dobrelya <email address hidden>
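
The retry idea, roughly (an illustrative Python sketch, not the merged Puppet change; see the review link above for the actual fix):

    import subprocess
    import time

    def nova_with_retries(args, tries=10, delay=2):
        # Retry the whole CLI call: transient 401s from the token
        # backend usually clear up on a subsequent attempt.
        for _ in range(tries):
            if subprocess.run(["nova"] + args).returncode == 0:
                return
            time.sleep(delay)
        raise RuntimeError("nova %s kept failing" % " ".join(args))

    # e.g. nova_with_retries(["flavor-create", "--is-public", "true",
    #                         "m1.micro", "auto", "64", "0", "1"])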

Revision history for this message
Alexander Makarov (amakarov) wrote : Re: Deployment in scale test case failed woth 409 Conflict error on flavor creation

Added to known issues in release notes: https://review.openstack.org/#/c/180196/

Changed in mos:
status: Confirmed → Won't Fix
Changed in fuel:
status: Triaged → Fix Committed
Revision history for this message
Andrew Woodward (xarses) wrote :

Added to 7.0 to attempt to find a better solution than hacky retries in puppet.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Vova, Bogdan, Andrew, please set one correct state for 6.1 and 7.0

tags: added: release-notes-done
removed: release-notes
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

It's already fixed in 6.1 with re-tries.

no longer affects: fuel/6.0.x
no longer affects: fuel/6.1.x
Changed in mos:
status: Won't Fix → Invalid
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Since it already has a solution in 7.0 (re-tries), I'm lowering the priority to medium.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

The better solution here is Fernet tokens, which are not going to happen in 7.0 and will be implemented via a separate blueprint.

no longer affects: fuel/7.0.x
summary: - Deployment in scale test case failed woth 409 Conflict error on flavor
+ Deployment in scale test case failed with '409 Conflict' error on flavor
creation
Changed in mos:
status: Confirmed → Won't Fix