Keystone with memcached backend may fail in get tokens after the memcached restart

Bug #1432242 reported by Sergey Vasilenko
This bug affects 4 people
Affects             Status         Importance  Assigned to                Milestone
Fuel for OpenStack  Fix Released   Critical    Sergii Golovatiuk
5.1.x               Won't Fix      Critical    Fuel Library (Deprecated)
6.0.x               Fix Committed  Critical    Bogdan Dobrelya
6.1.x               Fix Released   Critical    Sergii Golovatiuk

Bug Description

Sometimes an otherwise successful deployment fails with this diagnostic:

AssertionError: Cidr after deployment is not equal to cidr by default

But after SSHing into the environment, running
# fuel health --env 1 --check sanity
gives a positive result.

In the logs I found the following related information:

2015-03-14 19:16:30,863 - DEBUG __init__.py:50 -- Done: get_nailgun_cidr_neutron with result: 192.168.196.0/22
2015-03-14 19:16:30,863 - DEBUG fuel_web_client.py:1366 -- nailgun cidr is 192.168.196.0/22
2015-03-14 19:16:30,863 - DEBUG helpers.py:324 -- Executing command: '. openrc; neutron subnet-list | awk '$4 == "net04__subnet"{print $6}''
2015-03-14 19:16:32,553 - DEBUG fuel_web_client.py:1372 -- slave cidr is
2015-03-14 19:16:32,554 - DEBUG __init__.py:45 -- Calling: generate_logs with args: (<fuelweb_test.models.nailgun_client.NailgunClient object at 0x7f35fc749ad0>,)

I can't imagine why somebody decided to use awk instead of running 'neutron subnet-list -f json' and parsing the JSON locally.
But
# neutron subnet-list | awk '$4 == "net04__subnet"{print $6}'
run on the controller gives the positive result expected by the test toolkit.
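A minimal sketch of the JSON approach mentioned above, assuming that 'neutron subnet-list -f json' emits a JSON array of objects keyed by the table's column names ("id", "name", "cidr"). The sample data and the function name are hypothetical and are not taken from the test toolkit.

```python
import json


def cidr_for_subnet(subnet_list_json, subnet_name):
    """Return the CIDR of the named subnet, or None if it is not listed."""
    subnets = json.loads(subnet_list_json)
    for subnet in subnets:
        if subnet.get("name") == subnet_name:
            return subnet.get("cidr")
    return None


# Hypothetical sample shaped like the CLI's table columns (assumption).
sample = json.dumps([
    {"id": "example-id-1", "name": "net04_ext__subnet", "cidr": "172.16.0.0/24"},
    {"id": "example-id-2", "name": "net04__subnet", "cidr": "192.168.111.0/24"},
])

print(cidr_for_subnet(sample, "net04__subnet"))
```

One difference from the awk pipeline: if the CLI prints nothing to stdout (for example after a 401), json.loads raises an error instead of silently yielding an empty CIDR.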

More information is available at
https://fuel-jenkins.mirantis.com/job/master.fuel-library.ubuntu.ha_neutron_vlan/1253

UPDATE
Steps to reproduce:
0) Deploy Ubuntu HA with 3 controllers
1) Restart the memcached services on the controllers one by one, but do not restart the keystone services
2) On one controller, run ". openrc; while true; do date; neutron subnet-list | awk '$4 == "net04__subnet" {print}'; sleep 2; done"
3) Watch for the periodic message in the keystone logs: "WARNING keystonemiddleware.auth_token [-] Authorization failed for token"
There are also sporadic authorization failures of neutron subnet-list,
and tracebacks in the neutron server logs similar to http://pastebin.com/n4xaj5yW

To "fix" it, just restart the keystone services on the controller nodes
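The polling loop from step 2 can also be sketched in Python, which makes it easier to count the sporadic failures. This is only an illustration: the helper names are hypothetical, and treating "subnet name missing from stdout" as a failure is an assumption that mirrors the awk match.

```python
import subprocess
import time


def subnet_listed(name="net04__subnet"):
    """Run the CLI once; True if the subnet name appears in its output."""
    proc = subprocess.run(
        ["bash", "-c", ". openrc; neutron subnet-list"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return name in proc.stdout


def poll(check, attempts, delay=2.0, sleep=time.sleep):
    """Call check() repeatedly and return a list of (attempt, ok) pairs."""
    results = []
    for attempt in range(attempts):
        try:
            ok = bool(check())
        except Exception:
            # Any error from the check counts as a failed attempt.
            ok = False
        results.append((attempt, ok))
        if attempt + 1 < attempts:
            sleep(delay)
    return results
```

On a controller, something like poll(subnet_listed, attempts=30) would give one entry per call; the attempts with ok set to False are the ones to correlate with the keystone log warnings.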

Tags: system-tests
Aleksandra Fedorova (bookwar) wrote :
summary: - fuel sanity check unreasonably falls after deployment
+ error on CI: Cidr after deployment is not equal to cidr by default
Changed in fuel:
importance: Undecided → Critical
Aleksandra Fedorova (bookwar) wrote : Re: error on CI: Cidr after deployment is not equal to cidr by default

This issue appears in many test runs on CI, thus Critical

Nastya Urlapova (aurlapova) wrote :

Do you have some specific env on Fuel-CI? As far as I can see, this issue does not reproduce on the 6.1 branch.

Changed in fuel:
importance: Critical → High
tags: added: system-tests
removed: ostf
Changed in fuel:
status: New → Confirmed
Sergey Vasilenko (xenolog) wrote :

This issue reproduces on CI
and may not reproduce on BVT.

I reverted the env
http://jenkins-product.srt.mirantis.net:8080/job/master.fuel-library.ubuntu.ha_neutron_vlan/7/
and passed OSTF by hand from the CLI:
http://paste.openstack.org/show/192488/

It looks very strange.

CI is blocked by this issue.

Changed in fuel:
importance: High → Critical
Bogdan Dobrelya (bogdando) wrote :

It seems that the root cause (RC) is an unexpected token invalidation, which ends in a "GET /v2.0/subnets.json HTTP/1.1" 401 error.
For example, see the job logs at http://jenkins-product.srt.mirantis.net:8080/job/master.fuel-library.ubuntu.ha_neutron_vlan/7/console

The logs contain:
node-1.test.domain.local/neutron-server.log:2015-03-15 12:13:52.171 21040 TRACE keystonemiddleware.auth_token InvalidToken: Token authorization failed
node-1.test.domain.local/keystone-all.log:2015-03-15T12:13:52.168088+00:00 warning: Could not find token: a9f105443f3e49caac8eecfe8c6719a8
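To correlate such entries across nodes, the token IDs can be pulled out of the keystone log lines. A minimal sketch, assuming the IDs are 32 lowercase hex characters as in the line above; the function name is hypothetical.

```python
import re

# Assumption: token IDs look like the 32-hex-character ID in the log above.
TOKEN_NOT_FOUND = re.compile(r"Could not find token: ([0-9a-f]{32})")


def missing_tokens(log_lines):
    """Return token IDs reported as not found, in order of first appearance."""
    seen = []
    for line in log_lines:
        match = TOKEN_NOT_FOUND.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


lines = [
    "node-1.test.domain.local/neutron-server.log:2015-03-15 12:13:52.171 21040 "
    "TRACE keystonemiddleware.auth_token InvalidToken: Token authorization failed",
    "node-1.test.domain.local/keystone-all.log:2015-03-15T12:13:52.168088+00:00 "
    "warning: Could not find token: a9f105443f3e49caac8eecfe8c6719a8",
]

print(missing_tokens(lines))
```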

Bogdan Dobrelya (bogdando) wrote :

Another example of a failed token is https://fuel-jenkins.mirantis.com/job/master.fuel-library.ubuntu.ha_neutron_vlan/1225/console
Here the CIDR check passed, but OSTF failed later due to:
2015-03-13T15:46:34.758406 node-4 ./node-4.test.domain.local/neutron-server.log:2015-03-13T15:46:34.758406+00:00 info: 2015-03-13 15:46:34.756 17827 INFO neutron.wsgi [-] 10.109.7.2 - - [13/Mar/2015 15:46:34] "GET //v2.0/networks.json HTTP/1.1" 401 283 0.289824

Bogdan Dobrelya (bogdando) wrote :

Here is the RC of the "unexpected token invalidation" as I see it. We use the memcached backend for keystone to store tokens, and the deployment process restarts the memcached service. The restart is expected, since we configure memcached as well. But all tokens then vanish without a proper expiration procedure, so keystonemiddleware.auth_token reports "Could not find token" for tokens that keystone still considers alive.
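The failure mode described in this comment can be illustrated with a toy in-memory store. This is a sketch of the hypothesis only, not keystone's token backend; the class and method names are made up for the example.

```python
import time
import uuid


class VolatileTokenStore:
    """Toy stand-in for a non-persistent token backend (hypothetical)."""

    def __init__(self):
        self._tokens = {}  # token id -> expiry timestamp

    def issue(self, ttl, now=None):
        now = time.time() if now is None else now
        token_id = uuid.uuid4().hex
        self._tokens[token_id] = now + ttl
        return token_id

    def validate(self, token_id, now=None):
        now = time.time() if now is None else now
        expires_at = self._tokens.get(token_id)
        if expires_at is None:
            return "not found"
        if now >= expires_at:
            return "expired"
        return "valid"

    def restart(self):
        """Simulate a service restart: all state is lost, expiry included."""
        self._tokens.clear()


store = VolatileTokenStore()
token = store.issue(ttl=3600, now=0)
print(store.validate(token, now=10))
store.restart()
print(store.validate(token, now=20))
```

After the restart the store cannot tell a wiped token from one that never existed, so a client holding a token that should still be alive only sees "not found".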

information type: Private → Public
Bogdan Dobrelya (bogdando) wrote :

But I'm not sure where keystone keeps all the expiry-related state when we use the memcached backend. If it also stores it in memcached, then it would be lost on restart as well, and the only fix would be for the keystone client to reissue the token on a "not found" error.

Bogdan Dobrelya (bogdando) wrote :

I asked Yuriy Taraday, and he explained that keystone keeps all expiry info in the non-persistent memcached backend. So it seems that the proper solution will be to fix the neutron client to retry 401 responses more than the single time it does now.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

I checked the CLI behavior: it creates a new token every time (there is a POST request in the keystone logs), so my guess in comment #11 was wrong.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/164796

Bogdan Dobrelya (bogdando) wrote : Re: error on CI: Cidr after deployment is not equal to cidr by default

So, as a workaround, we will add a @retry decorator to the test case that blocks CI.
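A retry decorator of this kind could look roughly like the sketch below. It is not the decorator used in fuel-qa; the parameter names and defaults are assumptions chosen for illustration.

```python
import functools
import time


def retry(count=3, delay=1.0, exceptions=(AssertionError,), sleep=time.sleep):
    """Re-run the wrapped callable up to `count` times on the given exceptions."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, count + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == count:
                        # Out of attempts: let the last failure propagate.
                        raise
                    sleep(delay)

        return wrapper

    return decorator
```

Such a wrapper only hides the sporadic 401 from the test; it does not address the backend behavior discussed in this bug.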

The complete solution should be a fix for the memcached backend for keystone: sometimes keystone cannot find a token recently issued to a client. We should debug both the POST and the failed GET requests from the client to the keystone API, and deeper down to the destination backends, in order to figure out what is going wrong.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Keystone (mos-keystone)
Bogdan Dobrelya (bogdando) wrote :

MOS-Keystone team, please help us to debug the issue with the backend.

The workarounds (w/a) for the fuel_web tests are:
https://review.openstack.org/#/c/164649/
https://review.openstack.org/#/c/164796/

description: updated
Bogdan Dobrelya (bogdando) wrote :

This issue was introduced with the separate memcached deployment step merged on Mar 3, 2015. The affected CI job deploys the "deploy_neutron_vlan_ha" test group, which installs swift. ISO #179 for CI, which contains this code, was updated on Mar 10, 2015, and we started to hit this issue on Mar 11, 2015.

The issue is the same as I mentioned in the bug's description: when memcached is restarted, keystone may fail to find tokens for a period of ~30 minutes. And we have another place where memcached is installed, in the Swift deploy step: https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/openstack/manifests/swift/proxy.pp#L63-65. So this code started to behave differently once we moved the memcached class out of the catalog into the separate deployment step https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/osnailyfacter/modular/memcached/memcached.pp

So, the related fix for fuel library is to ensure that the keystone service is restarted as well whenever memcached is restarted.

I believe that the complete fix should still be done in Keystone, as there shouldn't be any temporary issues with keystone when we restart its backend, so I leave this bug to the MOS-Keystone team.

Changed in fuel:
status: Confirmed → Triaged
summary: - error on CI: Cidr after deployment is not equal to cidr by default
+ Keystone with memcached backend may fail in get tokens after the
+ memcached restart
Matthew Mosesohn (raytrac3r) wrote :

Can we just work around the memcached restart issue by adding parameter service_restart => false to swift/proxy.pp?

Bogdan Dobrelya (bogdando) wrote :

Yes, that is an option

Aleksandr Didenko (adidenko) wrote :

No need to restart keystone. We can add a stub for memcached class in swift task like this:

class memcached {}
include memcached

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/165025

Changed in fuel:
assignee: MOS Keystone (mos-keystone) → Aleksandr Didenko (adidenko)
status: Triaged → In Progress
Aleksandr Didenko (adidenko) wrote :

No need to create stubs :) We can simply remove the memcached configuration from the openstack::swift::proxy class, since we already have a separate task for it.

Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Egor Kotko (ykotko)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/165034

Matthew Mosesohn (raytrac3r) wrote :

I think this solution we're looking at is inadequate. There could be more services later on that depend on memcached. Also, it's expected that keystone should be able to sort itself out appropriately if we restart memcached. The services all can tolerate restart of MySQL and RabbitMQ. Memcached should be no different.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/165034
Reason: wrong submit, see https://review.openstack.org/#/c/165025

Changed in fuel:
assignee: Egor Kotko (ykotko) → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

@Matthew, I agree that keystone should tolerate a restart of memcached, and that is exactly the subject of the complete fix. That is why the bug is assigned to MOS-Keystone. But we still have to unblock our CI for fuel library, so we should accept a w/a as well.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Keystone (mos-keystone)
status: In Progress → Confirmed
Boris Bobrov (bbobrov) wrote :

What stops working after memcache restart?

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/165025
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=e3867b699c6a32a837e64c4a892e4b56bda38609
Submitter: Jenkins
Branch: master

commit e3867b699c6a32a837e64c4a892e4b56bda38609
Author: Aleksandr Didenko <email address hidden>
Date: Tue Mar 17 12:52:59 2015 +0200

    Remove memcached configuration from swift task

    We configure memcached in a separate task so there's no need to do
    the same in openstack::swift::proxy class, unless we moved swift to
    the separate role.

    Fix pre- post- tests for swift and memcached:
     * Add memcached process check for swift post test
     * Add testcase for memcached should not listen for public ip

    Fix swift::proxy::cache parameter.

    DocImpact: (Ops guide) if the memcached service was restarted, the
      keystone service must be restarted next as well.

    Related-bug: #1432242
    Change-Id: I5e965f12e5ba0003004ae10442d63c358a104367
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: Confirmed → In Progress
Mike Scherbakov (mihgen)
Changed in fuel:
assignee: MOS Keystone (mos-keystone) → Alexander Makarov (amakarov)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/165203

Changed in fuel:
assignee: Alexander Makarov (amakarov) → Sergii Golovatiuk (sgolovatiuk)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/165203
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b001daf6c6352b1dac2a7339299ba1fa7d010c1e
Submitter: Jenkins
Branch: master

commit b001daf6c6352b1dac2a7339299ba1fa7d010c1e
Author: Sergii Golovatiuk <email address hidden>
Date: Tue Mar 17 21:01:56 2015 +0100

    Tune keystone settings to fail-over faster

    * Reduce dead_retry from 300 to 30. When 3 servers lost memcached server
      and detect it differently so the divergence of hash function is 5
      minutes.
    * Decrease socket_timeout from 3 seconds to 1 second to detect lost backend
      faster
    * Increase pool size from 100 to 1000 connections.

    Closes-Bug: 1432242

    Change-Id: Ib16a9cba4de8442afa2b78770cb93f561b0fd37e
    Signed-off-by: Sergii Golovatiuk <email address hidden>
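
The effect of dead_retry described in this commit message can be illustrated with a toy model. It assumes a client that skips a server it has marked dead for dead_retry seconds and remaps the key to the next live one. This is a simplification for illustration, not the actual algorithm of the memcached client library, and all names here are hypothetical.

```python
import zlib


class ToyMemcacheClient:
    """One keystone process's view of the memcached pool (toy model)."""

    def __init__(self, servers, dead_retry):
        self.servers = list(servers)
        self.dead_retry = dead_retry
        self.dead_until = {}  # server -> time at which it may be retried

    def mark_dead(self, server, now):
        self.dead_until[server] = now + self.dead_retry

    def is_alive(self, server, now):
        return now >= self.dead_until.get(server, float("-inf"))

    def pick(self, key, now):
        """Map key to the first live server, starting at its hashed slot."""
        start = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        for step in range(len(self.servers)):
            server = self.servers[(start + step) % len(self.servers)]
            if self.is_alive(server, now):
                return server
        raise RuntimeError("no live memcached server")


servers = ["node-1", "node-2", "node-3"]
a = ToyMemcacheClient(servers, dead_retry=300)
b = ToyMemcacheClient(servers, dead_retry=300)
a.mark_dead("node-1", now=0)  # only client a noticed the restart

keys = ["token-%d" % i for i in range(100)]
diverging = [k for k in keys if a.pick(k, now=10) != b.pick(k, now=10)]
print(len(diverging) > 0)
```

While client a treats node-1 as dead, the two clients look up some tokens on different servers, so a token written through one may not be found through the other. A smaller dead_retry shortens that window.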

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/164649
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=585f9af050c422f5b60391d2b88bfb5c1f7b2aec
Submitter: Jenkins
Branch: master

commit 585f9af050c422f5b60391d2b88bfb5c1f7b2aec
Author: Egor Kotko <email address hidden>
Date: Mon Mar 16 12:15:56 2015 +0100

    Add output of full subnet from slave node

    Temporary solution for debug

    Change-Id: I34d1f6e612363c0ecc0a567cd4154f822e12918a
    closes-bug: #1432242

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-qa (master)

Change abandoned by Nastya Urlapova (<email address hidden>) on branch: master
Review: https://review.openstack.org/164796

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.0)

Fix proposed to branch: stable/6.0
Review: https://review.openstack.org/170035

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.0)

Reviewed: https://review.openstack.org/170035
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=afa696f397bf0ceeb0c9b97385085f2ad16c80ce
Submitter: Jenkins
Branch: stable/6.0

commit afa696f397bf0ceeb0c9b97385085f2ad16c80ce
Author: Sergii Golovatiuk <email address hidden>
Date: Tue Mar 17 21:01:56 2015 +0100

    Tune keystone settings to fail-over faster

    * Reduce dead_retry from 300 to 30. When 3 servers lost memcached server
      and detect it differently so the divergence of hash function is 5
      minutes.
    * Decrease socket_timeout from 3 seconds to 1 second to detect lost backend
      faster
    * Increase pool size from 100 to 1000 connections.

    Closes-Bug: 1432242

    Change-Id: Ib16a9cba4de8442afa2b78770cb93f561b0fd37e
    Signed-off-by: Sergii Golovatiuk <email address hidden>
    (cherry picked from commit b001daf6c6352b1dac2a7339299ba1fa7d010c1e)

tags: added: on-verification
Sergey Novikov (snovikov) wrote :

Verified on fuel-6.1-445-2015-05-20_22-10-04.iso.

Steps to verify:
    1. Deploy Ubuntu HA with 3 controllers
    2. Restart memcached services at controllers one by one, but do not restart keystone services
    3. At some controller, run ". openrc; while true; do date; neutron subnet-list | awk '$4 == "net04__subnet" {print}'; sleep 2; done"

tags: removed: on-verification
Vitaly Sedelnik (vsedelnik) wrote :

Won't fix for 5.1.1-updates, as this is a deployment-time fix and we expect no new 5.1.1 deployments.
