Fuel for OpenStack

Keystone with memcached backend may fail in get tokens after the memcached restart

Bug #1432242 reported by Sergey Vasilenko on 2015-03-14

This bug affects 4 people

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Released	Critical	Sergii Golovatiuk	Fuel for OpenStack 6.1
5.1.x	Won't Fix	Critical	Fuel Library (Deprecated)	Fuel for OpenStack 5.1.1-updates
6.0.x	Fix Committed	Critical	Bogdan Dobrelya	Fuel for OpenStack 6.0-updates
6.1.x	Fix Released	Critical	Sergii Golovatiuk	Fuel for OpenStack 6.1

Bug Description

Sometimes a deployment that actually succeeded is reported as failed with this diagnostic:

AssertionError: Cidr after deployment is not equal to cidr by default

But if I SSH into the environment and run
# fuel health --env 1 --check sanity
it gives a positive result.

In the logs I found the following related information:

2015-03-14 19:16:30,863 - DEBUG __init__.py:50 -- Done: get_nailgun_cidr_neutron with result: 192.168.196.0/22
2015-03-14 19:16:30,863 - DEBUG fuel_web_client.py:1366 -- nailgun cidr is 192.168.196.0/22
2015-03-14 19:16:30,863 - DEBUG helpers.py:324 -- Executing command: '. openrc; neutron subnet-list | awk '$4 == "net04__subnet"{print $6}''
2015-03-14 19:16:32,553 - DEBUG fuel_web_client.py:1372 -- slave cidr is
2015-03-14 19:16:32,554 - DEBUG __init__.py:45 -- Calling: generate_logs with args: (<fuelweb_test.models.nailgun_client.NailgunClient object at 0x7f35fc749ad0>,)

I can't imagine why somebody decided to use awk instead of running 'neutron subnet-list -f json' and parsing the JSON locally.
But
# neutron subnet-list | awk '$4 == "net04__subnet"{print $6}'
run on the controller gives the positive result expected by the test toolkit.
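
As a rough sketch of the JSON-based alternative mentioned above, the CIDR lookup could be done by parsing the output of `neutron subnet-list -f json` in Python. The helper name `find_subnet_cidr` and the sample payload are illustrative only, and the `name` and `cidr` keys are assumed to match the columns that the awk expression reads.

```python
import json


def find_subnet_cidr(subnet_list_json, subnet_name):
    """Return the CIDR of the named subnet, or None if it is absent.

    `subnet_list_json` is assumed to be the raw stdout of
    `neutron subnet-list -f json`: a JSON array of objects
    with (at least) "name" and "cidr" keys.
    """
    if not subnet_list_json.strip():
        # An empty reply (for example after a rejected token) yields no subnet.
        return None
    for subnet in json.loads(subnet_list_json):
        if subnet.get("name") == subnet_name:
            return subnet.get("cidr")
    return None


# Hypothetical sample output, shaped like the table the awk command parses.
sample = json.dumps([
    {"id": "aaa", "name": "net04_ext__subnet", "cidr": "172.16.0.0/24"},
    {"id": "bbb", "name": "net04__subnet", "cidr": "192.168.196.0/22"},
])
print(find_subnet_cidr(sample, "net04__subnet"))  # 192.168.196.0/22
```

Unlike the awk pipeline, this makes an empty reply an explicit `None`, so a test could tell "subnet missing" apart from "CIDR differs".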

More information is available at
https://fuel-jenkins.mirantis.com/job/master.fuel-library.ubuntu.ha_neutron_vlan/1253

UPDATE
Reproducing steps:
0) Deploy Ubuntu HA with 3 controllers
1) Restart memcached services on the controllers one by one, but do not restart keystone services
2) On one controller, run ". openrc; while true; do date; neutron subnet-list | awk '$4 == "net04__subnet" {print}'; sleep 2; done"
3) Watch for the periodic message in keystone logs: "WARNING keystonemiddleware.auth_token [-] Authorization failed for token"
There are also sporadic authorization failures for neutron subnet-list,
with tracebacks in the neutron server logs similar to http://pastebin.com/n4xaj5yW

To "fix" it, just restart the keystone services on the controller nodes

Revision history for this message

Sergey Vasilenko (xenolog) wrote on 2015-03-14:

https://github.com/stackforge/fuel-qa/blob/ee7c5b3332d145e8dcad09685b0d8831f594aa1f/fuelweb_test/models/fuel_web_client.py#L1350

Aleksandra Fedorova (bookwar) wrote on 2015-03-14:

fail_error_deploy_neutron_vlan_ha-2015_03_14__19_17_48.tar.gz (96.1 MiB, application/x-tar)

summary:	- fuel sanity check unreasonably falls after deployment + error on CI: Cidr after deployment is not equal to cidr by default
Changed in fuel:
importance:	Undecided → Critical

Aleksandra Fedorova (bookwar) wrote on 2015-03-14: Re: error on CI: Cidr after deployment is not equal to cidr by default

This issue appears in many test runs on CI, thus Critical

Aleksandra Fedorova (bookwar) wrote on 2015-03-15:

reproduced internally http://jenkins-product.srt.mirantis.net:8080/job/master.fuel-library.ubuntu.ha_neutron_vlan/7/

environment is available

Nastya Urlapova (aurlapova) wrote on 2015-03-15:

You have some specific env on Fuel-CI; as far as I can see, this issue does not reproduce with the 6.1 branch?

Changed in fuel:
importance:	Critical → High
tags:	added: system-tests removed: ostf
Changed in fuel:
status:	New → Confirmed

Sergey Vasilenko (xenolog) wrote on 2015-03-15:

This issue is reproduced on CI
and maybe not reproduced on BVT.

I reverted the env
http://jenkins-product.srt.mirantis.net:8080/job/master.fuel-library.ubuntu.ha_neutron_vlan/7/
and passed OSTF by hand from the CLI.
http://paste.openstack.org/show/192488/

It looks very strange.

CI is blocked by this issue.

Changed in fuel:
importance:	High → Critical

Aleksandra Fedorova (bookwar) wrote on 2015-03-16:

6.1 fuel version = current master branch

So these are exactly the jobs that test code for 6.1, and they block CI for it

https://fuel-jenkins.mirantis.com/job/master.fuel-library.ubuntu.ha_neutron_vlan/1253
http://jenkins-product.srt.mirantis.net:8080/job/master.fuel-library.ubuntu.ha_neutron_vlan/7/

Bogdan Dobrelya (bogdando) wrote on 2015-03-16:

Suggested w/a https://review.openstack.org/164630

Bogdan Dobrelya (bogdando) wrote on 2015-03-16:

It seems that the root cause (RC) is unexpected token invalidation, which ends up with a "GET /v2.0/subnets.json HTTP/1.1" 401 error.
For example, see job logs for http://jenkins-product.srt.mirantis.net:8080/job/master.fuel-library.ubuntu.ha_neutron_vlan/7/console

The logs contain:
node-1.test.domain.local/neutron-server.log:2015-03-15 12:13:52.171 21040 TRACE keystonemiddleware.auth_token InvalidToken: Token authorization failed
node-1.test.domain.local/keystone-all.log:2015-03-15T12:13:52.168088+00:00 warning: Could not find token: a9f105443f3e49caac8eecfe8c6719a8

Bogdan Dobrelya (bogdando) wrote on 2015-03-16:

#10

Another example of a failed token is https://fuel-jenkins.mirantis.com/job/master.fuel-library.ubuntu.ha_neutron_vlan/1225/console
Here the CIDR check passed, but OSTF failed later due to:
2015-03-13T15:46:34.758406 node-4 ./node-4.test.domain.local/neutron-server.log:2015-03-13T15:46:34.758406+00:00 info: 2015-03-13 15:46:34.756 17827 INFO neutron.wsgi [-] 10.109.7.2 - - [13/Mar/2015 15:46:34] "GET //v2.0/networks.json HTTP/1.1" 401 283 0.289824

Bogdan Dobrelya (bogdando) wrote on 2015-03-16:

#11

Here is the RC of "unexpected token invalidation" as I see it. We use the memcached backend for keystone to store tokens, and the deployment process restarts the memcached service. The restart is expected, since we configure memcached as well. But all tokens then vanish without a proper expiration procedure, so keystonemiddleware.auth_token reports "Could not find token" for such tokens, as keystone considers them alive.
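
The hypothesis in this comment can be illustrated with a toy model. This is only a sketch: `VolatileTokenStore` is a hypothetical stand-in for a non-persistent backend, not keystone's actual token driver, and the token ID is made up.

```python
class VolatileTokenStore:
    """Toy stand-in for a non-persistent (memcached-like) token backend."""

    def __init__(self):
        self._tokens = {}

    def issue(self, token_id, user):
        self._tokens[token_id] = user

    def validate(self, token_id):
        try:
            return self._tokens[token_id]
        except KeyError:
            raise LookupError("Could not find token: %s" % token_id)

    def restart(self):
        # A restart drops everything held in memory, with no expiry step.
        self._tokens.clear()


store = VolatileTokenStore()
store.issue("a9f1", "admin")
print(store.validate("a9f1"))   # the token is found before the restart
store.restart()
try:
    store.validate("a9f1")
except LookupError as exc:
    print(exc)                  # the same token is now unknown
```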

Aleksandra Fedorova (bookwar) on 2015-03-16

information type:	Private → Public

Bogdan Dobrelya (bogdando) wrote on 2015-03-16:

#12

But I'm not sure where keystone keeps all the expiry-related state when we use the memcached backend. If it also stores it in memcached, then it would be lost on restart as well, and the only fix would then be for the keystone client to reissue the token on a "not found" error.

Bogdan Dobrelya (bogdando) wrote on 2015-03-16:

#13

I asked Yuriy Taraday, and he explained that keystone keeps all expiry info in the non-persistent memcached backend. So it seems that the proper solution will be to fix the neutron client to retry 401 responses more than the one time it does now

Nastya Urlapova (aurlapova) on 2015-03-16

Changed in fuel:
assignee:	Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)

Bogdan Dobrelya (bogdando) wrote on 2015-03-16:

#14

I checked the CLI behavior: it creates a new token every time (there is a POST request in the keystone logs), so my guess in comment #11 was wrong

OpenStack Infra (hudson-openstack) wrote on 2015-03-16: Related fix proposed to fuel-qa (master)

#15

Related fix proposed to branch: master
Review: https://review.openstack.org/164796

Bogdan Dobrelya (bogdando) wrote on 2015-03-16: Re: error on CI: Cidr after deployment is not equal to cidr by default

#16

So, as a workaround we will add a @retry decorator to the test case that blocks CI.

The complete solution should be a fix for the memcached backend for keystone: sometimes keystone cannot find a token recently issued to a client. We should debug both the POST and the failed GET requests from the client to the keystone API, and deeper to the destination backends, in order to figure out what is going wrong
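
For illustration, a retry decorator of the kind described could look roughly like this. It is a minimal sketch, not the decorator from the linked reviews; the names `retry` and `flaky_cidr_check`, the parameters, and the simulated failure are all assumptions.

```python
import functools
import time


def retry(count=3, delay=0.0, exceptions=(AssertionError,)):
    """Re-run the wrapped callable up to `count` times on failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, count + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == count:
                        raise  # out of attempts: surface the last failure
                    time.sleep(delay)
        return wrapper
    return decorator


attempts = []


@retry(count=3)
def flaky_cidr_check():
    # Hypothetical check that fails twice (as if the token was rejected)
    # and succeeds on the third call.
    attempts.append(1)
    assert len(attempts) >= 3, "Cidr after deployment is not equal to cidr by default"
    return "192.168.196.0/22"


print(flaky_cidr_check(), "after", len(attempts), "attempts")
```

Such a decorator only hides the intermittent 401 from the test; it does not address why the token lookup fails.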

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → MOS Keystone (mos-keystone)

Bogdan Dobrelya (bogdando) wrote on 2015-03-16:

#17

MOS-Keystone team, please help us debug the issue with the backend.

The workarounds for the fuel_web tests are:
https://review.openstack.org/#/c/164649/
https://review.openstack.org/#/c/164796/

Bogdan Dobrelya (bogdando) on 2015-03-16

description:	updated

Bogdan Dobrelya (bogdando) wrote on 2015-03-17:

#18

This issue was introduced with the separate memcached deployment step merged on Mar 3, 2015. The affected CI job deploys the "deploy_neutron_vlan_ha" test group, which installs swift. ISO #179 for CI with this code was updated on Mar 10, 2015, and we started to be hit by this issue from Mar 11, 2015.

The issue is the same as I mentioned in the bug's description: when memcached is restarted, keystone may fail to find tokens for a ~30 minute period. And we have another place where memcached is installed, in the Swift deploy step https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/openstack/manifests/swift/proxy.pp#L63-65. So this code started to behave differently once we moved the memcached class out of the catalog into a separate deployment step https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/osnailyfacter/modular/memcached/memcached.pp

So, the related fix for fuel library is to ensure that the keystone service is restarted as well once we have restarted memcached.

I believe that the complete fix should still be done in Keystone, as there shouldn't be any temporary issues with keystone when we restart its backend, so I leave this bug for the MOS-Keystone team.

Bogdan Dobrelya (bogdando) on 2015-03-17

Changed in fuel:
status:	Confirmed → Triaged

Bogdan Dobrelya (bogdando) on 2015-03-17

summary:

- error on CI: Cidr after deployment is not equal to cidr by default
+ Keystone with memcached backend may fail in get tokens after the
+ memcached restart

Matthew Mosesohn (raytrac3r) wrote on 2015-03-17:

#19

Can we just work around the memcached restart issue by adding the parameter service_restart => false to swift/proxy.pp?

Bogdan Dobrelya (bogdando) wrote on 2015-03-17:

#20

Yes, that is an option

Aleksandr Didenko (adidenko) wrote on 2015-03-17:

#21

No need to restart keystone. We can add a stub for memcached class in swift task like this:

class memcached {}
include memcached

OpenStack Infra (hudson-openstack) wrote on 2015-03-17: Fix proposed to fuel-library (master)

#22

Fix proposed to branch: master
Review: https://review.openstack.org/165025

Changed in fuel:
assignee:	MOS Keystone (mos-keystone) → Aleksandr Didenko (adidenko)
status:	Triaged → In Progress

Aleksandr Didenko (adidenko) wrote on 2015-03-17:

#23

No need to create stubs :) We can simply remove the memcached configuration from the openstack::swift::proxy class, since we already have a separate task for it.

OpenStack Infra (hudson-openstack) on 2015-03-17

Changed in fuel:
assignee:	Aleksandr Didenko (adidenko) → Egor Kotko (ykotko)

Bogdan Dobrelya (bogdando) wrote on 2015-03-17:

#24

Related bug for runlevels https://bugs.launchpad.net/fuel/+bug/1433038

OpenStack Infra (hudson-openstack) wrote on 2015-03-17: Related fix proposed to fuel-library (master)

#25

Related fix proposed to branch: master
Review: https://review.openstack.org/165034

Matthew Mosesohn (raytrac3r) wrote on 2015-03-17:

#26

I think this solution we're looking at is inadequate. There could be more services later on that depend on memcached. Also, it's expected that keystone should be able to sort itself out appropriately if we restart memcached. The services all can tolerate restart of MySQL and RabbitMQ. Memcached should be no different.

OpenStack Infra (hudson-openstack) wrote on 2015-03-17: Change abandoned on fuel-library (master)

#27

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/165034
Reason: wrong submit, see https://review.openstack.org/#/c/165025

Nastya Urlapova (aurlapova) on 2015-03-17

Changed in fuel:
assignee:	Egor Kotko (ykotko) → Fuel Library Team (fuel-library)

Bogdan Dobrelya (bogdando) wrote on 2015-03-17:

#29

@Matthew, I agree that keystone should tolerate a restart of memcached, and that is exactly the subject of the complete fix. That is why the bug is assigned to MOS-Keystone. But we still have to unblock our CI for fuel library, so we should accept a workaround as well

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → MOS Keystone (mos-keystone)
status:	In Progress → Confirmed

Boris Bobrov (bbobrov) wrote on 2015-03-17:

#30

What stops working after memcache restart?

OpenStack Infra (hudson-openstack) wrote on 2015-03-17: Related fix merged to fuel-library (master)

#31

Reviewed: https://review.openstack.org/165025
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=e3867b699c6a32a837e64c4a892e4b56bda38609
Submitter: Jenkins
Branch: master

commit e3867b699c6a32a837e64c4a892e4b56bda38609
Author: Aleksandr Didenko <email address hidden>
Date: Tue Mar 17 12:52:59 2015 +0200

    Remove memcached configuration from swift task

    We configure memcached in a separate task so there's no need to do
    the same in openstack::swift::proxy class, unless we moved swift to
    the separate role.

    Fix pre- post- tests for swift and memcached:
     * Add memcached process check for swift post test
     * Add testcase for memcached should not listen for public ip

    Fix swift::proxy::cache parameter.

    DocImpact: (Ops guide) if the memcached service was restarted, the
    keystone service must be restarted next as well.

    Related-bug: #1432242
    Change-Id: I5e965f12e5ba0003004ae10442d63c358a104367
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Alexander Makarov (amakarov) on 2015-03-17

Changed in fuel:
status:	Confirmed → In Progress

Mike Scherbakov (mihgen) on 2015-03-17

Changed in fuel:
assignee:	MOS Keystone (mos-keystone) → Alexander Makarov (amakarov)

OpenStack Infra (hudson-openstack) wrote on 2015-03-17: Fix proposed to fuel-library (master)

#32

Fix proposed to branch: master
Review: https://review.openstack.org/165203

Changed in fuel:
assignee:	Alexander Makarov (amakarov) → Sergii Golovatiuk (sgolovatiuk)

OpenStack Infra (hudson-openstack) wrote on 2015-03-18: Fix merged to fuel-library (master)

#33

Reviewed: https://review.openstack.org/165203
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b001daf6c6352b1dac2a7339299ba1fa7d010c1e
Submitter: Jenkins
Branch: master

commit b001daf6c6352b1dac2a7339299ba1fa7d010c1e
Author: Sergii Golovatiuk <email address hidden>
Date: Tue Mar 17 21:01:56 2015 +0100

    Tune keystone settings to fail-over faster

    * Reduce dead_retry from 300 to 30. When 3 servers lost memcached server
      and detect it differently so the divergence of hash function is 5
      minutes.
    * Decrease socket_timeout from 3 seconds to 1 second to detect lost backend
      faster
    * Increase pool size from 100 to 1000 connections.

    Closes-Bug: 1432242

    Change-Id: Ib16a9cba4de8442afa2b78770cb93f561b0fd37e
    Signed-off-by: Sergii Golovatiuk <email address hidden>
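
The effect of dead_retry described in this commit message can be sketched with a toy model. Everything here is an assumption made for illustration: the `Client` class, the server names, and the "skip to the next server" fallback only approximate how a hashing memcached client behaves and are not keystone's or any library's actual code.

```python
import zlib

SERVERS = ["node-1", "node-2", "node-3"]


class Client:
    """Toy memcached client: hash the key, skip servers it believes are dead."""

    def __init__(self, dead_retry):
        self.dead_retry = dead_retry
        self.dead_until = {}  # server -> time until which it is skipped

    def mark_dead(self, server, now):
        self.dead_until[server] = now + self.dead_retry

    def pick_server(self, key, now):
        start = zlib.crc32(key.encode()) % len(SERVERS)
        for offset in range(len(SERVERS)):
            server = SERVERS[(start + offset) % len(SERVERS)]
            if self.dead_until.get(server, 0) <= now:
                return server
        return None


def divergence_window(dead_retry, key="token-a9f1"):
    """Seconds during which two clients disagree on where `key` lives."""
    saw_failure, missed_failure = Client(dead_retry), Client(dead_retry)
    home = missed_failure.pick_server(key, now=0)
    saw_failure.mark_dead(home, now=0)  # only one client noticed the restart
    return sum(
        1 for now in range(0, 400)
        if saw_failure.pick_server(key, now) != missed_failure.pick_server(key, now)
    )


print(divergence_window(300), divergence_window(30))  # 300 30
```

In this model a token written by one client can be invisible to the other for as long as their views of the dead server differ, which is the window that lowering dead_retry shrinks.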

Changed in fuel:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2015-03-23: Fix merged to fuel-qa (master)

#34

Reviewed: https://review.openstack.org/164649
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=585f9af050c422f5b60391d2b88bfb5c1f7b2aec
Submitter: Jenkins
Branch: master

commit 585f9af050c422f5b60391d2b88bfb5c1f7b2aec
Author: Egor Kotko <email address hidden>
Date: Mon Mar 16 12:15:56 2015 +0100

    Add output of full subnet from slave node

    Temporary solution for debug

    Change-Id: I34d1f6e612363c0ecc0a567cd4154f822e12918a
    closes-bug: #1432242

OpenStack Infra (hudson-openstack) wrote on 2015-03-25: Change abandoned on fuel-qa (master)

#35

Change abandoned by Nastya Urlapova (<email address hidden>) on branch: master
Review: https://review.openstack.org/164796

OpenStack Infra (hudson-openstack) wrote on 2015-04-02: Fix proposed to fuel-library (stable/6.0)

#36

Fix proposed to branch: stable/6.0
Review: https://review.openstack.org/170035

OpenStack Infra (hudson-openstack) wrote on 2015-04-02: Fix merged to fuel-library (stable/6.0)

#37

Reviewed: https://review.openstack.org/170035
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=afa696f397bf0ceeb0c9b97385085f2ad16c80ce
Submitter: Jenkins
Branch: stable/6.0

commit afa696f397bf0ceeb0c9b97385085f2ad16c80ce
Author: Sergii Golovatiuk <email address hidden>
Date: Tue Mar 17 21:01:56 2015 +0100

    Tune keystone settings to fail-over faster

    Closes-Bug: 1432242

    Change-Id: Ib16a9cba4de8442afa2b78770cb93f561b0fd37e
    Signed-off-by: Sergii Golovatiuk <email address hidden>
    (cherry picked from commit b001daf6c6352b1dac2a7339299ba1fa7d010c1e)

Sergey Novikov (snovikov) on 2015-05-21

tags:	added: on-verification

Sergey Novikov (snovikov) wrote on 2015-05-21:

#38

Verified on fuel-6.1-445-2015-05-20_22-10-04.iso.

Steps to verify:
    1. Deploy Ubuntu HA with 3 controllers
    2. Restart memcached services at controllers one by one, but do not restart keystone services
    3. At some controller, run ". openrc; while true; do date; neutron subnet-list | awk '$4 == "net04__subnet" {print}'; sleep 2; done"

tags:	removed: on-verification

Vitaly Sedelnik (vsedelnik) wrote on 2015-10-23:

#39

Won't fix for 5.1.1-updates, as this is a deployment-time fix and we expect no new 5.1.1 deployments