Fuel for OpenStack

[regression] Keystone memcached backend unstable after switching to dogpile with pylibmc

Bug #1587771 reported by Dmitriy Kruglov on 2016-06-01

This bug report is a duplicate of: Bug #1589433: Stale oslo.cache leads to false-positive cache misses. Edit Remove

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Fix Committed	High	Alexander Makarov	Fuel for OpenStack 10.0
	Mitaka	Fix Committed	High	Alexander Makarov	Fuel for OpenStack 9.0

Bug Description

Steps to reproduce:
1. Create a cluster
2. Add 1 controller, 1 compute and 1 mongo nodes
3. Deploy the cluster

Expected result: the cluster is successfully deployed
Actual result: deployment fails with the error "Deployment has failed. All nodes are finished. Failed tasks: Task[upload_cirros/2] Stopping the deployment process!".

Error puppet.log http://paste.openstack.org/show/506851/.
Error in glance-api.log http://paste.openstack.org/show/506853/

Description of the environment:
MOS 9.0, build 424 (shotgun2 short-report - http://paste.openstack.org/show/506850/).

Tags:

Revision history for this message

Dmitriy Kruglov (dkruglov) wrote on 2016-06-01:

fail_error_lcm_deploy_1_ctrl_1_cmp_1_mongo-fuel-snapshot-2016-06-01_03-43-22.tar.gz Edit (25.1 MiB, application/x-tar)

Ilya Kutukov (ikutukov) on 2016-06-01

Changed in fuel:
assignee:	nobody → Fuel Sustaining (fuel-sustaining-team)
status:	New → Confirmed
tags:	added: area-python

Dmitry Pyzhov (dpyzhov) on 2016-06-01

Changed in fuel:
assignee:	Fuel Sustaining (fuel-sustaining-team) → MOS Glance (mos-glance)
tags:	removed: area-python

Revision history for this message

Kairat Kushaev (kkushaev) wrote on 2016-06-01:

Will check it tomorrow morning when env with that iso will be deployed.

Changed in fuel:
assignee:	MOS Glance (mos-glance) → Kairat Kushaev (kkushaev)

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-06-02:

The issue is still reproduced again on the latest SWARM

tags:

added: swarm-blocker

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-06-02:

Reproduced on MOS #427.

tags:	added: blocker-for-qa
summary:	- Deployment fails with + [regression] Deployment fails with ""/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1 instead of one of [0]" error

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-06-02: Re: [regression] Deployment fails with ""/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1 instead of one of [0]" error

I think the root of the issue in fuel-library code or in MOS Glance Puppet manifests, not in Glance code itself. It looks like a race condition during the deployment or incorrect chain for the deployment steps.

Volodymyr Shypyguzov (vshypyguzov) on 2016-06-02

no longer affects:

newton (Ubuntu)

Revision history for this message

Kairat Kushaev (kkushaev) wrote on 2016-06-02:

Looks like the bug is floating because it was not reproduced for ISO#426.
Waiting for env to be reverted to identify the root cause.

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-06-02:

The credentials for environment with the issue were provided in slack, we are going to debug the issue.

Revision history for this message

Kairat Kushaev (kkushaev) wrote on 2016-06-02:

After additional analysis it turned out that ./upload_cirros_command.sh works on controller. So something happens with endpoints and they are not available during deployment.

Sergey Shevorakov (sshevorakov) on 2016-06-02

tags:

removed: swarm-blocker

Revision history for this message

Kairat Kushaev (kkushaev) wrote on 2016-06-03:

Apparently, this is temporary issue:
1. upload_cirros.sh works on failed env after revert
2. service catalog entry for swift exists and properly created
3. credentials in glance-api.conf seems correct.
4. we cannot reproduce it on fresh iso

Revision history for this message

Kairat Kushaev (kkushaev) wrote on 2016-06-03:

#10

According to discussion with Puppet team I mark that as Invalid until it won't be reproduces again.

Revision history for this message

Anastasia Kuznetsova (akuznetsova) wrote on 2016-06-03:

#11

Bug was reproduced again on ISO 432 in 3 system tests:
2016-06-03 02:00:38 +0000 /Stage[main]/Main/Exec[upload_cirros_shell]/returns (err): change from notrun to 0 failed: /bin/bash "/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1 instead of one of [0]
2016-06-03 02:00:38 +0000 /Stage[main]/Main/Exec[upload_cirros_shell]/returns (notice): Error finding address for http://10.109.11.5:9292/v2/images/aac4c8c7-a788-437a-b4d4-da41a74ccfc7/file: Unable to establish connection to http://10.109.11.5:9292/v2/images/aac4c8c7-a788-437a-b4d4-da41a74ccfc7/file

Here is a link to job:
https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.services_ha_one_controller/127/console

Revision history for this message

Denis Egorenko (degorenko) wrote on 2016-06-03:

#12

After some debug and looking at logs we found:

<11>Jun 2 22:45:00 node-4 keystone-admin: 2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi [req-ed33f28d-5c25-4a95-b5c4-48d33f7ed82a - - - - -] error 31 from memcached_get(62340dd8e32ed12b9eea81ce3e4264fc): (139763353659456) A TIMEOUT OCCURRED, host: 10.109.26.2:11211 -> libmemcached/connect.cc:108
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi Traceback (most recent call last):
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/keystone/common/wsgi.py", line 249, in __call__
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi result = method(context, **params)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/keystone/common/controller.py", line 181, in inner
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi return f(self, context, *args, **kwargs)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/keystone/catalog/controllers.py", line 286, in get_service
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi ref = self.catalog_api.get_service(service_id)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/keystone/common/manager.py", line 124, in wrapped
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi __ret_val = __f(*args, **kwargs)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/dogpile/cache/region.py", line 1053, in decorate
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi should_cache_fn)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/dogpile/cache/region.py", line 657, in get_or_create
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi async_creator) as value:
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/dogpile/core/dogpile.py", line 158, in __enter__
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi

So looks like we have some problem with memcache backend.

After some debug and looking at logs we found:

<11>Jun  2 22:45:00 node-4 keystone-admin: 2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi [req-ed33f28d-5c25-4a95-b5c4-48d33f7ed82a - - - - -] error 31 from memcached_get(62340dd8e32ed12b9eea81ce3e4264fc): (139763353659456) A TIMEOUT OCCURRED, host: 10.109.26.2:11211 -> libmemcached/connect.cc:108
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi Traceback (most recent call last):
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/keystone/common/wsgi.py", line 249, in __call__
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi     result = method(context, **params)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/keystone/common/controller.py", line 181, in inner
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi     return f(self, context, *args, **kwargs)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/keystone/catalog/controllers.py", line 286, in get_service
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi     ref = self.catalog_api.get_service(service_id)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/keystone/common/manager.py", line 124, in wrapped
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi     __ret_val = __f(*args, **kwargs)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/dogpile/cache/region.py", line 1053, in decorate
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi     should_cache_fn)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/dogpile/cache/region.py", line 657, in get_or_create
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi     async_creator) as value:
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/dogpile/core/dogpile.py", line 158, in __enter__
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi

So looks like we have some problem with memcache backend.

Revision history for this message

Denis Egorenko (degorenko) wrote on 2016-06-03:

#13

We faced with this issue right after fix for bug https://bugs.launchpad.net/fuel/+bug/1581133 was committed. MOS-Keystone team, any ideas?

Revision history for this message

Alexander Makarov (amakarov) wrote on 2016-06-03:

#14

memcache_pool driver was moved to oslo_cache without a patch fixing this error in keystone: https://review.fuel-infra.org/#/c/6804/1/keystone/common/cache/_memcache_pool.py

Revision history for this message

Dina Belova (dbelova) wrote on 2016-06-03:

#15

https://review.openstack.org/#/c/325233/
https://review.openstack.org/#/c/325234/

reverts for 9.0 and master are on review

Vladimir Kuklin (vkuklin) on 2016-06-03

summary:

- [regression] Deployment fails with
- ""/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1
- instead of one of [0]" error
+ [regression] Keystone memcached backend unstable after switching to
+ dogpile with pylibmc

OpenStack Infra (hudson-openstack) on 2016-06-03

Changed in fuel:
assignee:	MOS Keystone (mos-keystone) → Alexander Makarov (amakarov)

Sergey Shevorakov (sshevorakov) on 2016-06-03

tags:

added: swarm-blocker

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-06-03: Fix merged to fuel-library (master)

#16

Reviewed: https://review.openstack.org/325234
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=4eb97031c20625bd4f557caf31037e544f1a7ef8
Submitter: Jenkins
Branch: master

commit 4eb97031c20625bd4f557caf31037e544f1a7ef8
Author: Alexander Makarov <email address hidden>
Date: Fri Jun 3 11:36:22 2016 +0000

Revert "Set proper cache backend for keystone.conf"

This reverts commit da0f0dd670824eeddd4580ba1f56099d5d0299b8.

Change-Id: I7e8458b715a5b8af30134730fb15af2f92dead51
Closes-Bug: 1587771

Changed in fuel:
status:	In Progress → Fix Committed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-06-04: Fix merged to fuel-library (stable/mitaka)

#17

Reviewed: https://review.openstack.org/325233
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=d097caf36adefa20a49ab6de2ce8ee2e90856f70
Submitter: Jenkins
Branch: stable/mitaka

commit d097caf36adefa20a49ab6de2ce8ee2e90856f70
Author: Alexander Makarov <email address hidden>
Date: Fri Jun 3 11:33:36 2016 +0000

Revert "Set proper cache backend for keystone.conf"

This reverts commit e65f393e419a07ecd1a3cd1e5ade697a792e6617.

Change-Id: Iaca55d06089de86ce1ffa624bc2a518c6d2937f8
Closes-Bug: 1587771

Revision history for this message

Ilya Menkov (imenkov) wrote on 2016-06-06:

#18

The issue reproduced again
https://product-ci.infra.mirantis.net/view/9.0_swarm/job/9.0.system_test.ubuntu.services_ha_one_controller/130/console

Revision history for this message

Kairat Kushaev (kkushaev) wrote on 2016-06-06:

#19

Here is an error example: https://paste.mirantis.net/show/2341/.
this repeats 3 times and after that request goes successfully.

Revision history for this message

Kairat Kushaev (kkushaev) wrote on 2016-06-06:

#20

fail_error_deploy_heat_ha_one_controller_neutron-fuel-snapshot-2016-06-06_03-33-51.tar.gz Edit (42.7 MiB, application/x-tar)

Revision history for this message

Kairat Kushaev (kkushaev) wrote on 2016-06-06:

#21

The example when request succeed and endpoint was found in service catalog:
2016-06-06T03:26:50.659954+00:00 info: 2016-06-06 03:26:50.660 20912 INFO eventlet.wsgi.server [req-cd1efbd3-5758-4605-bad6-2f4b5ef5f41a 929e93a99387410fa8dde7987b469921 780658d4c33844a18124f542ab1cb76d - - -] 10.109.1.5 - - [06/Jun/2016 03:26:50] "PUT /v2/images/53d8e8d8-b482-450a-b90c-ffc68acde429/file HTTP/1.1" 204 208 1.891944

Here glance_store request endpoint:
https://github.com/openstack/glance_store/blob/master/glance_store/_drivers/swift/connection_manager.py#L150

Revision history for this message

Alexander Makarov (amakarov) wrote on 2016-06-06:

#22

Regression wasn't reproduced - last glance failure is a different bug. Closing this one.

Maksim Malchuk (mmalchuk) on 2016-06-07

no longer affects:

fuel/newton

Revision history for this message

Dmitriy Kruglov (dkruglov) wrote on 2016-06-08:

#23

The issue is reproduced again on the latest swarm run, although the applied fix (revert) is present in the ISO under test
https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.test_ensurability/34/testReport/junit/(root)/lcm_deploy_1_ctrl_1_cmp_1_mongo/lcm_deploy_1_ctrl_1_cmp_1_mongo/

puppet.log excerpt http://paste.openstack.org/show/508879/
glance-all.log excerpt http://paste.openstack.org/show/508886/

ISO details http://paste.openstack.org/show/508891/

Revision history for this message

Dmitriy Kruglov (dkruglov) wrote on 2016-06-08:

#24

fail_error_lcm_deploy_1_ctrl_1_cmp_1_mongo-fuel-snapshot-2016-06-08_00-56-23.tar.gz Edit (28.2 MiB, application/x-tar)

Revision history for this message

Boris Bobrov (bbobrov) wrote on 2016-06-08:

#25

There are 2 bugs in this bugreport. One was fixed (the one with pylibmc), one is being worked on in bugreport 1589433.

Revision history for this message

Dmitriy Kruglov (dkruglov) wrote on 2016-06-08:

#26

It's strange to consider something fixed when the bug symptoms from initial scenario are present.
Anyway, thank you for pointing out to the current issue report where the bug is being fixed.

Revision history for this message

Dina Belova (dbelova) wrote on 2016-06-08:

#27

I moved the 9.0 milestone here to fix committed as partially bug was fixed here, the other part is stored in the duplicate bug.