[regression] Keystone memcached backend unstable after switching to dogpile with pylibmc

Bug #1587771 reported by Dmitriy Kruglov
Affects              Status         Importance  Assigned to        Milestone
Fuel for OpenStack   Fix Committed  High        Alexander Makarov
  Mitaka             Fix Committed  High        Alexander Makarov

Bug Description

Steps to reproduce:
1. Create a cluster
2. Add 1 controller, 1 compute, and 1 mongo node
3. Deploy the cluster

Expected result: the cluster is successfully deployed
Actual result: deployment fails with the error "Deployment has failed. All nodes are finished. Failed tasks: Task[upload_cirros/2] Stopping the deployment process!".

Error in puppet.log: http://paste.openstack.org/show/506851/
Error in glance-api.log: http://paste.openstack.org/show/506853/

Description of the environment:
MOS 9.0, build 424 (shotgun2 short-report - http://paste.openstack.org/show/506850/).

Ilya Kutukov (ikutukov)
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
status: New → Confirmed
tags: added: area-python
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → MOS Glance (mos-glance)
tags: removed: area-python
Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Will check it tomorrow morning once an env with that ISO is deployed.

Changed in fuel:
assignee: MOS Glance (mos-glance) → Kairat Kushaev (kkushaev)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The issue is reproduced again on the latest SWARM run.

tags: added: swarm-blocker
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Reproduced on MOS #427.

tags: added: blocker-for-qa
summary: - Deployment fails with
- ""/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1
- instead of one of [0]" error
+ [regression] Deployment fails with
+ ""/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1
+ instead of one of [0]" error
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote : Re: [regression] Deployment fails with ""/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1 instead of one of [0]" error

I think the root of the issue is in the fuel-library code or in the MOS Glance Puppet manifests, not in the Glance code itself. It looks like a race condition during deployment or an incorrect ordering of the deployment steps.

no longer affects: newton (Ubuntu)
Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Looks like the bug is intermittent: it was not reproduced on ISO #426.
Waiting for the env to be reverted to identify the root cause.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The credentials for the environment with the issue were provided in Slack; we are going to debug it.

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

After additional analysis it turned out that ./upload_cirros_command.sh works on the controller. So something happens to the endpoints: they are not available during deployment.

tags: removed: swarm-blocker
Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Apparently, this is a transient issue:
1. upload_cirros.sh works on the failed env after revert
2. the service catalog entry for swift exists and is properly created (it can be verified as sketched below)
3. the credentials in glance-api.conf seem correct
4. we cannot reproduce it on a fresh ISO
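
For reference, the swift catalog entry from item 2 can be checked programmatically; below is a minimal sketch using keystoneauth1 and python-keystoneclient. The auth URL and credentials are placeholders, not values from this environment.

    # Check that a swift (object-store) entry exists in the service catalog
    # and list its endpoints. Credentials below are illustrative placeholders.
    from __future__ import print_function

    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from keystoneclient.v3 import client

    auth = v3.Password(auth_url='http://10.109.11.5:5000/v3',
                       username='admin', password='admin',
                       project_name='admin',
                       user_domain_id='default', project_domain_id='default')
    keystone = client.Client(session=session.Session(auth=auth))

    swift_services = [s for s in keystone.services.list()
                      if s.type == 'object-store']
    print('object-store services:', swift_services)
    for ep in keystone.endpoints.list():
        if any(s.id == ep.service_id for s in swift_services):
            print(ep.interface, ep.url)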

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Per a discussion with the Puppet team, I am marking this as Invalid until it is reproduced again.

Revision history for this message
Anastasia Kuznetsova (akuznetsova) wrote :

The bug was reproduced again on ISO #432 in 3 system tests:
2016-06-03 02:00:38 +0000 /Stage[main]/Main/Exec[upload_cirros_shell]/returns (err): change from notrun to 0 failed: /bin/bash "/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1 instead of one of [0]
2016-06-03 02:00:38 +0000 /Stage[main]/Main/Exec[upload_cirros_shell]/returns (notice): Error finding address for http://10.109.11.5:9292/v2/images/aac4c8c7-a788-437a-b4d4-da41a74ccfc7/file: Unable to establish connection to http://10.109.11.5:9292/v2/images/aac4c8c7-a788-437a-b4d4-da41a74ccfc7/file

Here is a link to job:
https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.services_ha_one_controller/127/console

Revision history for this message
Denis Egorenko (degorenko) wrote :

After some debugging and looking at the logs, we found:

<11>Jun 2 22:45:00 node-4 keystone-admin: 2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi [req-ed33f28d-5c25-4a95-b5c4-48d33f7ed82a - - - - -] error 31 from memcached_get(62340dd8e32ed12b9eea81ce3e4264fc): (139763353659456) A TIMEOUT OCCURRED, host: 10.109.26.2:11211 -> libmemcached/connect.cc:108
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi Traceback (most recent call last):
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/keystone/common/wsgi.py", line 249, in __call__
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi result = method(context, **params)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/keystone/common/controller.py", line 181, in inner
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi return f(self, context, *args, **kwargs)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/keystone/catalog/controllers.py", line 286, in get_service
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi ref = self.catalog_api.get_service(service_id)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/keystone/common/manager.py", line 124, in wrapped
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi __ret_val = __f(*args, **kwargs)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/dogpile/cache/region.py", line 1053, in decorate
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi should_cache_fn)
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/dogpile/cache/region.py", line 657, in get_or_create
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi async_creator) as value:
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi File "/usr/lib/python2.7/dist-packages/dogpile/core/dogpile.py", line 158, in __enter__
2016-06-02 22:45:00.763 1156 ERROR keystone.common.wsgi

So it looks like we have a problem with the memcached backend.
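
For context, the traceback enters the memcached backend through dogpile's get_or_create(). The following standalone sketch reconstructs the same dogpile + pylibmc setup (the server address is taken from the log line above; the expiration time and behaviors are illustrative) to show where such a timeout surfaces:

    # Reconstruct the cache path from the traceback: keystone's caching
    # decorator ends up in region.get_or_create(), which calls into
    # pylibmc/libmemcached, where "error 31 ... A TIMEOUT OCCURRED" is raised.
    from dogpile.cache import make_region

    region = make_region().configure(
        'dogpile.cache.pylibmc',
        expiration_time=600,
        arguments={
            'url': ['10.109.26.2:11211'],
            # pylibmc behaviors; a short connect timeout (in ms) makes the
            # libmemcached timeout easy to hit against a slow server
            'behaviors': {'connect_timeout': 100},
        },
    )

    def get_service(service_id):
        # stand-in for keystone's catalog_api.get_service backend call
        return {'id': service_id}

    # The dogpile frame visible in the traceback; a memcached timeout here
    # propagates up through keystone.common.wsgi as the error 31 above.
    value = region.get_or_create('service-123', lambda: get_service('123'))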

Revision history for this message
Denis Egorenko (degorenko) wrote :

We started facing this issue right after the fix for bug https://bugs.launchpad.net/fuel/+bug/1581133 was committed. MOS Keystone team, any ideas?

Revision history for this message
Alexander Makarov (amakarov) wrote :

The memcache_pool driver was moved to oslo.cache without the patch that fixes this error in keystone: https://review.fuel-infra.org/#/c/6804/1/keystone/common/cache/_memcache_pool.py
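
For readers unfamiliar with that driver: the patch linked above concerns pooling of memcached clients. The sketch below is not keystone's actual code, just the idea, assuming the failure mode is many eventlet greenthreads sharing one connection; the names are illustrative.

    # Each caller checks a memcached client out of a queue and returns it when
    # done, instead of all eventlet greenthreads sharing a single connection
    # (where concurrent use can show up as spurious timeouts like error 31).
    import contextlib
    import Queue  # Python 2, matching the MOS 9.0 stack

    import memcache  # python-memcached client; pylibmc would be analogous

    class MemcacheClientPool(object):  # illustrative name
        def __init__(self, urls, size=10):
            self._pool = Queue.Queue()
            for _ in xrange(size):
                self._pool.put(memcache.Client(urls))

        @contextlib.contextmanager
        def acquire(self):
            client = self._pool.get()
            try:
                yield client
            finally:
                self._pool.put(client)

    # usage: with pool.acquire() as mc: mc.get(key)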

Revision history for this message
Dina Belova (dbelova) wrote :
summary: - [regression] Deployment fails with
- ""/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1
- instead of one of [0]" error
+ [regression] Keystone memcached backend unstable after switching to
+ dogpile with pylibmc
Changed in fuel:
assignee: MOS Keystone (mos-keystone) → Alexander Makarov (amakarov)
tags: added: swarm-blocker
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/325234
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=4eb97031c20625bd4f557caf31037e544f1a7ef8
Submitter: Jenkins
Branch: master

commit 4eb97031c20625bd4f557caf31037e544f1a7ef8
Author: Alexander Makarov <email address hidden>
Date: Fri Jun 3 11:36:22 2016 +0000

    Revert "Set proper cache backend for keystone.conf"

    This reverts commit da0f0dd670824eeddd4580ba1f56099d5d0299b8.

    Change-Id: I7e8458b715a5b8af30134730fb15af2f92dead51
    Closes-Bug: 1587771

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/325233
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=d097caf36adefa20a49ab6de2ce8ee2e90856f70
Submitter: Jenkins
Branch: stable/mitaka

commit d097caf36adefa20a49ab6de2ce8ee2e90856f70
Author: Alexander Makarov <email address hidden>
Date: Fri Jun 3 11:33:36 2016 +0000

    Revert "Set proper cache backend for keystone.conf"

    This reverts commit e65f393e419a07ecd1a3cd1e5ade697a792e6617.

    Change-Id: Iaca55d06089de86ce1ffa624bc2a518c6d2937f8
    Closes-Bug: 1587771
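
For context, keystone builds its dogpile cache region from the [cache] section of keystone.conf via oslo.cache, which is the setting the reverted commit touched. A minimal sketch of that wiring; the option values are assumptions (per the bug title, the unstable configuration used dogpile's pylibmc backend, while oslo_cache.memcache_pool is the pooled driver referenced earlier):

    # Wire keystone.conf's [cache] options into a dogpile region via oslo.cache.
    from oslo_cache import core as cache
    from oslo_config import cfg

    CONF = cfg.CONF
    cache.configure(CONF)  # registers the [cache] option group

    # Equivalent of editing keystone.conf:
    #   [cache]
    #   enabled = true
    #   backend = oslo_cache.memcache_pool   # pooled, eventlet-safe driver
    #   memcache_servers = 10.109.26.2:11211
    CONF.set_override('enabled', True, group='cache')
    CONF.set_override('backend', 'oslo_cache.memcache_pool', group='cache')
    CONF.set_override('memcache_servers', ['10.109.26.2:11211'], group='cache')

    region = cache.create_region()
    cache.configure_cache_region(CONF, region)  # region now backs keystone caching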

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Here is an example of the error: https://paste.mirantis.net/show/2341/.
It repeats 3 times, after which the request succeeds.

Revision history for this message
Kairat Kushaev (kkushaev) wrote :

An example where the request succeeded and the endpoint was found in the service catalog:
2016-06-06T03:26:50.659954+00:00 info: 2016-06-06 03:26:50.660 20912 INFO eventlet.wsgi.server [req-cd1efbd3-5758-4605-bad6-2f4b5ef5f41a 929e93a99387410fa8dde7987b469921 780658d4c33844a18124f542ab1cb76d - - -] 10.109.1.5 - - [06/Jun/2016 03:26:50] "PUT /v2/images/53d8e8d8-b482-450a-b90c-ffc68acde429/file HTTP/1.1" 204 208 1.891944

Here is where glance_store requests the endpoint:
https://github.com/openstack/glance_store/blob/master/glance_store/_drivers/swift/connection_manager.py#L150

Revision history for this message
Alexander Makarov (amakarov) wrote :

The regression wasn't reproduced; the last glance failure is a different bug. Closing this one.

no longer affects: fuel/newton
Revision history for this message
Boris Bobrov (bbobrov) wrote :

There are 2 bugs in this bug report. One (the pylibmc one) was fixed; the other is being worked on in bug report 1589433.

Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

It's strange to consider something fixed when the bug symptoms from the initial scenario are still present.
Anyway, thank you for pointing to the report where the remaining issue is being fixed.

Revision history for this message
Dina Belova (dbelova) wrote :

I moved the 9.0 milestone here to Fix Committed, as the bug was partially fixed here; the other part is tracked in the duplicate bug.
