Stale oslo.cache leads to false-positive cache misses

Bug #1589433 reported by Yury Tregubov
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Released
Critical
Alexander Makarov
Mitaka
Fix Released
Critical
Alex Schultz

Bug Description

Deployment of an env with the node roles below has been failing since 9.0 MOS ISO #427:

1 Controller
1 Compute + Cinder
1 MongoDB

The previous successful deploy was on MOS 9.0 ISO #424.

The following errors from puppet are seen in the logs on the affected controller:

2016-06-05 22:17:45 ERR /usr/lib/ruby/vendor_ruby/puppet/util/errors.rb:106:in `fail'
2016-06-05 22:17:45 ERR /bin/bash "/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1 instead of one of [0]
2016-06-05 22:17:35 ERR (/Stage[main]/Main/Exec[upload_cirros_shell]/returns) change from notrun to 0 failed: /bin/bash "/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1 instead of one of [0]

Also, on the node with MongoDB:

/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager Traceback (most recent call last):
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/ceilometer/agent/manager.py", line 458, in discover
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager discovered = discoverer.discover(self, param)
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/ceilometer/network/services/discovery.py", line 68, in discover
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager loadbalancers = self.neutron_cli.list_loadbalancer()
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/ceilometer/neutron_client.py", line 401, in list_loadbalancer
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager resp = self.client.list_loadbalancers()
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 97, in with_params
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager ret = self.function(instance, *args, **kwargs)
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 1045, in list_loadbalancers
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager retrieve_all, **_params)
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 373, in list
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager for r in self._pagination(collection, path, **params):

/var/log/remote/node-3.test.domain.local/bootstrap/mcollective.log:2016-06-05T21:29:05.709647+00:00 err: 21:29:05.219092 #1484] ERROR -- : pluginmanager.rb:171:in `rescue in loadclass' Failed to load MCollective::Agent::Puppetd: cannot load such file -- puppet
/var/log/remote/node-3.test.domain.local/bootstrap/mcollective.log:2016-06-05T21:29:05.709728+00:00 err: 21:29:05.219197 #1484] ERROR -- : agents.rb:71:in `rescue in loadagent' Loading agent puppetd failed: cannot load such file -- puppet
/var/log/remote/node-3.test.domain.local/bootstrap/mcollective.log:2016-06-05T21:29:27.865469+00:00 err: 21:29:27.139772 #1795] ERROR -- : pluginmanager.rb:171:in `rescue in loadclass' Failed to load MCollective::Agent::Puppetd: cannot load such file -- puppet
/var/log/remote/node-3.test.domain.local/bootstrap/mcollective.log:2016-06-05T21:29:27.865564+00:00 err: 21:29:27.139866 #1795] ERROR -- : agents.rb:71:in `rescue in loadagent' Loading agent puppetd failed: cannot load such file -- puppet
/var/log/shotgun.log:2016-06-05 22:18:16 ERROR 11374 (manager) Clearing target failed. function: <built-in function listdir>, path: /var/dump, excinfo: (<type 'exceptions.OSError'>, OSError(2, 'No such file or directory'), <traceback object at 0x2e567e8>)
/var/log/shotgun.log:2016-06-05 22:18:16 ERROR 11374 (manager) Clearing target failed. function: <built-in function rmdir>, path: /var/dump, excinfo: (<type 'exceptions.OSError'>, OSError(2, 'No such file or directory'), <traceback object at 0x2e56ea8>)

The problem is still reproducible. A diagnostic snapshot from ISO #443 is attached.
The other environments, without a MongoDB node, are deploying without problems at the moment.

Revision history for this message
Yury Tregubov (ytregubov) wrote :
summary: - Deployment of env with mongodb is failed
+ Deployment of env with mongodb and ceilometer is failed
tags: added: area-library
tags: added: module-astute module-nailgun
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Changed in fuel:
importance: Undecided → Critical
description: updated
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/newton
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote : Re: Deployment of env with mongodb and ceilometer is failed

Not reproduced on #445, closing as Invalid.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

I believe this might be a duplicate of Bug 1589433

Revision history for this message
Alex Schultz (alex-schultz) wrote :
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Alex, it is a different bug.

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → MOS Keystone (mos-keystone)
Andrey Maximov (maximov)
tags: removed: module-astute module-nailgun
Revision history for this message
Alexander Makarov (amakarov) wrote :

The keystone client in the glance process can't extract the swift endpoint from the token catalog,
so work is under way to understand why.
The problem with reproducing the bug is that it vanishes some time after deployment.
Yury is about to add token data logging to the keystone client so that the token catalog is visible in the glance log.

Changed in fuel:
assignee: MOS Keystone (mos-keystone) → Alexander Makarov (amakarov)
Revision history for this message
Alexander Makarov (amakarov) wrote :

Another concern:
glance (along with ironic and ceilometer) has a habit of asking the keystone client for endpoints with interface type "internalURL" even when using the keystone v3 API, where no such endpoint type exists - there it is called "internal".

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Alexander, does this mean it is an issue in Glance or an incorrect configuration of endpoints?

Changed in fuel:
status: In Progress → Confirmed
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Status changed to Confirmed because we don't have any fixes on review yet.

Priority is Critical because the issue fails deployments.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Do we have ETA for the fix?

Dmitry Pyzhov (dpyzhov)
tags: removed: area-library
Revision history for this message
Alexander Makarov (amakarov) wrote :

Timur, we need a custom ISO with additional logging before we can say anything.
Yury Tregubov is working on it.

Revision history for this message
Alexander Makarov (amakarov) wrote :

Update to #7:
KSC normalizes the endpoint type, so 'internal' and 'internalURL' are both valid for v3 (and for v2 too):
https://github.com/openstack/python-keystoneclient/blob/master/keystoneclient/service_catalog.py#L387-L391
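
To illustrate the point: this is a rough paraphrase of what that v3 normalization does (not a verbatim copy of the linked code), as a standalone Python snippet:

    # Rough paraphrase of the linked keystoneclient v3 normalization: a
    # trailing "URL" is stripped, so 'internalURL' and 'internal' end up
    # selecting the same interface when looking up v3 endpoints.
    def normalize_v3_endpoint_type(endpoint_type):
        if endpoint_type and 'URL' in endpoint_type:
            endpoint_type = endpoint_type.replace('URL', '')
        return endpoint_type

    assert normalize_v3_endpoint_type('internalURL') == 'internal'
    assert normalize_v3_endpoint_type('internal') == 'internal'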

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/mitaka)

Change abandoned by Boris Bobrov (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/326640
Reason: doesn't actually fix.

Revision history for this message
Dina Belova (dbelova) wrote : Re: Deployment of env with mongodb and ceilometer is failed

OK, so here are the most recent results.

== TL; DR == Root cause is not clear ==

1) reverting https://review.openstack.org/#/c/320845/ does not help; the fact that Alex Schultz did not observe a deployment failure during his tests is just a coincidence - the bug is not 100% reproducible. There are 3 retry attempts to upload this image, and today we've seen all variants, from a failed deployment to a successful one where the first attempt failed and the second succeeded. Therefore Boris is abandoning these reverts.
2) from the puppet side everything looks OK
3) all endpoints are created correctly ~10 minutes before glance asks for the swift endpoint; these endpoints do not change during the deployment process and nothing modifies them
4) still, the image upload might fail with the error message listed in the bug. Alex M. is on task to add the required logging almost everywhere and try to catch something useful on the keystone client side

tags: added: swarm-blocker
Revision history for this message
Alexander Makarov (amakarov) wrote :
Revision history for this message
Alexander Makarov (amakarov) wrote :

Endpoints have the following format:
{
    "url": "http://10.109.1.6:8776/v3/7f071c08330e4e38be602307f5131f89",
    "interface": "internal",
    "region": "RegionOne",
    "region_id": "RegionOne",
    "id": "68d1c11b4e124e63bd8086b15d317e4c"
}

Error message states:

Cannot find swift service endpoint : internalURL endpoint for object-store service in RegionOne region not found.

It looks for "object-store" service, which is absent in the endpoint structure above.

To be continued...

Revision history for this message
Alexander Makarov (amakarov) wrote :

The service type is stored one level above, so the previous concern is invalid.

Revision history for this message
Alexander Makarov (amakarov) wrote :

Let's try to include in the error message the service types that are available at the moment of failure: https://review.fuel-infra.org/#/c/21779/

Revision history for this message
Alexander Makarov (amakarov) wrote :

Well, the catalog is incomplete:
2016-06-08 20:13:59.047 23387 ERROR glance.api.v2.image_data BackendException: Cannot find swift service endpoint : internalURL endpoint for object-store service in RegionOne region not found among [u'compute', u'network', u'volumev2', u'volumev3', u'alarming', u'image', u'metering', u'artifact', u'compute_legacy', u'volume', u'cloudformation', u'orchestration', u'identity'] among [u'compute', u'network', u'volumev2', u'volumev3', u'alarming', u'image', u'metering', u'artifact', u'compute_legacy', u'volume', u'cloudformation', u'orchestration', u'identity']

There is no object-store service type in this list,
but it gets into the catalog afterwards.
We need to investigate the deployment sequence.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

I could not reproduce the issue across 4 redeployments. Each time the swift endpoint was added to keystone 15-20 minutes before the cirros image was uploaded to glance.

Revision history for this message
Dina Belova (dbelova) wrote :

Just a random note: keystone caches the service catalog internally:

https://github.com/openstack/keystone/blob/stable/mitaka/keystone/catalog/core.py#L296
https://github.com/openstack/keystone/blob/stable/mitaka/keystone/catalog/core.py#L303

The default expiration time is 600 seconds, so glance may consume an old cached service catalog... I'm not sure how exactly this might happen, or which service asked for the catalog first and got the wrong (not yet populated) value cached, since keystone is the first service we install. So this might be as far-fetched as the other proposals.
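
For context, the two linked lines memoize the catalog lookup through dogpile.cache. Here is a minimal sketch of that pattern; the in-memory backend, function name and return value are illustrative stand-ins (keystone builds its region via oslo.cache, typically backed by memcached), and only the 600-second expiration matches the default mentioned above:

    from dogpile.cache import make_region

    # Illustrative region; keystone configures its region through oslo.cache.
    catalog_region = make_region().configure(
        'dogpile.cache.memory',
        expiration_time=600,  # the 600 s default mentioned above
    )

    @catalog_region.cache_on_arguments()
    def get_v3_catalog(user_id, project_id):
        # Stand-in for the real catalog driver query: whatever this returns
        # the first time is what every later caller sees until the entry
        # expires or the region is explicitly invalidated.
        return ['identity', 'image', 'object-store']

    # The first call populates the cache; calls within the next 600 seconds
    # return the cached snapshot even if new endpoints were registered since.
    print(get_v3_catalog('user-1', 'project-1'))
    print(get_v3_catalog('user-1', 'project-1'))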

Revision history for this message
Alexander Makarov (amakarov) wrote :

Deploying with https://review.fuel-infra.org/#/c/21888/
It adds logging of the catalog service types, both those added to the issued token and those stored in the database (retrieved bypassing the cache).

Revision history for this message
Dina Belova (dbelova) wrote :

OK, it looks like we were able to find the root cause of the issue.

================
What's happening
================

Here is the flow of actions we see on the failed environments:
- we call glance to upload an image
- glance uses the swift client to grab swift's endpoint when Keystone v3 is used - https://github.com/openstack/glance_store/blob/master/glance_store/_drivers/swift/connection_manager.py#L150
- the swift client looks the endpoint up in the service catalog - https://github.com/openstack/python-swiftclient/blob/master/swiftclient/client.py#L583-L589 - in fact, the service catalog extraction is cached (10-minute expiration time), so an old service catalog snapshot might be used
- if the HW is quick enough, the deployment completes really quickly (on our test environments a full redeploy takes ~25 minutes, so the cache expires 1-2 times at most during the deployment) -> the cache might not expire and refresh the service catalog before the image upload starts -> we face a failure

=================
What can be done?
=================

1) set a shorter service catalog expiration time in the keystone config - *No! Let's not do it!* The service catalog is needed during token creation -> this will affect token management time -> that's really a bad idea
2) add more retry logic to the image upload - this is a good thing to start with. Let's do it
3) find out why the service catalog is not invalidated when we ask for the endpoint, and fix it - this is also a good idea, let's do it as well. This will prevent issues like this in the future.

Revision history for this message
Alexander Makarov (amakarov) wrote :

This is from the keystone custom log:

Populating service catalog with [u'compute_legacy', u'volume', u'alarming', u'identity', u'volumev2', u'metering', u'compute', u'artifact', u'image', u'volumev3', u'network'], types in db: [u'compute_legacy', u'object-store', u'cloudformation', u'volume', u'alarming', u'identity', u'volumev2', u'metering', u'compute', u'artifact', u's3', u'image', u'orchestration', u'volumev3', u'network']

There is no object-store in the catalog attached to the token, but it is in the database, so it's clearly a cache invalidation issue.

Revision history for this message
Sergey Kolekonov (skolekonov) wrote :

I think we also need to add retries on image upload to this script:
/etc/puppet/modules/osnailyfacter/modular/astute/upload_cirros.rb
Please also note that the broken image left after a failed attempt should be removed from Glance.
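
The script itself is Ruby; purely to illustrate the retry-and-cleanup pattern being discussed, here is a rough Python sketch against the python-glanceclient v2 API. The credentials, image path, retry count and delay are placeholders, not what upload_cirros.rb actually uses:

    import time

    from glanceclient import Client
    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    # Placeholder credentials, for illustration only.
    auth = v3.Password(auth_url='http://127.0.0.1:5000/v3',
                       username='admin', password='secret', project_name='admin',
                       user_domain_id='default', project_domain_id='default')
    glance = Client('2', session=session.Session(auth=auth))

    def upload_with_retries(image_path, attempts=5, delay=30):
        for attempt in range(1, attempts + 1):
            image = glance.images.create(name='TestVM', disk_format='qcow2',
                                         container_format='bare',
                                         visibility='public')
            try:
                with open(image_path, 'rb') as data:
                    glance.images.upload(image.id, data)
                if glance.images.get(image.id).status == 'active':
                    return image.id
                # Image never went active: remove it before retrying.
                glance.images.delete(image.id)
            except Exception:
                # Remove the broken image before the next attempt.
                glance.images.delete(image.id)
            time.sleep(delay)
        raise RuntimeError('image upload failed after %d attempts' % attempts)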

Revision history for this message
Alexander Makarov (amakarov) wrote :

The root cause seems to be the following:

The catalog is cached using a dogpile.cache CacheRegion.
Every time the catalog is changed, the cache region instance gets invalidated:
https://bitbucket.org/zzzeek/dogpile.cache/src/c6fc55cd04f3a9e7e1301c21fa54c2e17d918c8b/dogpile/cache/region.py?at=master&fileviewer=file-view-default#region.py-304:346

As we can see in the code above, this invalidation is done by storing self._hard_invalidated and self._soft_invalidated on the CacheRegion instance. The values in the actual memcached backend are untouched.

This has the following effect: all other CacheRegion instances are simply unaware of the invalidation and continue to return stale data.

Filed a bug: https://bugs.launchpad.net/oslo.cache/+bug/1590779
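
A minimal sketch of that behaviour, using the in-memory backend with a shared dict to stand in for a shared memcached server (the key and value are made up for illustration):

    from dogpile.cache import make_region
    from dogpile.cache.api import NO_VALUE

    # One dict shared by both regions stands in for the shared memcached backend.
    shared_backend = {}

    def build_region():
        return make_region().configure(
            'dogpile.cache.memory',
            arguments={'cache_dict': shared_backend},
        )

    # Two CacheRegion instances (e.g. two keystone processes), one shared store.
    region_a = build_region()
    region_b = build_region()

    region_a.set('catalog', ['identity', 'image'])  # visible to both regions
    region_a.invalidate()                           # only flags region_a itself

    print(region_a.get('catalog') is NO_VALUE)      # True: A now sees a miss
    print(region_b.get('catalog'))                  # ['identity', 'image'] - stale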

Dina Belova (dbelova)
tags: added: area-keystone
Revision history for this message
Alex Schultz (alex-schultz) wrote :

So there are two options.

1) flush memcache during the deployment after endpoint creation to make sure we don't get a stale catalog
2) add cleanup logic to upload_cirros with extra retries & retry delays to work around the stale catalog issue.

ETA for initial code for both items is EOD US 6/9

There are possible risks with both items:

The risk of flushing memcache is that there may be performance issues related to completely flushing memcache when scaling the cluster.

The risk of the additional retries and delays is that the upload_cirros.rb script may be relying on the current implementation, and by changing its flow to do cleanups we might introduce additional failures in other scenarios.
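
For reference, option 1 boils down to something like the following (a sketch using python-memcached; the address is a placeholder, and a real deployment task would more likely do this via a puppet exec):

    import memcache

    # Placeholder address; the real task would use the controllers'
    # memcached endpoints from the deployment configuration.
    mc = memcache.Client(['192.168.0.2:11211'])

    # flush_all wipes every cached entry, including keystone tokens, which is
    # exactly the scaling/performance risk called out above.
    mc.flush_all()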

Revision history for this message
Dina Belova (dbelova) wrote :

Once both variants are proposed ^^ we will test them several times against the three environments we have where the issue reproduces reliably.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/327792

Revision history for this message
Alexander Makarov (amakarov) wrote : Re: Deployment of env with mongodb and ceilometer is failed
Revision history for this message
Alex Schultz (alex-schultz) wrote :

As a follow-up to #27, the upload_cirros.rb workarounds have been proposed. Looking into it further, I don't think flushing all of memcache is a viable option, so we should not do that. Ideally the fix to oslo.cache is the true fix, while the upload_cirros.rb retries are just a workaround. We should spend time testing the proposed oslo.cache fix from #30 and the upload_cirros.rb workarounds in https://review.openstack.org/327792

summary: - Deployment of env with mongodb and ceilometer is failed
+ Stale oslo.cache leads to false-positive cache misses
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/327792
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=e8e923c375c469a61ac9b355e0343c3bbe75c9fe
Submitter: Jenkins
Branch: stable/mitaka

commit e8e923c375c469a61ac9b355e0343c3bbe75c9fe
Author: Alex Schultz <email address hidden>
Date: Thu Jun 9 10:04:29 2016 -0600

    Improve reliability of upload_cirros.rb

    Previously we did not make sure the glance image was in an active state
    before exiting the upload_cirros.rb script. This change checks that the
    image upload was successful and the image is reporting active before
    returning success. Additionally this change increases the number of
    times we attempt to upload the image within the script and increases the
    task retries as well. We will attempt to cleanup the glance image
    between uploads as well if there are any failures.

    Change-Id: I17187cc6ec596a2631b958a9636ca984128b8cd7
    Partial-Bug: #1589433

tags: added: in-stable-mitaka
Revision history for this message
Dina Belova (dbelova) wrote :

Fix committed ^^

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/keystone (9.0/mitaka)

Fix proposed to branch: 9.0/mitaka
Change author: Boris Bobrov <email address hidden>
Review: https://review.fuel-infra.org/21938

tags: added: on-verification
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

cat /etc/fuel_build_id:
 481
cat /etc/fuel_build_number:
 481
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-misc-9.0.0-1.mos8458.noarch
 python-packetary-9.0.0-1.mos140.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-migrate-9.0.0-1.mos8458.noarch
 rubygem-astute-9.0.0-1.mos750.noarch
 fuel-mirror-9.0.0-1.mos140.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-openstack-metadata-9.0.0-1.mos8742.noarch
 fuel-notify-9.0.0-1.mos8458.noarch
 nailgun-mcagents-9.0.0-1.mos750.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-utils-9.0.0-1.mos8458.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8742.noarch
 fuel-library9.0-9.0.0-1.mos8458.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-ui-9.0.0-1.mos2717.noarch
 fuel-ostf-9.0.0-1.mos936.noarch
 fuelmenu-9.0.0-1.mos274.noarch
 fuel-nailgun-9.0.0-1.mos8742.noarch

tags: removed: on-verification
Boris Bobrov (bbobrov)
Changed in fuel:
milestone: 10.0 → 9.1
tags: added: swarm-fail
removed: swarm-blocker
Revision history for this message
Alexander Makarov (amakarov) wrote :

Waiting for backport of upstream fix: https://review.openstack.org/#/c/349704/

Revision history for this message
Alexander Makarov (amakarov) wrote :

Won't be ready for 9.1

Changed in fuel:
milestone: 9.1 → 9.2
Revision history for this message
Boris Bobrov (bbobrov) wrote :

This was fixed in 9.1.

Changed in fuel:
status: Confirmed → Invalid
status: Invalid → Fix Released