Stale oslo.cache leads to false-positive cache misses

Bug #1589433 reported by Yury Tregubov
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Released
Critical
Alexander Makarov
Mitaka
Fix Released
Critical
Alex Schultz

Bug Description

Deployment of an env with the node roles below has been failing since 9.0 MOS ISO #427:

1 Controller
1 Compute + Cinder
1 MongoDB

The previous successful deploy was on MOS 9.0 ISO #424.

The following errors from puppet are seen in the logs on the affected controller:

2016-06-05 22:17:45 ERR /usr/lib/ruby/vendor_ruby/puppet/util/errors.rb:106:in `fail'
2016-06-05 22:17:45 ERR /bin/bash "/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1 instead of one of [0]
2016-06-05 22:17:35 ERR (/Stage[main]/Main/Exec[upload_cirros_shell]/returns) change from notrun to 0 failed: /bin/bash "/etc/puppet/shell_manifests/upload_cirros_command.sh" returned 1 instead of one of [0]

Also, on the node with MongoDB:

/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager Traceback (most recent call last):
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/ceilometer/agent/manager.py", line 458, in discover
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager discovered = discoverer.discover(self, param)
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/ceilometer/network/services/discovery.py", line 68, in discover
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager loadbalancers = self.neutron_cli.list_loadbalancer()
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/ceilometer/neutron_client.py", line 401, in list_loadbalancer
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager resp = self.client.list_loadbalancers()
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 97, in with_params
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager ret = self.function(instance, *args, **kwargs)
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 1045, in list_loadbalancers
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager retrieve_all, **_params)
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager File "/usr/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 373, in list
/var/log/remote/10.109.2.3/ceilometer-polling.log:2016-06-06 08:53:05.924 10630 ERROR ceilometer.agent.manager for r in self._pagination(collection, path, **params):

/var/log/remote/node-3.test.domain.local/bootstrap/mcollective.log:2016-06-05T21:29:05.709647+00:00 err: 21:29:05.219092 #1484] ERROR -- : pluginmanager.rb:171:in `rescue in loadclass' Failed to load MCollective::Agent::Puppetd: cannot load such file -- puppet
/var/log/remote/node-3.test.domain.local/bootstrap/mcollective.log:2016-06-05T21:29:05.709728+00:00 err: 21:29:05.219197 #1484] ERROR -- : agents.rb:71:in `rescue in loadagent' Loading agent puppetd failed: cannot load such file -- puppet
/var/log/remote/node-3.test.domain.local/bootstrap/mcollective.log:2016-06-05T21:29:27.865469+00:00 err: 21:29:27.139772 #1795] ERROR -- : pluginmanager.rb:171:in `rescue in loadclass' Failed to load MCollective::Agent::Puppetd: cannot load such file -- puppet
/var/log/remote/node-3.test.domain.local/bootstrap/mcollective.log:2016-06-05T21:29:27.865564+00:00 err: 21:29:27.139866 #1795] ERROR -- : agents.rb:71:in `rescue in loadagent' Loading agent puppetd failed: cannot load such file -- puppet
/var/log/shotgun.log:2016-06-05 22:18:16 ERROR 11374 (manager) Clearing target failed. function: <built-in function listdir>, path: /var/dump, excinfo: (<type 'exceptions.OSError'>, OSError(2, 'No such file or directory'), <traceback object at 0x2e567e8>)
/var/log/shotgun.log:2016-06-05 22:18:16 ERROR 11374 (manager) Clearing target failed. function: <built-in function rmdir>, path: /var/dump, excinfo: (<type 'exceptions.OSError'>, OSError(2, 'No such file or directory'), <traceback object at 0x2e56ea8>)

The problem is still reproducible. A diagnostic snapshot from ISO #443 is attached.
The other environments, without a MongoDB node, are deploying without problems at the moment.

Revision history for this message
Yury Tregubov (ytregubov) wrote :
summary: - Deployment of env with mongodb is failed
+ Deployment of env with mongodb and ceilometer is failed
tags: added: area-library
tags: added: module-astute module-nailgun
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Changed in fuel:
importance: Undecided → Critical
description: updated
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/newton
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote : Re: Deployment of env with mongodb and ceilometer is failed

Not reproduced on #445, closing as Invalid.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

I believe this might be a duplicate of Bug 1589433

Revision history for this message
Alex Schultz (alex-schultz) wrote :
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Alex, it is a different bug.

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → MOS Keystone (mos-keystone)
Andrey Maximov (maximov)
tags: removed: module-astute module-nailgun
Revision history for this message
Alexander Makarov (amakarov) wrote :

The keystone client in the glance process can't extract the swift endpoint from the token catalog,
so work is under way to understand why.
The problem with reproducing the bug is that it vanishes some time after deployment.
Yury is about to add token data logging to the keystone client so that the token catalog is visible in the glance log.

Changed in fuel:
assignee: MOS Keystone (mos-keystone) → Alexander Makarov (amakarov)
Revision history for this message
Alexander Makarov (amakarov) wrote :

Another concern:
glance (along with ironic and ceilometer) has a habit of asking the keystone client for endpoints with interface type "internalURL" even when using the keystone v3 API, where no such endpoint type exists - there it is called "internal".

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Alexander, does this mean it is an issue in Glance or an incorrect configuration of endpoints?

Changed in fuel:
status: In Progress → Confirmed
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Status changed to Confirmed because we don't have any fixes on review yet.

Priority is Critical because the issue fails deployments.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Do we have ETA for the fix?

Dmitry Pyzhov (dpyzhov)
tags: removed: area-library
Revision history for this message
Alexander Makarov (amakarov) wrote :

Timur, we need a custom ISO with additional logging before we can say anything.
Yury Tregubov is working on it.

Revision history for this message
Alexander Makarov (amakarov) wrote :

Update to #7:
KSC normalizes the endpoint type, so 'internal' and 'internalURL' are both valid for v3 (and for v2 too):
https://github.com/openstack/python-keystoneclient/blob/master/keystoneclient/service_catalog.py#L387-L391
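
To illustrate the point: this is a rough paraphrase of what that v3 normalization does (not a verbatim copy of the linked code), as a standalone Python snippet:

    # Rough paraphrase of the linked keystoneclient v3 normalization: a
    # trailing "URL" is stripped, so 'internalURL' and 'internal' end up
    # selecting the same interface when looking up v3 endpoints.
    def normalize_v3_endpoint_type(endpoint_type):
        if endpoint_type and 'URL' in endpoint_type:
            endpoint_type = endpoint_type.replace('URL', '')
        return endpoint_type

    assert normalize_v3_endpoint_type('internalURL') == 'internal'
    assert normalize_v3_endpoint_type('internal') == 'internal'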

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/mitaka)

Change abandoned by Boris Bobrov (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/326640
Reason: doesn't actually fix.

Revision history for this message
Dina Belova (dbelova) wrote : Re: Deployment of env with mongodb and ceilometer is failed

OK, so here are the most recent results.

== TL; DR == Root cause is not clear ==

1) reverting https://review.openstack.org/#/c/320845/ does not help; the fact that Alex Schultz did not observe a deployment failure during his tests is just a coincidence - the bug is not 100% reproducible. There are 3 retry attempts to upload this image, and today we've seen all variants, from a failed deployment to a successful one where the first attempt failed and the second succeeded. Therefore Boris is abandoning these reverts.
2) from the puppet side everything looks OK
3) all endpoints are created correctly ~10 minutes before glance asks for the swift endpoint; these endpoints do not change during the deployment process and nothing modifies them
4) still, the image upload might fail with the error message listed in the bug. Alex M. is on task to add the required logging almost everywhere and try to catch something useful on the keystone client side

tags: added: swarm-blocker
Revision history for this message
Alexander Makarov (amakarov) wrote :
Revision history for this message
Alexander Makarov (amakarov) wrote :

Endpoints have the following format:
{
    "url": "http://10.109.1.6:8776/v3/7f071c08330e4e38be602307f5131f89",
    "interface": "internal",
    "region": "RegionOne",
    "region_id": "RegionOne",
    "id": "68d1c11b4e124e63bd8086b15d317e4c"
}

Error message states:

Cannot find swift service endpoint : internalURL endpoint for object-store service in RegionOne region not found.

It looks for "object-store" service, which is absent in the endpoint structure above.

To be continued...

Revision history for this message
Alexander Makarov (amakarov) wrote :

The service type is stored one level above, so the previous concern is invalid.

Revision history for this message
Alexander Makarov (amakarov) wrote :

Let's try to include in the error message the service types that are available at the moment of failure: https://review.fuel-infra.org/#/c/21779/

Revision history for this message
Alexander Makarov (amakarov) wrote :

Well, the catalog is incomplete:
2016-06-08 20:13:59.047 23387 ERROR glance.api.v2.image_data BackendException: Cannot find swift service endpoint : internalURL endpoint for object-store service in RegionOne region not found among [u'compute', u'network', u'volumev2', u'volumev3', u'alarming', u'image', u'metering', u'artifact', u'compute_legacy', u'volume', u'cloudformation', u'orchestration', u'identity'] among [u'compute', u'network', u'volumev2', u'volumev3', u'alarming', u'image', u'metering', u'artifact', u'compute_legacy', u'volume', u'cloudformation', u'orchestration', u'identity']

There is no object-store service type in this list,
but it gets into the catalog afterwards.
We need to investigate the deployment sequence.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

I could not reproduce the issue across 4 redeployments. Each time the swift endpoint was added to keystone 15-20 minutes before the cirros image was uploaded to glance.

Revision history for this message
Dina Belova (dbelova) wrote :

Just a random note: keystone caches the service catalog internally:

https://github.com/openstack/keystone/blob/stable/mitaka/keystone/catalog/core.py#L296
https://github.com/openstack/keystone/blob/stable/mitaka/keystone/catalog/core.py#L303

The default expiration time is 600 seconds, so glance may consume an old cached service catalog... I'm not sure how exactly this might happen, or which service asked for the catalog first and got the wrong (not yet populated) value cached, since keystone is the first service we install. So this might be as far-fetched as the other proposals.
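
For context, the two linked lines memoize the catalog lookup through dogpile.cache. Here is a minimal sketch of that pattern; the in-memory backend, function name and return value are illustrative stand-ins (keystone builds its region via oslo.cache, typically backed by memcached), and only the 600-second expiration matches the default mentioned above:

    from dogpile.cache import make_region

    # Illustrative region; keystone configures its region through oslo.cache.
    catalog_region = make_region().configure(
        'dogpile.cache.memory',
        expiration_time=600,  # the 600 s default mentioned above
    )

    @catalog_region.cache_on_arguments()
    def get_v3_catalog(user_id, project_id):
        # Stand-in for the real catalog driver query: whatever this returns
        # the first time is what every later caller sees until the entry
        # expires or the region is explicitly invalidated.
        return ['identity', 'image', 'object-store']

    # The first call populates the cache; calls within the next 600 seconds
    # return the cached snapshot even if new endpoints were registered since.
    print(get_v3_catalog('user-1', 'project-1'))
    print(get_v3_catalog('user-1', 'project-1'))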

Revision history for this message
Alexander Makarov (amakarov) wrote :

Deploying with https://review.fuel-infra.org/#/c/21888/
It adds logging of the catalog service types, both those added to the issued token and those stored in the database (retrieved bypassing the cache).

Revision history for this message
Dina Belova (dbelova) wrote :

OK, it looks like we were able to find the root cause of the issue.

================
What's happening
================

Here is the flow of actions we see on the failed environments:
- we call glance to upload an image
- glance uses the swift client to grab swift's endpoint when Keystone v3 is used - https://github.com/openstack/glance_store/blob/master/glance_store/_drivers/swift/connection_manager.py#L150
- the swift client looks the endpoint up in the service catalog - https://github.com/openstack/python-swiftclient/blob/master/swiftclient/client.py#L583-L589 - in fact, the service catalog extraction is cached (10-minute expiration time), so an old service catalog snapshot might be used
- if the HW is quick enough, the deployment completes really quickly (on our test environments a full redeploy takes ~25 minutes, so the cache expires 1-2 times at most during the deployment) -> the cache might not expire and refresh the service catalog before the image upload starts -> we face a failure

=================
What can be done?
=================

1) set a shorter service catalog expiration time in the keystone config - *No! Let's not do it!* The service catalog is needed during token creation -> this will affect token management time -> that's really a bad idea
2) add more retry logic to the image upload - this is a good thing to start with. Let's do it
3) find out why the service catalog is not invalidated when we ask for the endpoint, and fix it - this is also a good idea, let's do it as well. This will prevent issues like this in the future.

Revision history for this message
Alexander Makarov (amakarov) wrote :

This is from the keystone custom log:

Populating service catalog with [u'compute_legacy', u'volume', u'alarming', u'identity', u'volumev2', u'metering', u'compute', u'artifact', u'image', u'volumev3', u'network'], types in db: [u'compute_legacy', u'object-store', u'cloudformation', u'volume', u'alarming', u'identity', u'volumev2', u'metering', u'compute', u'artifact', u's3', u'image', u'orchestration', u'volumev3', u'network']

There is no object-store in the catalog attached to the token, but it is in the database, so it's clearly a cache invalidation issue.

Revision history for this message
Sergey Kolekonov (skolekonov) wrote :

I think we also need to add retries on image upload to this script:
/etc/puppet/modules/osnailyfacter/modular/astute/upload_cirros.rb
Please also note that the broken image left after a failed attempt should be removed from Glance.
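
The script itself is Ruby; purely to illustrate the retry-and-cleanup pattern being discussed, here is a rough Python sketch against the python-glanceclient v2 API. The credentials, image path, retry count and delay are placeholders, not what upload_cirros.rb actually uses:

    import time

    from glanceclient import Client
    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    # Placeholder credentials, for illustration only.
    auth = v3.Password(auth_url='http://127.0.0.1:5000/v3',
                       username='admin', password='secret', project_name='admin',
                       user_domain_id='default', project_domain_id='default')
    glance = Client('2', session=session.Session(auth=auth))

    def upload_with_retries(image_path, attempts=5, delay=30):
        for attempt in range(1, attempts + 1):
            image = glance.images.create(name='TestVM', disk_format='qcow2',
                                         container_format='bare',
                                         visibility='public')
            try:
                with open(image_path, 'rb') as data:
                    glance.images.upload(image.id, data)
                if glance.images.get(image.id).status == 'active':
                    return image.id
                # Image never went active: remove it before retrying.
                glance.images.delete(image.id)
            except Exception:
                # Remove the broken image before the next attempt.
                glance.images.delete(image.id)
            time.sleep(delay)
        raise RuntimeError('image upload failed after %d attempts' % attempts)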

Revision history for this message
Alexander Makarov (amakarov) wrote :

The root cause seems to be the following:

The catalog is cached using a dogpile.cache CacheRegion.
Every time the catalog is changed, the cache region instance gets invalidated:
https://bitbucket.org/zzzeek/dogpile.cache/src/c6fc55cd04f3a9e7e1301c21fa54c2e17d918c8b/dogpile/cache/region.py?at=master&fileviewer=file-view-default#region.py-304:346

As we can see in the code above, this invalidation is done by storing self._hard_invalidated and self._soft_invalidated on the CacheRegion instance. The values in the actual memcached backend are untouched.

This has the following effect: all other CacheRegion instances are simply unaware of the invalidation and continue to return stale data.

Filed a bug: https://bugs.launchpad.net/oslo.cache/+bug/1590779
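
A minimal sketch of that behaviour, using the in-memory backend with a shared dict to stand in for a shared memcached server (the key and value are made up for illustration):

    from dogpile.cache import make_region
    from dogpile.cache.api import NO_VALUE

    # One dict shared by both regions stands in for the shared memcached backend.
    shared_backend = {}

    def build_region():
        return make_region().configure(
            'dogpile.cache.memory',
            arguments={'cache_dict': shared_backend},
        )

    # Two CacheRegion instances (e.g. two keystone processes), one shared store.
    region_a = build_region()
    region_b = build_region()

    region_a.set('catalog', ['identity', 'image'])  # visible to both regions
    region_a.invalidate()                           # only flags region_a itself

    print(region_a.get('catalog') is NO_VALUE)      # True: A now sees a miss
    print(region_b.get('catalog'))                  # ['identity', 'image'] - stale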

Dina Belova (dbelova)
tags: added: area-keystone
Revision history for this message
Alex Schultz (alex-schultz) wrote :

So there are two options.

1) flush memcache during the deployment after endpoint creation to make sure we don't get a stale catalog
2) add cleanup logic to upload_cirros with extra retries & retry delays to work around the stale catalog issue.

ETA for initial code for both items is EOD US 6/9

There are possible risks with both items:

The risk of flushing memcache is that there may be performance issues related to completely flushing memcache when scaling the cluster.

The risk of the additional retries and delays is that the upload_cirros.rb script may be relying on the current implementation, and by changing its flow to do cleanups we might introduce additional failures in other scenarios.
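
For reference, option 1 boils down to something like the following (a sketch using python-memcached; the address is a placeholder, and a real deployment task would more likely do this via a puppet exec):

    import memcache

    # Placeholder address; the real task would use the controllers'
    # memcached endpoints from the deployment configuration.
    mc = memcache.Client(['192.168.0.2:11211'])

    # flush_all wipes every cached entry, including keystone tokens, which is
    # exactly the scaling/performance risk called out above.
    mc.flush_all()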

Revision history for this message
Dina Belova (dbelova) wrote :

Once both variants are proposed ^^ we will test them several times against the three environments we have where the issue reproduces reliably.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/327792

Revision history for this message
Alexander Makarov (amakarov) wrote : Re: Deployment of env with mongodb and ceilometer is failed
Revision history for this message
Alex Schultz (alex-schultz) wrote :

As a follow-up to #27, the upload_cirros.rb workarounds have been proposed. Looking into it further, I don't think flushing all of memcache is a viable option, so we should not do that. Ideally the fix to oslo.cache is the true fix, while the upload_cirros.rb retries are just a workaround. We should spend time testing the proposed oslo.cache fix from #30 and the upload_cirros.rb workarounds in https://review.openstack.org/327792

summary: - Deployment of env with mongodb and ceilometer is failed
+ Stale oslo.cache leads to false-positive cache misses
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/327792
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=e8e923c375c469a61ac9b355e0343c3bbe75c9fe
Submitter: Jenkins
Branch: stable/mitaka

commit e8e923c375c469a61ac9b355e0343c3bbe75c9fe
Author: Alex Schultz <email address hidden>
Date: Thu Jun 9 10:04:29 2016 -0600

    Improve reliability of upload_cirros.rb

    Previously we did not make sure the glance image was in an active state
    before exiting the upload_cirros.rb script. This change checks that the
    image upload was successful and the image is reporting active before
    returning success. Additionally this change increases the number of
    times we attempt to upload the image within the script and increases the
    task retries as well. We will attempt to cleanup the glance image
    between uploads as well if there are any failures.

    Change-Id: I17187cc6ec596a2631b958a9636ca984128b8cd7
    Partial-Bug: #1589433

tags: added: in-stable-mitaka
Revision history for this message
Dina Belova (dbelova) wrote :

Fix committed ^^

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/keystone (9.0/mitaka)

Fix proposed to branch: 9.0/mitaka
Change author: Boris Bobrov <email address hidden>
Review: https://review.fuel-infra.org/21938

tags: added: on-verification
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

cat /etc/fuel_build_id:
 481
cat /etc/fuel_build_number:
 481
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-misc-9.0.0-1.mos8458.noarch
 python-packetary-9.0.0-1.mos140.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-migrate-9.0.0-1.mos8458.noarch
 rubygem-astute-9.0.0-1.mos750.noarch
 fuel-mirror-9.0.0-1.mos140.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-openstack-metadata-9.0.0-1.mos8742.noarch
 fuel-notify-9.0.0-1.mos8458.noarch
 nailgun-mcagents-9.0.0-1.mos750.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-utils-9.0.0-1.mos8458.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8742.noarch
 fuel-library9.0-9.0.0-1.mos8458.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-ui-9.0.0-1.mos2717.noarch
 fuel-ostf-9.0.0-1.mos936.noarch
 fuelmenu-9.0.0-1.mos274.noarch
 fuel-nailgun-9.0.0-1.mos8742.noarch

tags: removed: on-verification
Boris Bobrov (bbobrov)
Changed in fuel:
milestone: 10.0 → 9.1
tags: added: swarm-fail
removed: swarm-blocker
Revision history for this message
Alexander Makarov (amakarov) wrote :

Waiting for backport of upstream fix: https://review.openstack.org/#/c/349704/

Revision history for this message
Alexander Makarov (amakarov) wrote :

Won't be ready for 9.1

Changed in fuel:
milestone: 9.1 → 9.2
Revision history for this message
Boris Bobrov (bbobrov) wrote :

This was fixed in 9.1.

Changed in fuel:
status: Confirmed → Invalid
status: Invalid → Fix Released