metadata service PicklingError: Can't pickle <class 'sqlalchemy.orm.session.Session'> when using memcached

Bug #1694666 reported by Eduardo Gonzalez on 2017-05-31
30
This bug affects 6 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
Dan Smith

Bug Description

Description
===========

Launching an instance causes a lot of error messages in nova-api logs.
The instance is not able to retrieve metadata.

How to reproduce
================

Deploy nova master branch.
Spawn an instance.
Wait the instance to be active.
In nova-api logs will see error messages.

Expected results
================

Instance retrieve metadata information

Actual results
==============

Instance is not able to retrieve metadata

Environment configuration
=========================

OpenStack deployed with kolla
Only source images from master fail, binary(rdo or ubuntu packages) works for now
Affected CentOS, Ubuntu and OracleLinux distributions
Database and memcached works as expected, other services consuming them are not affected.

Logs
====

All logs can be found at kolla gates:

Nova: http://logs.openstack.org/73/469373/1/check/gate-kolla-ansible-dsvm-deploy-centos-source-centos-7-nv/8cecb36/logs/kolla/nova/

Neutron: http://logs.openstack.org/73/469373/1/check/gate-kolla-ansible-dsvm-deploy-centos-source-centos-7-nv/8cecb36/logs/kolla/neutron/

Related errors:

Nova API:

2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler [req-3daa8e91-93e5-4676-b77a-048ad3dd53d2 - - - - -] Failed to get metadata for instance id: 8cbd067f-8cd6-4365-b299-3ffc146d0790
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler Traceback (most recent call last):
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/metadata/handler.py", line 285, in _get_meta_by_instance_id
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler remote_address)
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/metadata/handler.py", line 87, in get_metadata_by_instance_id
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler self._cache.set(cache_key, data)
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/cache_utils.py", line 116, in set
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler return self.region.set(key, value)
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler File "/var/lib/kolla/venv/lib/python2.7/site-packages/dogpile/cache/region.py", line 973, in set
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler self.backend.set(key, self._value(value))
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler File "/var/lib/kolla/venv/lib/python2.7/site-packages/dogpile/cache/backends/memcached.py", line 178, in set
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler **self.set_arguments
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_cache/backends/memcache_pool.py", line 32, in _run_method
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler return getattr(client, __name)(*args, **kwargs)
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler File "/var/lib/kolla/venv/lib/python2.7/site-packages/memcache.py", line 740, in set
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler return self._set("set", key, val, time, min_compress_len, noreply)
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler File "/var/lib/kolla/venv/lib/python2.7/site-packages/memcache.py", line 1060, in _set
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler return _unsafe_set()
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler File "/var/lib/kolla/venv/lib/python2.7/site-packages/memcache.py", line 1034, in _unsafe_set
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler store_info = self._val_to_store_info(val, min_compress_len)
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler File "/var/lib/kolla/venv/lib/python2.7/site-packages/memcache.py", line 998, in _val_to_store_info
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler pickler.dump(val)
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler PicklingError: Can't pickle <class 'sqlalchemy.orm.session.Session'>: it's not the same object as sqlalchemy.orm.session.Session
2017-05-31 09:09:34.703 31 ERROR nova.api.metadata.handler

Neutron metadata:

2017-05-31 09:09:24.985 23 DEBUG neutron.agent.metadata.agent [-] Request: GET /2009-04-04/meta-data/instance-id HTTP/1.0
Accept: */*
Connection: close
Content-Type: text/plain
Host: 169.254.169.254
User-Agent: curl/7.24.0 (x86_64-pc-linux-gnu) libcurl/7.24.0 OpenSSL/1.0.0j zlib/1.2.6
X-Forwarded-For: 10.0.0.5
X-Neutron-Network-Id: 50395d7c-fcc6-4aef-afdc-931c4573b0d1 __call__ /var/lib/kolla/venv/lib/python2.7/site-packages/neutron/agent/metadata/agent.py:86
2017-05-31 09:09:25.269 23 WARNING neutron.agent.metadata.agent [-] Remote metadata server experienced an internal server error.
2017-05-31 09:09:25.269 23 INFO eventlet.wsgi.server [-] 10.0.0.5,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 500 len: 361 time: 0.2839720
2017-05-31 09:09:27.460 23 DEBUG eventlet.wsgi.server [-] (23) accepted '' server /var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/wsgi.py:883

Let me know if need more information.

Regards, Eduardo

description: updated
Matt Riedemann (mriedem) wrote :

Does this also happen in Ocata? I'm wondering if this is a regression in Pike (current master branch).

When you say "Only source images from master fail, binary(rdo or ubuntu packages) works for now" - what is the difference? Are the binary packages from Ocata (nova 15.0.0)?

Eduardo Gonzalez (egonzalez90) wrote :

No, in Ocata is not happening:
Latest build and deploy from kolla gates (centos source):

http://logs.openstack.org/55/468655/1/check/gate-kolla-dsvm-deploy-centos-source-centos-7-nv/8fb1bea/logs/kolla/nova/nova-api.txt.gz

Matt Riedemann (mriedem) wrote :

So it looks like oslo.cache configured for dogpile/memcache will use python-memcached:

https://github.com/openstack/oslo.cache/blob/master/setup.cfg#L37

Which uses pickle:

https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L63

Reading http://www.benfrederickson.com/dont-pickle-your-data/ it sounds like that's a bad idea, not only in terms of performance but also security.

It sounds like this has come up before though, and we shouldn't be trying to serialize the entire InstanceMetadata object which contains things inside it like Nova Instance objects, which could contain oslo.db Sessions that have weird c-code or something inside them.

What Nova should probably be doing is creating a smaller primitive of the InstanceMetadata object for serializing. Someone though artom might be working on this, or maybe knows something that is. We could ask him.

Changed in nova:
status: New → Confirmed
summary: metadata service PicklingError: Can't pickle <class
- 'sqlalchemy.orm.session.Session'>
+ 'sqlalchemy.orm.session.Session'> when using memcached
Artom Lifshitz (notartom) wrote :

An idea we agreed on in IRC was to to_primitive() and from_primitive() before/after we hand/get stuff to/from cache. If anyone wants to have a whack at that they're welcome :) Otherwise I'll try to make it happen.

Matt Riedemann (mriedem) on 2017-06-01
Changed in nova:
importance: Undecided → Medium

Fix proposed to branch: master
Review: https://review.openstack.org/469978

Changed in nova:
assignee: nobody → Artom Lifshitz (notartom)
status: Confirmed → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/478991

Changed in nova:
assignee: Artom Lifshitz (notartom) → Dan Smith (danms)

Reviewed: https://review.openstack.org/478991
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2fee972bde4a04d398d32aa6c8b6d27819db697b
Submitter: Jenkins
Branch: master

commit 2fee972bde4a04d398d32aa6c8b6d27819db697b
Author: Dan Smith <email address hidden>
Date: Thu Jun 29 09:42:20 2017 -0700

    Sanitize instance in InstanceMetadata to avoid un-pickleable context

    This is a more strategic fix for the issue of us trying to pickle
    an instance with a context that has complex data structures inside
    (i.e. SQLAlchemy connections) into the oslo cache. The right solution
    is for us to stop storing random python objects (InstanceMetadata)
    in the cache with pickle. However, that's a larger change and more
    complex for deployers to roll over. This attempts to sanitize the
    instance before we pickle it to get things working again.

    Change-Id: Ie7d97ce5c62c8fb9da5822590a64210521f8ae7a
    Closes-Bug: #1694666

Changed in nova:
status: In Progress → Fix Released
Boris Derzhavets (bderzhavets) wrote :

Following topologies tested OK via TripleO QuickStart Master branch overcloud deployments

1. 1xController + 2xComputes
2. 1xController + 1xCompute + 2xCephNodes

It's possible to deploy HA topology, but it seems to be affected by different bug :-
Attaching interface (private subnet) to HA Neutron Router been setup on PCS Controller's
cluster comes every time to Down port status.
Neutron Server log reports :-

2017-07-03 10:04:53.779 68873 WARNING stevedore.named [req-a81938fc-bb4e-4dd3-97ce-a67503e1799c - - - - -] Could not load networking_l2gw.
services.l2gateway.service_drivers.rpc_l2gw.L2gwRpcDriver
2017-07-03 10:04:53.779 68873 WARNING stevedore.named [req-a81938fc-bb4e-4dd3-97ce-a67503e1799c - - - - -]Could not loadnetworking_bgpvpn.
neutron.services.service_drivers.driver_api.BGPVPNDriver
2017-07-03 10:04:53.802 68873 INFO neutron.services.service_base [req-a81938fc-bb4e-4dd3-97ce-a67503e1799c - - - - -]
Default provider is not specified for service type L3_ROUTER_NAT: DefaultServiceProviderNotFound:
Service type L3_ROUTER_NAT does not have a default service provide

I am not sure are this messages related to the issue or not.

Boris Derzhavets (bderzhavets) wrote :

>It's possible to deploy HA topology, but it seems to be affected by different bug :-
>Attaching interface (private subnet) to HA Neutron Router been setup on PCS Controller's

Been built by TripleO QS HA Neutron Router works as expected (normally) no mater that
standard tools detect internal inteface port "DOWN". It is actually "UP"
So,TripleO QS HA overcloud deployment is OK with patches been suggested.

Change abandoned by Dan Smith (<email address hidden>) on branch: master
Review: https://review.openstack.org/479325

This issue was fixed in the openstack/nova 16.0.0.0b3 development milestone.

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/469978
Reason: This review is > 4 weeks without comment, and is not mergable in it's current state. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers