Cinder encounters dbapi error: NoSuchColumnError: "Could not locate column in row for column '%(140070887032400 anon)s.volumes_id'"

Bug #1417018 reported by Sam Wan
This bug affects 7 people
Affects    Status         Importance   Assigned to
Cinder     Fix Released   High         Eric Harney
oslo.db    Fix Released   Medium       Unassigned

Bug Description

2015-02-02 01:20:37.756 6815 CRITICAL cinder [req-7b1344bc-879c-4576-abab-be40294edc2f - - - - -] NoSuchColumnError: "Could not locate column in row for column '%(140070887032400 anon)s.volumes_id'"
2015-02-02 01:20:37.756 6815 TRACE cinder Traceback (most recent call last):
2015-02-02 01:20:37.756 6815 TRACE cinder File "/usr/local/bin/cinder-volume", line 9, in <module>
2015-02-02 01:20:37.756 6815 TRACE cinder load_entry_point('cinder==2015.1.dev310', 'console_scripts', 'cinder-volume')()
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/cmd/volume.py", line 69, in main
2015-02-02 01:20:37.756 6815 TRACE cinder binary='cinder-volume')
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/service.py", line 243, in create
2015-02-02 01:20:37.756 6815 TRACE cinder service_name=service_name)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/service.py", line 127, in __init__
2015-02-02 01:20:37.756 6815 TRACE cinder *args, **kwargs)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/volume/manager.py", line 188, in __init__
2015-02-02 01:20:37.756 6815 TRACE cinder context.get_admin_context())
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/volume/manager.py", line 262, in _set_voldb_empty_at_startup_indicator
2015-02-02 01:20:37.756 6815 TRACE cinder None, filters=None)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/db/api.py", line 199, in volume_get_all
2015-02-02 01:20:37.756 6815 TRACE cinder filters=filters)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/db/sqlalchemy/api.py", line 156, in wrapper
2015-02-02 01:20:37.756 6815 TRACE cinder return f(*args, **kwargs)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/db/sqlalchemy/api.py", line 1207, in volume_get_all
2015-02-02 01:20:37.756 6815 TRACE cinder return query.all()
2015-02-02 01:20:37.756 6815 TRACE cinder File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2320, in all
2015-02-02 01:20:37.756 6815 TRACE cinder return list(self)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/loading.py", line 73, in instances
2015-02-02 01:20:37.756 6815 TRACE cinder rows = [process[0](row, None) for row in fetch]
2015-02-02 01:20:37.756 6815 TRACE cinder File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/loading.py", line 363, in _instance
2015-02-02 01:20:37.756 6815 TRACE cinder tuple([row[column] for column in pk_cols])
2015-02-02 01:20:37.756 6815 TRACE cinder File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 331, in _key_fallback
2015-02-02 01:20:37.756 6815 TRACE cinder expression._string_or_unprintable(key))
2015-02-02 01:20:37.756 6815 TRACE cinder NoSuchColumnError: "Could not locate column in row for column '%(140070887032400 anon)s.volumes_id'"
2015-02-02 01:20:37.756 6815 TRACE cinder

A detailed log can be found at http://publiclogs.emc.com/vnx_ostack/EMC_VNX_ISCSI/493/logs/screen-c-vol.txt.gz

# pip list|grep -i mysql-python
MySQL-python (1.2.5)

# pip list|grep cinder
cinder (2015.1.dev310, /opt/stack/new/cinder)

Revision history for this message
Mike Bayer (zzzeek) wrote :

Can I get any insight as to under what circumstances this occurs, how often, and what would be the best way for me to reproduce it? In an ideal world, a SQL log and/or the specific version in which this started happening would also help. Thanks.

Revision history for this message
Mike Bayer (zzzeek) wrote :

A plain run of cinder-volume on a fresh devstack doesn't fail in this way for me; sample logging is at http://paste.openstack.org/show/175492/. I'm asking on IRC what additional configuration is needed to produce this error, as the log output on this bug report looks very different from mine.

Revision history for this message
Mike Bayer (zzzeek) wrote :

OK, I have it reproducing, and this is the worst issue possible: Cinder is sharing a single MySQL connection between two greenlets. This is an extremely severe problem which I hope is not present in other OpenStack apps.

I've added extra logging to SQLAlchemy that records thread.get_ident() plus the MySQL connection identifier, and we can see clear overlap of statements on the same connection in two different threads here: http://paste.openstack.org/show/175571/
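
For anyone wanting to try similar instrumentation, here is a rough sketch (not the actual patch used for the paste above; the logger and function names are illustrative) using SQLAlchemy's before_cursor_execute hook to tag every statement with the current thread id and the id of the underlying DBAPI connection:

# Sketch only: make interleaving of statements on a shared DBAPI connection
# visible by logging which thread issues each statement on which connection.
import logging
import thread  # Python 2.7, matching this environment

from sqlalchemy import event
from sqlalchemy.engine import Engine

log = logging.getLogger("sqlalchemy.trace")


@event.listens_for(Engine, "before_cursor_execute")
def trace_statement(conn, cursor, statement, parameters, context, executemany):
    # conn.connection.connection is the raw MySQLdb connection object
    log.debug("thread=%s dbapi_conn=%s stmt=%s",
              thread.get_ident(), id(conn.connection.connection), statement)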

Revision history for this message
Mike Bayer (zzzeek) wrote :

OK, found the bug.

Cinder is spawning a subprocess using oslo.concurrency and is failing to call create_engine() local to that process, or to dispose the existing engine so that it makes new connections. You *cannot* share a SQLAlchemy connection pool across processes, because they store file descriptors. Calling engine.dispose() when the child fork begins would be sufficient.
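
As a rough sketch of that advice (the module layout and engine URL here are hypothetical, not Cinder's actual code), disposing the inherited pool at the start of a forked child looks like this:

# Sketch only: a forked child must not reuse the parent's pooled DBAPI
# connections, since they wrap shared file descriptors; disposing the pool
# makes SQLAlchemy open fresh connections on demand in the child.
import os

from sqlalchemy import create_engine

engine = create_engine("mysql://user:password@localhost/cinder")  # hypothetical URL

pid = os.fork()
if pid == 0:
    # child process: drop every connection inherited from the parent
    engine.dispose()
    # any query issued from here on gets a brand-new connection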

Am going to add an error condition to oslo.db and will try to poke around cinder to see if there's some obvious way to patch this.

Revision history for this message
Mike Bayer (zzzeek) wrote :

Steps to reproduce:

1. With a devstack setup, make sure Cinder has some extra services turned on (I think Ceph is the important one, but this is what worked for me):

[[local|localrc]]
ADMIN_PASSWORD=secrete
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
SERVICE_TOKEN=a682f596-76f3-11e3-b3b2-e716f9080d50
FIXED_RANGE=172.31.1.0/24
FLOATING_RANGE=10.0.0.0/24
HOST_IP=192.168.1.221

# Services {{{1

## prerequisites
ENABLED_SERVICES=rabbit,mysql,key

## ceph
ENABLED_SERVICES+=,ceph
TEMPEST_STORAGE_PROTOCOL=ceph

## nova
ENABLED_SERVICES+=,n-api,n-crt,n-cpu,n-cond,n-sch,n-net

## glance
ENABLED_SERVICES+=,g-api,g-reg

## cinder
ENABLED_SERVICES+=,cinder,c-api,c-sch,c-vol,c-bak
CINDER_ENABLED_BACKENDS=lvm:lvm-1,lvm:lvm-2,ceph:ceph
CINDER_SECURE_DELETE=False

2. Run stack.sh fully so that everything is running.

3. Kill off cinder-volume:

sudo killall cinder-volume

4. Apply this patch to Cinder:

diff --git a/cinder/volume/manager.py b/cinder/volume/manager.py
index e488ecd..24b9619 100644
--- a/cinder/volume/manager.py
+++ b/cinder/volume/manager.py
@@ -287,7 +287,8 @@ class VolumeManager(manager.SchedulerDependentManager):
             # to initialize the driver correctly.
             return

-        volumes = self.db.volume_get_all_by_host(ctxt, self.host)
+        for i in range(100):
+            volumes = self.db.volume_get_all_by_host(ctxt, self.host)
         # FIXME volume count for exporting is wrong
         LOG.debug("Re-exporting %s volumes" % len(volumes))

All we're doing here is cranking on that SQL query repeatedly so that it collides with the subprocess.

5. Run cinder from the console:

/usr/bin/cinder-volume --config-file /etc/cinder/cinder.conf

Revision history for this message
Mike Bayer (zzzeek) wrote :

Normally I like to beat people over the head with "please call engine.dispose() in a subprocess!!", but there's a better way, which is to just force a reconnect in the pool event. So I can get oslo.db to work around this.
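
A sketch of that pool-event idea, adapted from the SQLAlchemy multiprocessing recipe (the function name is illustrative and this is not the literal oslo.db patch): record the PID that created each connection and refuse to check it out from a different PID, so the pool reconnects after a fork.

# Sketch only: tag each new DBAPI connection with the creating PID, and on
# checkout discard connections that were created in another process, forcing
# the pool to open a fresh connection in the forked child.
import os

from sqlalchemy import event, exc


def add_process_guards(engine):
    @event.listens_for(engine, "connect")
    def on_connect(dbapi_connection, connection_record):
        connection_record.info['pid'] = os.getpid()

    @event.listens_for(engine, "checkout")
    def on_checkout(dbapi_connection, connection_record, connection_proxy):
        pid = os.getpid()
        if connection_record.info['pid'] != pid:
            # detach rather than close, so the parent's descriptor is untouched
            connection_record.connection = connection_proxy.connection = None
            raise exc.DisconnectionError(
                "Connection record belongs to pid %s, attempting to check out "
                "in pid %s" % (connection_record.info['pid'], pid))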

Mike Bayer (zzzeek)
Changed in cinder:
assignee: nobody → Mike Bayer (zzzeek)
Revision history for this message
Mike Bayer (zzzeek) wrote :

the review at https://review.openstack.org/#/c/156725/ will resolve this from the oslo.db side.

Revision history for this message
Victor Stinner (vstinner) wrote :

"Cinder is spawning a subprocess using oslo.concurrency and is failing to call create_engine() local to that process, or to dispose the existing engine so that it makes new connections."

I don't understand how a child process can share data with a parent process if it's simply started with subprocess.Popen. The oslo.concurrency module does the right thing: it calls Popen with close_fds=True, so file descriptors cannot be inherited by the child process.

This issue looks more like a funny bug related to fork. In the traceback of the following email, I see that the evil Service class is used.
http://lists.openstack.org/pipermail/openstack-dev/2015-February/057184.html

The Service class uses os.fork().

IMO the whole issue comes from the crazy design of the Service class: instantiate the application, fork, then do some hacks to work around fork issues. It would be safer to fork and *then* instantiate the class.
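
Purely for illustration (ToyService is a hypothetical stand-in, not the real Service class), the two orderings look like this:

# Sketch: in the first pattern the engine and its pool already exist when
# fork() happens, so the child inherits them; in the second, each child
# builds its own service, and therefore its own engine, after fork().
import os

from sqlalchemy import create_engine


class ToyService(object):
    def __init__(self):
        # instantiating the service creates its engine, and thus its pool
        self.engine = create_engine("sqlite://")  # hypothetical URL

    def run(self):
        self.engine.connect().close()


def instantiate_then_fork():
    service = ToyService()      # engine/pool created before fork()
    if os.fork() == 0:
        service.run()           # child reuses the parent's pool
        os._exit(0)


def fork_then_instantiate():
    if os.fork() == 0:
        ToyService().run()      # child creates its own engine/pool
        os._exit(0)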

Maybe I'm completly wrong, and fork is unrelated to the issue.

Revision history for this message
Mike Bayer (zzzeek) wrote :

Okey doke, as Doug said, it's the fork() in common/service.py; I didn't search too hard to see where the actual child pids were coming from. So, revising my comment:

"OK , found the bug.

"Cinder is spawning a *child* process using **fork() in cinder/opentack/common/service.py** and is failing to call create_engine() local to that process, or to dispose the existing engine so that it makes new connections. You *cannot* share a SQLAlchemy connection pool across processes, because they store file descriptors. Calling engine.dispose() when the child fork begins would be sufficient."

Revision history for this message
Viktor Serhieiev (vsergeyev) wrote :
Changed in oslo.db:
assignee: nobody → Mike Bayer (zzzeek)
status: New → In Progress
importance: Undecided → Medium
Revision history for this message
Doug Hellmann (doug-hellmann) wrote :

Related fix to service code in the incubator: https://review.openstack.org/157608

Changed in oslo-incubator:
milestone: none → kilo-3
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Mike Bayer (zzzeek) wrote :

The fixes here are known, though I'm not sure what approach oslo.db will take. Cinder developers should likely ensure that the engine is refreshed within their child fork to fix the issue immediately. Note also related issues such as https://bugs.launchpad.net/cinder/+bug/1430859.

Changed in cinder:
assignee: Mike Bayer (zzzeek) → nobody
Changed in oslo.db:
assignee: Mike Bayer (zzzeek) → nobody
Eric Harney (eharney)
Changed in cinder:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/163551

Changed in cinder:
assignee: nobody → Eric Harney (eharney)
status: Confirmed → In Progress
Changed in cinder:
assignee: Eric Harney (eharney) → John Griffith (john-griffith)
Eric Harney (eharney)
Changed in cinder:
milestone: none → kilo-3
assignee: John Griffith (john-griffith) → Eric Harney (eharney)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/163551
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=2cc0dbe3936e21f52c27bc945e2d999c40526b32
Submitter: Jenkins
Branch: master

commit 2cc0dbe3936e21f52c27bc945e2d999c40526b32
Author: Eric Harney <email address hidden>
Date: Wed Mar 11 14:00:44 2015 -0400

    Fix sqlalchemy reuse in multi-backend children

    Multi-backend breaks because sqlalchemy objects
    are shared between child backends.

    When creating a new service (Base), call dispose
    to reset the sqlalchemy object and connection.

    Change-Id: I9d937c5b2fa850edc5523f26e031f59cad5a9e7e
    Closes-Bug: #1417018

Changed in cinder:
status: In Progress → Fix Committed
Changed in oslo.db:
milestone: none → next-kilo
Changed in oslo-incubator:
milestone: kilo-3 → liberty-1
Changed in oslo.db:
status: In Progress → Fix Committed
Changed in oslo.db:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
milestone: kilo-3 → 2015.1.0
no longer affects: oslo-incubator