Cinder encounters dbapi error: NoSuchColumnError: "Could not locate column in row for column '%(140070887032400 anon)s.volumes_id'"

Bug #1417018 reported by Sam Wan
This bug affects 7 people
Affects    Status         Importance   Assigned to
Cinder     Fix Released   High         Eric Harney
oslo.db    Fix Released   Medium       Unassigned

Bug Description

2015-02-02 01:20:37.756 6815 CRITICAL cinder [req-7b1344bc-879c-4576-abab-be40294edc2f - - - - -] NoSuchColumnError: "Could not locate column in row for column '%(140070887032400 anon)s.volumes_id'"
2015-02-02 01:20:37.756 6815 TRACE cinder Traceback (most recent call last):
2015-02-02 01:20:37.756 6815 TRACE cinder File "/usr/local/bin/cinder-volume", line 9, in <module>
2015-02-02 01:20:37.756 6815 TRACE cinder load_entry_point('cinder==2015.1.dev310', 'console_scripts', 'cinder-volume')()
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/cmd/volume.py", line 69, in main
2015-02-02 01:20:37.756 6815 TRACE cinder binary='cinder-volume')
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/service.py", line 243, in create
2015-02-02 01:20:37.756 6815 TRACE cinder service_name=service_name)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/service.py", line 127, in __init__
2015-02-02 01:20:37.756 6815 TRACE cinder *args, **kwargs)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/volume/manager.py", line 188, in __init__
2015-02-02 01:20:37.756 6815 TRACE cinder context.get_admin_context())
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/volume/manager.py", line 262, in _set_voldb_empty_at_startup_indicator
2015-02-02 01:20:37.756 6815 TRACE cinder None, filters=None)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/db/api.py", line 199, in volume_get_all
2015-02-02 01:20:37.756 6815 TRACE cinder filters=filters)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/db/sqlalchemy/api.py", line 156, in wrapper
2015-02-02 01:20:37.756 6815 TRACE cinder return f(*args, **kwargs)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/opt/stack/new/cinder/cinder/db/sqlalchemy/api.py", line 1207, in volume_get_all
2015-02-02 01:20:37.756 6815 TRACE cinder return query.all()
2015-02-02 01:20:37.756 6815 TRACE cinder File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2320, in all
2015-02-02 01:20:37.756 6815 TRACE cinder return list(self)
2015-02-02 01:20:37.756 6815 TRACE cinder File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/loading.py", line 73, in instances
2015-02-02 01:20:37.756 6815 TRACE cinder rows = [process[0](row, None) for row in fetch]
2015-02-02 01:20:37.756 6815 TRACE cinder File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/loading.py", line 363, in _instance
2015-02-02 01:20:37.756 6815 TRACE cinder tuple([row[column] for column in pk_cols])
2015-02-02 01:20:37.756 6815 TRACE cinder File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 331, in _key_fallback
2015-02-02 01:20:37.756 6815 TRACE cinder expression._string_or_unprintable(key))
2015-02-02 01:20:37.756 6815 TRACE cinder NoSuchColumnError: "Could not locate column in row for column '%(140070887032400 anon)s.volumes_id'"
2015-02-02 01:20:37.756 6815 TRACE cinder

A detailed log can be found at http://publiclogs.emc.com/vnx_ostack/EMC_VNX_ISCSI/493/logs/screen-c-vol.txt.gz

# pip list|grep -i mysql-python
MySQL-python (1.2.5)

# pip list|grep cinder
cinder (2015.1.dev310, /opt/stack/new/cinder)

Revision history for this message
Mike Bayer (zzzeek) wrote :

Can I get any insight as to under what circumstances this occurs, how often, and what would be the best way for me to reproduce it? In an ideal world, a SQL log and/or the specific version in which this started happening would also help. Thanks.

Revision history for this message
Mike Bayer (zzzeek) wrote :

A plain run of cinder-volume on a fresh devstack doesn't fail in this way for me; sample logging is at http://paste.openstack.org/show/175492/. I'm asking on IRC what additional configuration is needed to produce this error, as the log output on this bug report looks very different from mine.

Revision history for this message
Mike Bayer (zzzeek) wrote :

OK, I have it reproducing, and this is the worst issue possible: Cinder is sharing a single MySQL connection between two greenlets. This is an extremely severe problem which I hope is not present in other OpenStack apps.

I've added extra logging to SQLAlchemy that records thread.get_ident() plus the MySQL connection identifier, and we can see clear overlap of statements on the same connection in two different threads here: http://paste.openstack.org/show/175571/
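
For anyone wanting to try similar instrumentation, here is a rough sketch (not the actual patch used for the paste above; the logger and function names are illustrative) using SQLAlchemy's before_cursor_execute hook to tag every statement with the current thread id and the id of the underlying DBAPI connection:

# Sketch only: make interleaving of statements on a shared DBAPI connection
# visible by logging which thread issues each statement on which connection.
import logging
import thread  # Python 2.7, matching this environment

from sqlalchemy import event
from sqlalchemy.engine import Engine

log = logging.getLogger("sqlalchemy.trace")


@event.listens_for(Engine, "before_cursor_execute")
def trace_statement(conn, cursor, statement, parameters, context, executemany):
    # conn.connection.connection is the raw MySQLdb connection object
    log.debug("thread=%s dbapi_conn=%s stmt=%s",
              thread.get_ident(), id(conn.connection.connection), statement)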

Revision history for this message
Mike Bayer (zzzeek) wrote :

OK, found the bug.

Cinder is spawning a subprocess using oslo.concurrency and is failing to call create_engine() local to that process, or to dispose the existing engine so that it makes new connections. You *cannot* share a SQLAlchemy connection pool across processes, because they store file descriptors. Calling engine.dispose() when the child fork begins would be sufficient.
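
As a rough sketch of that advice (the module layout and engine URL here are hypothetical, not Cinder's actual code), disposing the inherited pool at the start of a forked child looks like this:

# Sketch only: a forked child must not reuse the parent's pooled DBAPI
# connections, since they wrap shared file descriptors; disposing the pool
# makes SQLAlchemy open fresh connections on demand in the child.
import os

from sqlalchemy import create_engine

engine = create_engine("mysql://user:password@localhost/cinder")  # hypothetical URL

pid = os.fork()
if pid == 0:
    # child process: drop every connection inherited from the parent
    engine.dispose()
    # any query issued from here on gets a brand-new connection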

Am going to add an error condition to oslo.db and will try to poke around cinder to see if there's some obvious way to patch this.

Revision history for this message
Mike Bayer (zzzeek) wrote :

Steps to reproduce:

1. With a devstack setup, make sure Cinder has some extra services turned on (I think Ceph is the important one, but this is what worked for me):

[[local|localrc]]
ADMIN_PASSWORD=secrete
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
SERVICE_TOKEN=a682f596-76f3-11e3-b3b2-e716f9080d50
FIXED_RANGE=172.31.1.0/24
FLOATING_RANGE=10.0.0.0/24
HOST_IP=192.168.1.221

# Services {{{1

## prerequisites
ENABLED_SERVICES=rabbit,mysql,key

## ceph
ENABLED_SERVICES+=,ceph
TEMPEST_STORAGE_PROTOCOL=ceph

## nova
ENABLED_SERVICES+=,n-api,n-crt,n-cpu,n-cond,n-sch,n-net

## glance
ENABLED_SERVICES+=,g-api,g-reg

## cinder
ENABLED_SERVICES+=,cinder,c-api,c-sch,c-vol,c-bak
CINDER_ENABLED_BACKENDS=lvm:lvm-1,lvm:lvm-2,ceph:ceph
CINDER_SECURE_DELETE=False

2. Run stack.sh fully so that everything is running.

3. Kill off cinder-volume:

sudo killall cinder-volume

4. Apply this patch to Cinder:

diff --git a/cinder/volume/manager.py b/cinder/volume/manager.py
index e488ecd..24b9619 100644
--- a/cinder/volume/manager.py
+++ b/cinder/volume/manager.py
@@ -287,7 +287,8 @@ class VolumeManager(manager.SchedulerDependentManager):
             # to initialize the driver correctly.
             return

-        volumes = self.db.volume_get_all_by_host(ctxt, self.host)
+        for i in range(100):
+            volumes = self.db.volume_get_all_by_host(ctxt, self.host)
         # FIXME volume count for exporting is wrong
         LOG.debug("Re-exporting %s volumes" % len(volumes))

All we're doing here is cranking on that SQL query repeatedly so that it collides with the subprocess.

5. Run cinder from the console:

/usr/bin/cinder-volume --config-file /etc/cinder/cinder.conf

Revision history for this message
Mike Bayer (zzzeek) wrote :

Normally I like to beat people over the head with "please call engine.dispose() in a subprocess!!", but there's a better way, which is to just force a reconnect in the pool event. So I can get oslo.db to work around this.
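
A sketch of that pool-event idea, adapted from the SQLAlchemy multiprocessing recipe (the function name is illustrative and this is not the literal oslo.db patch): record the PID that created each connection and refuse to check it out from a different PID, so the pool reconnects after a fork.

# Sketch only: tag each new DBAPI connection with the creating PID, and on
# checkout discard connections that were created in another process, forcing
# the pool to open a fresh connection in the forked child.
import os

from sqlalchemy import event, exc


def add_process_guards(engine):
    @event.listens_for(engine, "connect")
    def on_connect(dbapi_connection, connection_record):
        connection_record.info['pid'] = os.getpid()

    @event.listens_for(engine, "checkout")
    def on_checkout(dbapi_connection, connection_record, connection_proxy):
        pid = os.getpid()
        if connection_record.info['pid'] != pid:
            # detach rather than close, so the parent's descriptor is untouched
            connection_record.connection = connection_proxy.connection = None
            raise exc.DisconnectionError(
                "Connection record belongs to pid %s, attempting to check out "
                "in pid %s" % (connection_record.info['pid'], pid))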

Mike Bayer (zzzeek)
Changed in cinder:
assignee: nobody → Mike Bayer (zzzeek)
Revision history for this message
Mike Bayer (zzzeek) wrote :

the review at https://review.openstack.org/#/c/156725/ will resolve this from the oslo.db side.

Revision history for this message
Victor Stinner (vstinner) wrote :

"Cinder is spawning a subprocess using oslo.concurrency and is failing to call create_engine() local to that process, or to dispose the existing engine so that it makes new connections."

I don't understand how a child process can share data with a parent process if it's simply started with subprocess.Popen. The oslo.concurrency module does the right thing: it calls Popen with close_fds=True, so file descriptors cannot be inherited by the child process.

This issue looks more like a funny bug related to fork. In the traceback of the following email, I see that the evil Service class is used.
http://lists.openstack.org/pipermail/openstack-dev/2015-February/057184.html

The Service class uses os.fork().

IMO the whole issue comes from the crazy design of the Service class: instantiate the application, fork, then do some hacks to work around fork issues. It would be safer to fork and *then* instantiate the class.
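
Purely for illustration (ToyService is a hypothetical stand-in, not the real Service class), the two orderings look like this:

# Sketch: in the first pattern the engine and its pool already exist when
# fork() happens, so the child inherits them; in the second, each child
# builds its own service, and therefore its own engine, after fork().
import os

from sqlalchemy import create_engine


class ToyService(object):
    def __init__(self):
        # instantiating the service creates its engine, and thus its pool
        self.engine = create_engine("sqlite://")  # hypothetical URL

    def run(self):
        self.engine.connect().close()


def instantiate_then_fork():
    service = ToyService()      # engine/pool created before fork()
    if os.fork() == 0:
        service.run()           # child reuses the parent's pool
        os._exit(0)


def fork_then_instantiate():
    if os.fork() == 0:
        ToyService().run()      # child creates its own engine/pool
        os._exit(0)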

Maybe I'm completly wrong, and fork is unrelated to the issue.

Revision history for this message
Mike Bayer (zzzeek) wrote :

Okey doke, as Doug said, it's the fork() in common/service.py; I didn't search too hard to see where the actual child pids were coming from. So, revising my comment:

"OK , found the bug.

"Cinder is spawning a *child* process using **fork() in cinder/opentack/common/service.py** and is failing to call create_engine() local to that process, or to dispose the existing engine so that it makes new connections. You *cannot* share a SQLAlchemy connection pool across processes, because they store file descriptors. Calling engine.dispose() when the child fork begins would be sufficient."

Revision history for this message
Viktor Serhieiev (vsergeyev) wrote :
Changed in oslo.db:
assignee: nobody → Mike Bayer (zzzeek)
status: New → In Progress
importance: Undecided → Medium
Revision history for this message
Doug Hellmann (doug-hellmann) wrote :

Related fix to service code in the incubator: https://review.openstack.org/157608

Changed in oslo-incubator:
milestone: none → kilo-3
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Mike Bayer (zzzeek) wrote :

The fixes here are known, though I'm not sure what approach oslo.db will take. Cinder developers should likely ensure that the engine is refreshed within their child fork to fix the issue immediately. Note also related issues such as https://bugs.launchpad.net/cinder/+bug/1430859.

Changed in cinder:
assignee: Mike Bayer (zzzeek) → nobody
Changed in oslo.db:
assignee: Mike Bayer (zzzeek) → nobody
Eric Harney (eharney)
Changed in cinder:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/163551

Changed in cinder:
assignee: nobody → Eric Harney (eharney)
status: Confirmed → In Progress
Changed in cinder:
assignee: Eric Harney (eharney) → John Griffith (john-griffith)
Eric Harney (eharney)
Changed in cinder:
milestone: none → kilo-3
assignee: John Griffith (john-griffith) → Eric Harney (eharney)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/163551
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=2cc0dbe3936e21f52c27bc945e2d999c40526b32
Submitter: Jenkins
Branch: master

commit 2cc0dbe3936e21f52c27bc945e2d999c40526b32
Author: Eric Harney <email address hidden>
Date: Wed Mar 11 14:00:44 2015 -0400

    Fix sqlalchemy reuse in multi-backend children

    Multi-backend breaks because sqlalchemy objects
    are shared between child backends.

    When creating a new service (Base), call dispose
    to reset the sqlalchemy object and connection.

    Change-Id: I9d937c5b2fa850edc5523f26e031f59cad5a9e7e
    Closes-Bug: #1417018

Changed in cinder:
status: In Progress → Fix Committed
Changed in oslo.db:
milestone: none → next-kilo
Changed in oslo-incubator:
milestone: kilo-3 → liberty-1
Changed in oslo.db:
status: In Progress → Fix Committed
Changed in oslo.db:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
milestone: kilo-3 → 2015.1.0
no longer affects: oslo-incubator