After MySQL connection failure (or timeout), first request reports MySQL Server has gone away, following requests work

Bug #943031 reported by justinsb on 2012-02-29
52
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Glance
High
Peng Yong
OpenStack Compute (nova)
High
Peng Yong
OpenStack Dashboard (Horizon)
High
Tres Henry
OpenStack Identity (keystone)
High
Peng Yong
neutron
High
Peng Yong
quantum (Ubuntu)
Undecided
Unassigned
Nominated for Precise by Yolanda Robla

Bug Description

After a prolonged period of inactivity, it looks like the MySQL connection times out but Keystone (Redux) does not reconnect

...
  File "/usr/lib/pymodules/python2.7/routes/middleware.py", line 131, in __call__
    response = self.app(environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 159, in __call__
    return resp(environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 159, in __call__
    return resp(environ, start_response)
  File "/usr/lib/pymodules/python2.7/routes/middleware.py", line 131, in __call__
    response = self.app(environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 159, in __call__
    return resp(environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 147, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 208, in call_func
    return self.func(req, *args, **kwargs)
  File "/opt/stack/keystone/keystone/common/wsgi.py", line 180, in __call__
    result = method(context, **params)
  File "/opt/stack/keystone/keystone/service.py", line 231, in authenticate
    context=context, user_name=username)
  File "/opt/stack/keystone/keystone/common/manager.py", line 34, in _wrapper
    return f(*args, **kw)
  File "/opt/stack/keystone/keystone/identity/backends/sql.py", line 188, in get_user_by_name
    return _filter_user(self._get_user_by_name(user_name))
  File "/opt/stack/keystone/keystone/identity/backends/sql.py", line 179, in _get_user_by_name
    user_ref = session.query(User).filter_by(name=user_name).first()
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 1652, in first
    ret = list(self[0:1])
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 1560, in __getitem__
    return list(res)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 1721, in __iter__
    return self._execute_and_instances(context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 1726, in _execute_and_instances
    mapper=self._mapper_zero_or_none())
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 724, in execute
    clause, params or {})
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1191, in execute
    params)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
    return self.__execute_context(context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context
    context.parameters[0], context=context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute
    context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1394, in _cursor_execute
    context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 299, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/pymodules/python2.7/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/pymodules/python2.7/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
OperationalError: (OperationalError) (2006, 'MySQL server has gone away') 'SELECT user.id AS user_id, user.name AS user_name, user.extra AS user_extra \nFROM user \nWHERE user.name = %s \n LIMIT 0, 1' ('demo',)

Jay Pipes (jaypipes) wrote :

Not sure if this is related to redux... Nati and I have seen it happen quite a bit in TryStack, and that is deployed with Diablo, not Essex... so this might be something vestigial?

Joseph Heck (heckj) wrote :

justinsb - could you let us know what specific version of keystone you saw this with (either commitish or date you checked it out or something)?

Changed in keystone:
status: New → Incomplete

I 've seen this on master recently, I think the commit was: https://github.com/openstack/keystone/commit/b69885596ba03d7e307082e902e541a85ca68fe0. Are you working on this? I think this should make into Essex, If nobody is working on this I can do it, borrowing some code from nova (see https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/session.py#L47 and https://github.com/openstack/nova/blob/master/nova/exception.py#L87) it should be quite easy.

Bye

As additional note this easily reproducilble just restarting MySQL, then next request will fail and the following will work. In my previous post I pontied to using wrap_db_error, which I think would be great, but probably not the best place to solve this. After researching a bit, it seems using SQLAlchemy pool (I think pool is only configured for SQLite and using sqlalchemy.pool.NullPool) with a suitable 'recycle' value.

Joseph Heck (heckj) wrote :

Confirming Rafael's details - it does reconnect, but the first attempt through it throwing a failure, subsequent queries work correctly.

summary: - Need auto-reconnect: MySQL Server has gone away
+ After MySQL connection failure (or timeout), first request reports MySQL
+ Server has gone away, following requests work
Changed in keystone:
status: Incomplete → Confirmed
importance: Undecided → Medium

Since it seems nobody is working so I will do it myself. I think two possible approach:
- Wrap DB calls and catch the error, retrying when we detect MySQL has gone away error. In addition it might be useful wait some time before retrying since in some cases, e.g.: server reboot, MySQL might not be available intermediately but after some time, so it can be worthy adding a flag how many seconds before re-trying.

-DB pool and suitable 'recycle' value should work too on long idle connections, but I'm not sure if this would fix the case where MySQL is rebooted too. In this case, it might be worthy too adding a flag, so DB pool class is configurable.

For now, I will borrow some code from Nova for wrapping DB calls since this can fix all situations and has some extra advantages. Any thought about this?

Changed in keystone:
assignee: nobody → Rafael Durán Castañeda (rafadurancastaneda)
Joseph Heck (heckj) wrote :

Rafael -

sounds like a good plan of attack - thank you for working on this!

You are welcomed Joseph. Wrapping query and flush (as Nova does) first approach is no working, I will try to find another way later today.

Download full text (10.5 KiB)

I've noticed same error on Glance, so I think thys might be affecting not just keystone but all "SQL enabled" services. Do I must report new bugs?

After restarting MySQL:

glance -A 9841b5bc284d478c97fad6ca6932efaf index

Failed to show index. Got error:
Internal Server error: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 336, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 147, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 208, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/glance/common/wsgi.py", line 279, in __call__
    response = req.get_response(self.application)
  File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1053, in get_response
    application, catch_exc_info=False)
  File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1022, in call_application
    app_iter = application(self.environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/keystone/middleware/auth_token.py", line 166, in __call__
    return self.app(env, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 147, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 208, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/glance/common/wsgi.py", line 279, in __call__
    response = req.get_response(self.application)
  File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1053, in get_response
    application, catch_exc_info=False)
  File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1022, in call_application
    app_iter = application(self.environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 159, in __call__
    return resp(environ, start_response)
  File "/usr/lib/pymodules/python2.7/routes/middleware.py", line 131, in __call__
    response = self.app(environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 159, in __call__
    return resp(environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 147, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 208, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/glance/common/wsgi.py", line 477, in __call__
    request, **action_args)
  File "/usr/lib/python2.7/dist-packages/glance/common/wsgi.py", line 494, in dispatch
    return method(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/glance/api/v1/images.py", line 135, in index
    images = registry.get_images_list(req.context, **params)
  File "/usr/lib/python2.7/dist-packages/glance/registry/__init__.py", line 103, in get_images_list
    return c.get_images(**kwargs)
  File "/usr/lib/python2.7/dist-packages/glance/registry/client.py", line 74, in get_images
    res = self.do_request...

Peng Yong (ppyy) wrote :

bug confirmed

Peng Yong (ppyy) wrote :
Peng Yong (ppyy) on 2012-03-11
Changed in keystone:
status: Confirmed → In Progress

Hi,

Just some comments since I was working on this too. So far I've just got the same solution as Peng Yong (no exactly the same implementation, but very close) and I don't like this "pessimistic ask permission approach"; however it seems to be the only working solution. Peng Yong I've saw your patches and your comments on git review, why don't you use just "select 1" instead of "select now()"?

Bye

Peng Yong (ppyy) wrote :

update the patch. thanks Rafael Durán Castañeda

Dolph Mathews (dolph) wrote :

Re-assigning based on progress in merge prop.

Changed in keystone:
assignee: Rafael Durán Castañeda (rafadurancastaneda) → Peng Yong (ppyy)
Changed in keystone:
milestone: none → essex-rc1
importance: Medium → High
Changed in nova:
importance: Undecided → High
assignee: nobody → Peng Yong (ppyy)
milestone: none → essex-rc1
status: New → In Progress

Reviewed: https://review.openstack.org/5213
Committed: http://github.com/openstack/nova/commit/42a96d7ca0d6dc1fe5c048e67d5649bf5b513d0d
Submitter: Jenkins
Branch: master

commit 42a96d7ca0d6dc1fe5c048e67d5649bf5b513d0d
Author: Peng Yong <email address hidden>
Date: Sun Mar 11 11:51:31 2012 +0800

    refactoring code, check connection in Listener. refer to Bug #943031

    Change-Id: Ic04d1a918fedd5fd71f58af8e27b58deef070ee2

Changed in nova:
status: In Progress → Fix Committed
dan wendlandt (danwent) on 2012-03-13
Changed in quantum:
status: New → In Progress
assignee: nobody → Peng Yong (ppyy)
milestone: none → essex-rc1
importance: Undecided → High

Reviewed: https://review.openstack.org/5212
Committed: http://github.com/openstack/keystone/commit/e65a22c43a7fe44621080cee01f394c90b54320d
Submitter: Jenkins
Branch: master

commit e65a22c43a7fe44621080cee01f394c90b54320d
Author: Peng Yong <email address hidden>
Date: Sun Mar 11 10:35:15 2012 +0800

    Bug #943031 MySQL Server has gone away
    added docnotes of error messages caught for mysql and reference

    Change-Id: I147b32193436be891e54e36c6adc1b16fda886d3

Changed in keystone:
status: In Progress → Fix Committed
Unmesh Gurjar (unmesh-gurjar) wrote :

Is this fix going to be merged in the 'milestone-proposed' branch?

Thierry Carrez (ttx) wrote :

The milestone-proposed branch will be rebased on the master branch at RC1 time, so yes.

Reviewed: https://review.openstack.org/5220
Committed: http://github.com/openstack/quantum/commit/189b89a9c87b1eae9f55e5b8c98cf907d3acd61d
Submitter: Jenkins
Branch: master

commit 189b89a9c87b1eae9f55e5b8c98cf907d3acd61d
Author: Peng Yong <email address hidden>
Date: Mon Mar 12 00:06:53 2012 +0800

    check connection in Listener. refer to Bug #943031

    Change-Id: I5a0c975ab7998627a213ac4c69c037e9e2d95bfa

Changed in quantum:
status: In Progress → Fix Committed
Devin Carlen (devcamcar) on 2012-03-16
Changed in horizon:
status: New → Confirmed
importance: Undecided → High
milestone: none → essex-rc1
assignee: nobody → Tres Henry (tres)
Tres Henry (tres) wrote :

Verified on Horizon. To repro I set wait_timeout in my.conf to 5 seconds and then tried to reproduce. Looks good.

Changed in horizon:
status: Confirmed → Fix Committed
Tres Henry (tres) wrote :

I can still verify the bug against Glance. Adding Glance to the also affects list.

Fix proposed to branch: master
Review: https://review.openstack.org/5486

Changed in glance:
assignee: nobody → Peng Yong (ppyy)
status: New → In Progress
Peng Yong (ppyy) wrote :
Peng Yong (ppyy) wrote :

@tres, i have't find fix for horizon, can you give me the git version?

Tres Henry (tres) wrote :

No fix required for Horizon since it no longer requires MySQL, however, since Horizon is downstream from the rest of OpenStack it surfaced this bug across multiple services which is why it's in the affected list.

Thierry Carrez (ttx) on 2012-03-19
Changed in quantum:
status: Fix Committed → Fix Released
Jay Pipes (jaypipes) on 2012-03-19
Changed in glance:
milestone: none → essex-rc1
importance: Undecided → High

Reviewed: https://review.openstack.org/5486
Committed: http://github.com/openstack/glance/commit/1f0c4ba67aced9a7ad857cddc0602b243bf38a10
Submitter: Jenkins
Branch: master

commit 1f0c4ba67aced9a7ad857cddc0602b243bf38a10
Author: Peng Yong <email address hidden>
Date: Mon Mar 19 00:23:03 2012 +0800

    check connection in Listener. refer to Bug #943031

    Change-Id: I78ec4304a4122c2854799d42737ba3b7bbb76137

Changed in glance:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2012-03-20
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-03-20
Changed in horizon:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-03-21
Changed in glance:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-03-23
Changed in keystone:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-04-05
Changed in glance:
milestone: essex-rc1 → 2012.1
Thierry Carrez (ttx) on 2012-04-05
Changed in keystone:
milestone: essex-rc1 → 2012.1
Thierry Carrez (ttx) on 2012-04-05
Changed in horizon:
milestone: essex-rc1 → 2012.1
Thierry Carrez (ttx) on 2012-04-05
Changed in nova:
milestone: essex-rc1 → 2012.1
Thierry Carrez (ttx) on 2012-04-05
Changed in quantum:
milestone: essex-rc1 → 2012.1
Changed in quantum (Ubuntu):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers