Unexpected API Error from nova-api (keystone related) after restarting galera and rabbit

Bug #1642944 reported by Raoul Scarazzini
This bug affects 1 person
Affects: tripleo
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

While testing a TripleO environment I stop the core resources (Galera, Redis and RabbitMQ), wait 20 minutes, and check that the related services report no problems, either from a systemd perspective or from the cluster side (no failed actions). I then restart those core services; if everything comes back cleanly, the test passes. A rough sketch of this cycle is below.
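
For reference, here is a hypothetical automation of the cycle just described. The pacemaker resource names, the pcs invocations, the timings and the openstack server create arguments are assumptions about this particular deployment, not details taken from the report:

# Hypothetical sketch of the restart test described above. Resource names,
# timings and CLI arguments are assumptions about the environment.
import subprocess
import time

CORE_RESOURCES = ["galera", "redis", "rabbitmq"]  # assumed pacemaker resource names


def run(cmd):
    print("+ " + " ".join(cmd))
    return subprocess.check_output(cmd).decode("utf-8")


def core_restart_test():
    # 1. Stop the core resources through pacemaker.
    for res in CORE_RESOURCES:
        run(["pcs", "resource", "disable", res])

    # 2. Wait 20 minutes, then make sure the cluster reports no failed actions.
    time.sleep(20 * 60)
    status = run(["pcs", "status"])
    assert "Failed Actions" not in status, status

    # 3. Bring the core resources back.
    for res in CORE_RESOURCES:
        run(["pcs", "resource", "enable", res])

    # 4. Second test: spawning an instance should still work afterwards.
    #    Flavor and image names are placeholders for whatever the overcloud provides.
    run(["openstack", "server", "create",
         "--flavor", "m1.small", "--image", "cirros", "--wait",
         "restart-smoke-test"])
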
When that first test succeeds, I run a second test, which is simply deploying an instance. While trying to spawn a new instance, I get this error:

ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'sqlalchemy.exc.OperationalError'> (HTTP 500) (Request-ID: req-71c32e29-7dae-4e63-bff8-b5a0610e0f2b)

As I wrote, all the involved resources report an active status, but looking into the logs the first thing that seems to have problems is keystone:

2016-11-17 14:38:52.366 185174 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
2016-11-17 14:38:52.372 185174 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "An unexpected error prevented the server from fulfilling your request.", "code": 500, "title": "Internal Server Error"}}
2016-11-17 14:38:52.372 185174 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Failed to fetch token data from identity server
2016-11-17 14:38:52.373 185174 INFO nova.osapi_compute.wsgi.server [-] 172.17.0.19 "GET /v2.1/servers/detail?all_tenants=True&changes-since=2016-11-17T14%3A18%3A52.302477%2B00%3A00&host=overcloud-novacompute-0 HTTP/1.1" status: 503 len: 375 time: 0.0635540
2016-11-17 14:48:52.907 185174 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
2016-11-17 14:48:52.907 185174 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "An unexpected error prevented the server from fulfilling your request.", "code": 500, "title": "Internal Server Error"}}
2016-11-17 14:48:52.907 185174 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Failed to fetch token data from identity server

This seems to lead to the additional error below, which is basically what I get on the console:

2016-11-17 14:48:52.908 185174 INFO nova.osapi_compute.wsgi.server [-] 172.17.0.19 "GET /v2.1/servers/detail?all_tenants=True&changes-since=2016-11-17T14%3A18%3A52.880129%2B00%3A00&host=overcloud-novacompute-1 HTTP/1.1" status: 503 len: 375 time: 0.0174298
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters [req-71c32e29-7dae-4e63-bff8-b5a0610e0f2b a24fe0ff114449b2a226002bc4087ad8 5b04fae1e53748c3bc3215e9687c1c7b - default default] DBAPIError exception wrapped from (pymysql.err.InternalError) (1927, u'Connection was killed') [SQL: u'SELECT 1']
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters context)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters cursor.execute(statement, parameters)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 146, in execute
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters result = self._query(query)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 296, in _query
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters conn.query(q)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 781, in query
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 942, in _read_query_result
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters result.read()
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1138, in read
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters first_packet = self.connection._read_packet()
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 906, in _read_packet
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters packet.check_error()
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 367, in check_error
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters err.raise_mysql_exception(self._data)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 120, in raise_mysql_exception
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters _check_mysql_exception(errinfo)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 115, in _check_mysql_exception
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters raise InternalError(errno, errorvalue)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters InternalError: (1927, u'Connection was killed')
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters
2016-11-17 14:52:45.829 185174 ERROR nova.api.openstack.extensions [req-71c32e29-7dae-4e63-bff8-b5a0610e0f2b a24fe0ff114449b2a226002bc4087ad8 5b04fae1e53748c3bc3215e9687c1c7b - default default] Unexpected exception in API method

This is not expected: even if we stop and then start core services like galera and rabbitmq, all the services must be configured to survive this kind of outage.

Revision history for this message
Raoul Scarazzini (rasca) wrote :
Revision history for this message
Mike Bayer (zzzeek) wrote :

OK I can add a lot of info to this.

First off, the error here occurs when oslo.db "pings" the database at the start of a transaction; if the database turns out to be down at that point, it reconnects. The exception we are seeing is a normal event when the database has been restarted, and the nova service is expected to resume against the new database. A sketch of this ping pattern is below.
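
For readers following along, this is roughly the documented SQLAlchemy "pessimistic disconnect handling" recipe that the oslo.db ping listener resembles; a minimal sketch, not oslo.db's actual code, with the engine URL assumed:

# Minimal sketch of the ping-and-reconnect pattern (SQLAlchemy's documented
# pessimistic disconnect recipe); not oslo.db's actual implementation.
from sqlalchemy import create_engine, event, exc, select

engine = create_engine("mysql+pymysql://nova:secret@db-vip/nova")  # assumed URL


@event.listens_for(engine, "engine_connect")
def ping_connection(connection, branch):
    if branch:
        # "Branch" connections share a parent connection; skip the ping.
        return
    try:
        connection.scalar(select([1]))
    except exc.DBAPIError as err:
        if err.connection_invalidated:
            # SQLAlchemy recognized the error as a disconnect and refreshed
            # the pool, so retrying the ping gets a brand-new connection.
            connection.scalar(select([1]))
        else:
            # This is the path hit in the traceback above: error code 1927 is
            # not on the known-disconnect list, so the exception just propagates.
            raise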

What is different here is that error code 1927 is not among the error codes that either SQLAlchemy or oslo.db recognizes as indicating the client has been disconnected. The 1927 code is listed on MariaDB's site as a MariaDB-only error code, which already makes it unusual (https://mariadb.com/kb/en/mariadb/mariadb-error-codes/); all the MySQL client-disconnect error codes are in the 2000 range.

The PyMySQL driver is also surprised here, since it normally raises OperationalError, not InternalError, for a disconnect. It usually generates disconnect errors itself based on socket conditions, because the TCP connection is normally gone; in this case, however, the mariadb server still seems to be communicating with the client while the connection itself is in a "killed" state. I don't yet understand the mechanics of that.

We can patch both oslo.db and SQLAlchemy to detect this as a disconnect situation; most expediently we can do it in oslo.db first, which would allow the nova service to reconnect appropriately once the database is back. A sketch of what teaching the driver layer about 1927 could look like is below.
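
Purely as an illustration of "detect this as a disconnect" (the real fix lands in oslo.db's exception filters, see the review linked in the next comment), here is a hedged sketch at the SQLAlchemy dialect level; the subclass name and the registration are made up for the example:

# Hedged sketch only: teach the pymysql dialect that MariaDB's 1927
# ("Connection was killed") means the connection is dead. The actual fix is
# done in oslo.db's exception filters, not like this.
from sqlalchemy.dialects import registry
from sqlalchemy.dialects.mysql.pymysql import MySQLDialect_pymysql


class MariaDBAwareDialect(MySQLDialect_pymysql):
    def is_disconnect(self, e, connection, cursor):
        # Keep the stock checks (the 2006, 2013, ... client-side codes).
        if super(MariaDBAwareDialect, self).is_disconnect(e, connection, cursor):
            return True
        # MariaDB-only server-side code 1927 is not in the stock list.
        return getattr(e, "args", (None,))[0] == 1927


# Hypothetical registration so "mysql+mariadbping://..." URLs pick up the subclass.
registry.register("mysql.mariadbping", __name__, "MariaDBAwareDialect")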

Revision history for this message
Mike Bayer (zzzeek) wrote :

I have a review up for oslo.db at https://review.openstack.org/#/c/400269/.

Revision history for this message
Mike Bayer (zzzeek) wrote :

What is not understood yet:

1. The console error was <class 'sqlalchemy.exc.OperationalError'> (HTTP 500) (Request-ID: req-71c32e29-7dae-4e63-bff8-b5a0610e0f2b). What is that OperationalError? A different DB error on the keystone side? Is there any OperationalError in the keystone logs?

2. How is it that nova is still connected to the mariadb server at all, such that it gets a "Connection was killed" error back?

Revision history for this message
Raoul Scarazzini (rasca) wrote :

Even though I don't have an answer to the two questions above, I can say that Mike's patch solves the problem.
I tested it in the same way described in the bug, and the problem did not happen again.

Revision history for this message
Mike Bayer (zzzeek) wrote :

OK, here is another variant of the error. PyMySQL is pretty raw about the connection, and we've now observed the packet sequence getting messed up. I'm not sure what we can do other than add this one to the "disconnect" filter as well, since if the packets are out of sequence the connection is certainly not usable (a sketch of the kind of check involved follows the traceback below):

2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters [req-cf6d524d-daa9-4ea3-b191-9992116870a4 bad22650507e4d21bc9b68ae1eafa6d8 d8334f01bac249279754c5e618982347 - default default] DBAPIError exception wrapped from (pymysql.err.InternalError) Packet sequence number wrong - got 0 expected 1 [SQL: u'SELECT 1']
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters context)
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters cursor.execute(statement, parameters)
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 166, in execute
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters result = self._query(query)
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 322, in _query
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters conn.query(q)
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 841, in query
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1029, in _read_query_result
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters result.read()
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1312, in read
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters first_packet = self.connection._read_packet()
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 978, in _read_packet
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters (packet_number, self._next_seq_id))
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters InternalError: Packet sequence number wrong - got 0 expected 1
2016-12-14 12:49:44.362 155213 ERROR oslo_db.sqlalchemy.exc_filters
2016-12-14 12:49:44.364 155213 ERROR nova.api.openstack.extensions [req-cf6d524d-daa9-4ea3-b191-9992116870a4 bad22650507e4d21bc9b68ae1eafa6d8 d8334f01bac24927975...
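
To show the shape of the check involved: unlike the 1927 case, this InternalError carries no numeric errno, only a message, so matching has to be text-based. A hedged, standalone sketch, not oslo.db's code; the actual change is the follow-up oslo.db review linked in the next comment:

# Hedged sketch of message-based disconnect detection for this variant; the
# real change lives in oslo.db's exception filters, not in application code.
import re

_DISCONNECT_MESSAGES = (
    re.compile(r"Packet sequence number wrong"),   # the variant in this comment
    re.compile(r"Connection was killed"),          # the original 1927 variant
)


def looks_like_disconnect(dbapi_error):
    """Return True if the PyMySQL error should be treated as a dead connection."""
    message = str(dbapi_error)
    return any(pattern.search(message) for pattern in _DISCONNECT_MESSAGES)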

Revision history for this message
Raoul Scarazzini (rasca) wrote :

OK, with this new review, https://review.openstack.org/#/c/410881/, the problem did not happen again in the further tests I've done, so we can say it solves the issue.

Revision history for this message
Emilien Macchi (emilienm) wrote :

This bug was last updated over 180 days ago. As tripleo is a fast-moving project and we'd like to get the tracker down to currently actionable bugs, this is being marked as Invalid. If the issue still exists, please feel free to reopen it.

Changed in tripleo:
status: New → Invalid