While testing a TripleO environment, I stop the core resources (Galera, Redis and RabbitMQ), wait 20 minutes, and if the related services show no problems from a systemd perspective or from the cluster side (no failed actions), I restart those core services. If everything goes fine, the test is passed.
If that succeeds, I then run a second test, which is basically to deploy an instance. While trying to spawn a new instance, I get this error:
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'sqlalchemy.exc.OperationalError'> (HTTP 500) (Request-ID: req-71c32e29-7dae-4e63-bff8-b5a0610e0f2b)
As I wrote, the status of all the involved resources is active, but looking into the logs, the first component that seems to have problems is Keystone:
2016-11-17 14:38:52.366 185174 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
2016-11-17 14:38:52.372 185174 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "An unexpected error prevented the server from fulfilling your request.", "code": 500, "title": "Internal Server Error"}}
2016-11-17 14:38:52.372 185174 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Failed to fetch token data from identity server
2016-11-17 14:38:52.373 185174 INFO nova.osapi_compute.wsgi.server [-] 172.17.0.19 "GET /v2.1/servers/detail?all_tenants=True&changes-since=2016-11-17T14%3A18%3A52.302477%2B00%3A00&host=overcloud-novacompute-0 HTTP/1.1" status: 503 len: 375 time: 0.0635540
2016-11-17 14:48:52.907 185174 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
2016-11-17 14:48:52.907 185174 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "An unexpected error prevented the server from fulfilling your request.", "code": 500, "title": "Internal Server Error"}}
2016-11-17 14:48:52.907 185174 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Failed to fetch token data from identity server
This seems to lead to the following additional error, which is basically what I get on the console:
2016-11-17 14:48:52.908 185174 INFO nova.osapi_compute.wsgi.server [-] 172.17.0.19 "GET /v2.1/servers/detail?all_tenants=True&changes-since=2016-11-17T14%3A18%3A52.880129%2B00%3A00&host=overcloud-novacompute-1 HTTP/1.1" status: 503 len: 375 time: 0.0174298
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters [req-71c32e29-7dae-4e63-bff8-b5a0610e0f2b a24fe0ff114449b2a226002bc4087ad8 5b04fae1e53748c3bc3215e9687c1c7b - default default] DBAPIError exception wrapped from (pymysql.err.InternalError) (1927, u'Connection was killed') [SQL: u'SELECT 1']
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters context)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters cursor.execute(statement, parameters)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 146, in execute
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters result = self._query(query)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 296, in _query
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters conn.query(q)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 781, in query
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 942, in _read_query_result
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters result.read()
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1138, in read
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters first_packet = self.connection._read_packet()
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 906, in _read_packet
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters packet.check_error()
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 367, in check_error
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters err.raise_mysql_exception(self._data)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 120, in raise_mysql_exception
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters _check_mysql_exception(errinfo)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 115, in _check_mysql_exception
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters raise InternalError(errno, errorvalue)
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters InternalError: (1927, u'Connection was killed')
2016-11-17 14:52:45.826 185174 ERROR oslo_db.sqlalchemy.exc_filters
2016-11-17 14:52:45.829 185174 ERROR nova.api.openstack.extensions [req-71c32e29-7dae-4e63-bff8-b5a0610e0f2b a24fe0ff114449b2a226002bc4087ad8 5b04fae1e53748c3bc3215e9687c1c7b - default default] Unexpected exception in API method
This is not expected: even if we are stopping and then starting core services like Galera and RabbitMQ, all the services should be configured to survive this kind of outage.
OK, I can add a lot of info to this.
First off, the error here occurs when oslo.db "pings" the database at the start of a transaction; if the database is down, it reconnects at that point. The exception we are seeing is a normal event when the database has been restarted, and the Nova service is expected to resume against the new database.
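The ping-and-reconnect cycle can be sketched in plain Python. This is an illustration of the mechanism only, not oslo.db's actual code; `DisconnectError`, `Connection` and `checkout` are stand-in names:

```python
class DisconnectError(Exception):
    """Stand-in for the disconnect errors oslo.db knows how to recognize."""

class Connection:
    """Stand-in for a pooled DB connection that may have gone stale."""
    def __init__(self, alive=True):
        self.alive = alive

    def ping(self):
        # oslo.db issues a lightweight "SELECT 1" at transaction start
        if not self.alive:
            raise DisconnectError("server has gone away")

def checkout(pool_conn, reconnect):
    """Ping the checked-out connection; on a recognized disconnect,
    discard it and reconnect transparently."""
    try:
        pool_conn.ping()
        return pool_conn
    except DisconnectError:
        return reconnect()

stale = Connection(alive=False)
fresh = checkout(stale, reconnect=lambda: Connection(alive=True))
assert fresh.alive  # the caller never sees the disconnect
```

The whole scheme hinges on the `except` clause actually catching the error the driver raises; an unrecognized error type or code simply propagates, which is what happens in this bug.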
What is different here is that error code 1927 is not one of the error codes that either SQLAlchemy or oslo.db recognizes as indicating the client has been disconnected. The 1927 code is listed on MariaDB's site as a MariaDB-only error code (https://mariadb.com/kb/en/mariadb/mariadb-error-codes/), which already makes it very unusual: all the MySQL client-disconnect error codes are in the 2000 range.
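The classification gap can be shown with a minimal sketch. The code set below is an illustrative subset of what SQLAlchemy's MySQL dialect checks in its disconnect detection, not the complete or authoritative list:

```python
# Disconnects are recognized by matching known error codes; the MySQL
# client-side disconnect codes are all in the 2000 range, so MariaDB's
# server-side 1927 ("Connection was killed") falls through the check.
KNOWN_DISCONNECT_CODES = {2006, 2013, 2014, 2045, 2055}  # illustrative subset

def is_disconnect(errno):
    return errno in KNOWN_DISCONNECT_CODES

assert is_disconnect(2006)      # "MySQL server has gone away" -> reconnect
assert not is_disconnect(1927)  # MariaDB-only code -> surfaces as HTTP 500
```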
The PyMySQL driver is also surprised here, as it normally raises OperationalError, not InternalError, for a disconnect situation; it normally generates disconnect errors itself based on socket conditions, since the TCP connection is usually gone. In this case, however, the MariaDB server seems to still be communicating with the client while the state of the connection itself is "killed". I don't yet understand the mechanics of that.
We can patch both oslo.db and SQLAlchemy to detect this as a disconnect situation; most expediently we can do it in oslo.db first, which would allow the Nova service to reconnect appropriately after a database restart.
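A sketch of what such an oslo.db-side fix could look like: oslo.db reclassifies raw DBAPI errors via regex-based filters, and a new filter would map the 1927 message to a connection error so the engine invalidates its pool and reconnects. The names below are stand-ins for illustration, not the actual patch:

```python
import re

class DBConnectionError(Exception):
    """Stand-in for oslo_db.exception.DBConnectionError, which signals
    the engine to invalidate its pool and reconnect instead of failing."""

# MariaDB's "Connection was killed" code as it appears in the wrapped message
_CONNECTION_KILLED = re.compile(r"\(1927,")

def filter_dbapi_error(message):
    """Reclassify a raw DBAPI error message as a disconnect when it
    matches the MariaDB 1927 pattern; leave everything else fatal."""
    if _CONNECTION_KILLED.search(message):
        raise DBConnectionError(message)
    raise RuntimeError(message)

handled_as_disconnect = False
try:
    filter_dbapi_error("(pymysql.err.InternalError) (1927, 'Connection was killed')")
except DBConnectionError:
    handled_as_disconnect = True
assert handled_as_disconnect
```

With a filter like this in place, the `SELECT 1` ping at transaction start would raise a connection error rather than the unhandled InternalError seen in the traceback above, and the service would retry on a fresh connection.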