Race in mysql TLS enablement, charm has queued hook to update certificates but principal charm attempts DB connection anyway

Bug #1918953 reported by Michael Skalka
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL InnoDB Cluster Charm
Fix Released
High
David Ames
MySQL Router Charm
Invalid
Undecided
Unassigned

Bug Description

As seen in this test run: https://solutions.qa.canonical.com/testruns/testRun/1ca249f8-c359-4a39-9900-a78102891f36
Crashdump: https://oil-jenkins.canonical.com/artifacts/1ca249f8-c359-4a39-9900-a78102891f36/generated/generated/openstack/juju-crashdump-openstack-2021-03-12-05.57.42.tar.gz
Bundle: https://oil-jenkins.canonical.com/artifacts/1ca249f8-c359-4a39-9900-a78102891f36/generated/generated/openstack/bundle.yaml

On Ussuri Focal, nova-cc/2 fails with "shared-db-relation-changed" due to a certificate mismatch:

2021-03-12 05:52:52 ERROR juju-log shared-db:195: list_cells failed
b'An error has occurred:\nTraceback (most recent call last):\n File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 599, in connect\n self._request_authentication()\n File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 803, in _
request_authentication\n self._sock = self.ctx.wrap_socket(self._sock, server_hostname=self.host)\n File "/usr/lib/python3/dist-packages/eventlet/green/ssl.py", line 449, in wrap_socket\n return GreenSSLSocket(sock, *a, _context=self, **kw)\n File "/usr/
lib/python3/dist-packages/eventlet/green/ssl.py", line 139, in __init__\n self.do_handshake()\n File "/usr/lib/python3/dist-packages/eventlet/green/ssl.py", line 311, in do_handshake\n return self._call_trampolining(\n File "/usr/lib/python3/dist-package
s/eventlet/green/ssl.py", line 161, in _call_trampolining\n return func(*a, **kw)\n File "/usr/lib/python3.8/ssl.py", line 1309, in do_handshake\n self._sslobj.do_handshake()\nssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate veri
fy failed: self signed certificate in certificate chain (_ssl.c:1123)\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 2276, in _
wrap_pool_connect\n return fn()\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 303, in unique_connection\n return _ConnectionFairy._checkout(self)\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 760, in _check
out\n fairy = _ConnectionRecord.checkout(pool)\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 492, in checkout\n rec = pool._do_get()\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/impl.py", line 139, in _do_get\n self._
dec_overflow()\n File "/usr/lib/python3/dist-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__\n compat.reraise(exc_type, exc_value, exc_tb)\n File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 153, in reraise\n raise value\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/impl.py", line 136, in _do_get\n return self._create_connection()\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 308, in _create_connection\n return _ConnectionRecord(self)\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 437, in __init__\n self.__connect(first_connect_check=True)\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 639, in __connect\n connection = pool._invoke_creator(self)\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/strategies.py", line 114, in connect\n return dialect.connect(*cargs, **cparams)\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 482, in connect\n return self.dbapi.connect(*cargs, **cparams)\n File "/usr/lib/python3/dist-packages/pymysql/__init__.py", line 94, in Connect\n return Connection(*args, **kwargs)\n File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 325, in __init__\n self.connect()\n File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 630, in connect\n raise exc\npymysql.err.OperationalError: (2003, "Can\'t connect to MySQL server on \'127.0.0.1\' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))")\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File "/usr/lib/python3/dist-packages/nova/cmd/manage.py", line 2777, in main\n ret = fn(*fn_args, **fn_kwargs)\n File "/usr/lib/python3/dist-packages/nova/cmd/manage.py", line 1207, in list_cells\n cell_mappings = objects.CellMappingList.get_all(\n File "/usr/lib/python3/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper\n result = fn(cls, context, *args, **kwargs)\n File "/usr/lib/python3/dist-packages/nova/objects/cell_mapping.py", line 256, in get_all\n db_mappings = cls._get_all_from_db(context)\n File "/usr/lib/python3/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 1013, in wrapper\n with self._transaction_scope(context):\n File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__\n return next(self.gen)\n File "/usr/lib/python3/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 1058, in _transaction_scope\n with current._produce_block(\n File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__\n return next(self.gen)\n File "/usr/lib/python3/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 657, in _session\n self.session = self.factory._create_session(\n File "/usr/lib/python3/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 418, in _create_session\n self._start()\n File "/usr/lib/python3/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 508, in _start\n self._setup_for_connection(\n File "/usr/lib/python3/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 533, in _setup_for_connection\n engine = engines.create_engine(\n File "/usr/lib/python3/dist-packages/debtcollector/renames.py", line 43, in decorator\n return wrapped(*args, **kwargs)\n File "/usr/lib/python3/dist-packages/oslo_db/sqlalchemy/engines.py", line 201, in create_engine\n test_conn = _test_connection(engine, max_retries, retry_interval)\n File "/usr/lib/python3/dist-packages/oslo_db/sqlalchemy/engines.py", line 384, in _test_connection\n raise de_ref\n File "/usr/lib/python3/dist-packages/oslo_db/sqlalchemy/engines.py", line 376, in _test_connection\n return engine.connect()\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 2209, in connect\n return self._connection_cls(self, **kwargs)\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 103, in __init__\n else engine.raw_connection()\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 2306, in raw_connection\n return self._wrap_pool_connect(\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 2279, in _wrap_pool_connect\n Connection._handle_dbapi_exception_noconnection(\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1545, in _handle_dbapi_exception_noconnection\n util.raise_from_cause(newraise, exc_info)\n File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause\n reraise(type(exception), exception, tb=exc_tb, cause=cause)\n File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 152, in reraise\n raise value.with_traceback(tb)\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 2276, in _wrap_pool_connect\n return fn()\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 303, in unique_connection\n return _ConnectionFairy._checkout(self)\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 760, in _checkout\n fairy = _ConnectionRecord.checkout(pool)\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 492, in checkout\n rec = pool._do_get()\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/impl.py", line 139, in _do_get\n self._dec_overflow()\n File "/usr/lib/python3/dist-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__\n compat.reraise(exc_type, exc_value, exc_tb)\n File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 153, in reraise\n raise value\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/impl.py", line 136, in _do_get\n return self._create_connection()\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 308, in _create_connection\n return _ConnectionRecord(self)\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 437, in __init__\n self.__connect(first_connect_check=True)\n File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 639, in __connect\n connection = pool._invoke_creator(self)\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/strategies.py", line 114, in connect\n return dialect.connect(*cargs, **cparams)\n File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 482, in connect\n return self.dbapi.connect(*cargs, **cparams)\n File "/usr/lib/python3/dist-packages/pymysql/__init__.py", line 94, in Connect\n return Connection(*args, **kwargs)\n File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 325, in __init__\n self.connect()\n File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 630, in connect\n raise exc\noslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2003, "Can\'t connect to MySQL server on \'127.0.0.1\' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))")\n(Background on this error at: http://sqlalche.me/e/e3q8)\n\n'
2021-03-12 05:52:53 DEBUG shared-db-relation-changed OpenStack Release: {}
2021-03-12 05:52:53 WARNING shared-db-relation-changed Traceback (most recent call last):
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/hooks/shared-db-relation-changed", line 1402, in <module>
2021-03-12 05:52:53 WARNING shared-db-relation-changed main()
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/hooks/shared-db-relation-changed", line 1394, in main
2021-03-12 05:52:53 WARNING shared-db-relation-changed hooks.execute(sys.argv)
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/charmhelpers/core/hookenv.py", line 945, in execute
2021-03-12 05:52:53 WARNING shared-db-relation-changed self._hooks[hook_name]()
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/hooks/nova_cc_utils.py", line 1685, in wrapped_f
2021-03-12 05:52:53 WARNING shared-db-relation-changed return f(*args, **kwargs)
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/charmhelpers/contrib/openstack/utils.py", line 1765, in wrapped_f
2021-03-12 05:52:53 WARNING shared-db-relation-changed return restart_on_change_helper(
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/charmhelpers/core/host.py", line 749, in restart_on_change_helper
2021-03-12 05:52:53 WARNING shared-db-relation-changed r = lambda_f()
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/charmhelpers/contrib/openstack/utils.py", line 1766, in <lambda>
2021-03-12 05:52:53 WARNING shared-db-relation-changed (lambda: f(*args, **kwargs)), __restart_map_cache['cache'],
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/hooks/shared-db-relation-changed", line 418, in db_changed
2021-03-12 05:52:53 WARNING shared-db-relation-changed update_cell_db_if_ready()
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/hooks/shared-db-relation-changed", line 158, in update_cell_db_if_ready
2021-03-12 05:52:53 WARNING shared-db-relation-changed ncc_utils.update_cell_database()
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/hooks/nova_cc_utils.py", line 881, in update_cell_database
2021-03-12 05:52:53 WARNING shared-db-relation-changed existing_cells = get_cell_details()
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-nova-cloud-controller-2/charm/hooks/nova_cc_utils.py", line 848, in get_cell_details
2021-03-12 05:52:53 WARNING shared-db-relation-changed out = subprocess.check_output(cmd).decode('utf-8')
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
2021-03-12 05:52:53 WARNING shared-db-relation-changed return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2021-03-12 05:52:53 WARNING shared-db-relation-changed File "/usr/lib/python3.8/subprocess.py", line 512, in run
2021-03-12 05:52:53 WARNING shared-db-relation-changed raise CalledProcessError(retcode, process.args,
2021-03-12 05:52:53 WARNING shared-db-relation-changed subprocess.CalledProcessError: Command '['sudo', 'nova-manage', 'cell_v2', 'list_cells', '--verbose']' returned non-zero exit status 255.
2021-03-12 05:52:53 ERROR juju.worker.uniter.operation runhook.go:136 hook "shared-db-relation-changed" (via explicit, bespoke hook script) failed: exit status 1
2021-03-12 05:52:53 DEBUG juju.machinelock machinelock.go:186 machine lock released for nova-cloud-controller/2 uniter (run relation-changed (195; unit: nova-cloud-controller-mysql-router/0) hook)
2021-03-12 05:52:53 DEBUG juju.worker.uniter.operation executor.go:115 lock released for nova-cloud-controller/2
2021-03-12 05:52:53 INFO juju.worker.uniter resolver.go:143 awaiting error resolution for "relation-changed" hook
2021-03-12 05:52:53 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] error: hook failed: "shared-db-relation-changed"
2021-03-12 05:54:37 DEBUG juju.worker.uniter.remotestate watcher.go:636 update status timer triggered for nova-cloud-controller/2
2021-03-12 05:54:37 INFO juju.worker.uniter resolver.go:143 awaiting error resolution for "relation-changed" hook

But the mysql-router charm associated with it has a queued hook to deal with certificates:
2021-03-12 05:56:50 DEBUG juju-log tracer>
tracer: cleared flag is-update-status-hook
tracer: ++ queue handler hooks/relations/mysql-shared/provides.py:42:joined:shared-db
tracer: ++ queue handler hooks/relations/mysql-shared/provides.py:48:changed:shared-db
tracer: ++ queue handler hooks/relations/tls-certificates/requires.py:109:broken:certificates
tracer: ++ queue handler reactive/mysql_router_handlers.py:18:db_router_request
tracer: ++ queue handler reactive/mysql_router_handlers.py:68:proxy_shared_db_requests
tracer: ++ queue handler reactive/mysql_router_handlers.py:87:proxy_shared_db_responses

The router charm should not signify that the connection is available until it has processed those certificate relations (and maybe verified the connection to mysql).

Revision history for this message
Michael Skalka (mskalka) wrote :
David Ames (thedac)
Changed in charm-mysql-innodb-cluster:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → David Ames (thedac)
milestone: none → 21.04
Revision history for this message
David Ames (thedac) wrote :

TRIAGE:

Workaround:
This is a race condition so simply resolving the primary charm's erred hook will succeed.

We have been seeing this type of failure in the vault charm func tests as well.

Here is the issue:

 * mysql-innodb-cluster gets its certificate from vault
 * mysql-innodb-cluster updates db-router with PASSTHROUGH TLS setting and the CA from vault
 * mysql-innodb-cluster begins a cluster wide rolling restart
 * <primary>-mysql-router updates shared-db with the CA
 * <primary> reacts to shared-db-changed and runs a db_sync or some other DB action which fails because the mysql-innodb-cluster has not completed its rolling restart

What needs doing:

* mysql-innodb-cluster should not update its db-router relation until after rolling restarts have completed.

Optional
* mysql-innodb-cluster could optionally inform the db-router relation it is going through a topology change and the mysql-router could subsequently inform the primary its shared-db relation is incomplete to avoid DB actions
** This may cause more churn than it is worth but it should be explored.

I have made several attempts at having msyql-innodb-cluster wait on updating db-router that is thus far a work in progress.

But the problem is well understood.

Changed in charm-mysql-innodb-cluster:
status: Confirmed → Triaged
Changed in charm-mysql-router:
status: New → Triaged
Revision history for this message
David Ames (thedac) wrote :
Changed in charm-mysql-innodb-cluster:
status: Triaged → In Progress
Changed in charm-mysql-router:
status: Triaged → Invalid
Revision history for this message
David Ames (thedac) wrote :
Changed in charm-mysql-innodb-cluster:
status: In Progress → Fix Committed
Changed in charm-mysql-innodb-cluster:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.