Possible race in mysql connection, db migration fails when mysql-innodb reports ready

Bug #1915842 reported by Michael Skalka
This bug affects 3 people
Affects: OpenStack Placement Charm
Status: Fix Released
Importance: Critical
Assigned to: David Ames
Milestone: 21.04

Bug Description

As seen in this test run: https://solutions.qa.canonical.com/testruns/testRun/4cd6c458-a775-4735-9bca-64f607453c40
Crashdump: https://oil-jenkins.canonical.com/artifacts/4cd6c458-a775-4735-9bca-64f607453c40/generated/generated/openstack/juju-crashdump-openstack-2021-02-15-21.38.05.tar.gz
Bundle: https://oil-jenkins.canonical.com/artifacts/4cd6c458-a775-4735-9bca-64f607453c40/generated/generated/openstack/bundle.yaml
Full Artifacts: https://oil-jenkins.canonical.com/artifacts/4cd6c458-a775-4735-9bca-64f607453c40/index.html

In this deployment placement/2 attempts to perform a database migration and fails. However, the database reports that it is online, and placement's peers have processed the hook successfully. There may be a race here, or placement-mysql-router may have fed bad information to the placement charm and the hook took too long to execute for it to be updated; it's hard to tell.

juju status:

mysql-innodb-cluster/0* active executing 0/lxd/6 10.246.64.243 Unit is ready: Mode: R/W
mysql-innodb-cluster/1 active idle 2/lxd/7 10.246.64.237 Unit is ready: Mode: R/O
mysql-innodb-cluster/2 active idle 4/lxd/8 10.246.65.23 Unit is ready: Mode: R/O
...
placement/0 active idle 0/lxd/9 10.244.8.158 8778/tcp Unit is ready
  hacluster-placement/2 active idle 10.244.8.158 Unit is ready and clustered
  logrotated/36 active idle 10.244.8.158 Unit is ready.
  placement-mysql-router/2 active idle 10.244.8.158 Unit is ready
  public-policy-routing/25 active idle 10.244.8.158 Unit is ready
placement/1 active idle 2/lxd/10 10.244.8.149 8778/tcp Unit is ready
  hacluster-placement/1 active idle 10.244.8.149 Unit is ready and clustered
  logrotated/20 active idle 10.244.8.149 Unit is ready.
  placement-mysql-router/1 active idle 10.244.8.149 Unit is ready
  public-policy-routing/12 active idle 10.244.8.149 Unit is ready
placement/2* error idle 4/lxd/11 10.244.8.141 8778/tcp hook failed: "identity-service-relation-changed"
  hacluster-placement/0* active idle 10.244.8.141 Unit is ready and clustered
  logrotated/7 active idle 10.244.8.141 Unit is ready.
  placement-mysql-router/0* blocked idle 10.244.8.141 Failed to connect to MySQL
  public-policy-routing/1 active idle 10.244.8.141 Unit is ready

placement/2 juju agent log:

2021-02-15 21:34:53 DEBUG jujuc server.go:211 running hook tool "juju-log" for placement/2-identity-service-relation-changed-6358985528794214749
2021-02-15 21:34:53 ERROR juju-log identity-service:165: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-placement-2/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-placement-2/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-placement-2/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-placement-2/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-placement-2/charm/reactive/placement_handlers.py", line 72, in init_db
    placement_charm.db_migrate()
  File "lib/charm/openstack/placement.py", line 92, in db_migrate
    subprocess.check_call(self.migrate_cmd)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/share/placement/mysql-migrate-db.sh', '--migrate', '--skip-locks', '/etc/placement/migrate-db.rc']' returned non-zero exit status 5.

2021-02-15 21:34:53 WARNING identity-service-relation-changed Traceback (most recent call last):
2021-02-15 21:34:53 WARNING identity-service-relation-changed File "/var/lib/juju/agents/unit-placement-2/charm/hooks/identity-service-relation-changed", line 22, in <module>
2021-02-15 21:34:53 WARNING identity-service-relation-changed main()
2021-02-15 21:34:53 WARNING identity-service-relation-changed File "/var/lib/juju/agents/unit-placement-2/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
2021-02-15 21:34:53 WARNING identity-service-relation-changed bus.dispatch(restricted=restricted_mode)
2021-02-15 21:34:53 WARNING identity-service-relation-changed File "/var/lib/juju/agents/unit-placement-2/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
2021-02-15 21:34:53 WARNING identity-service-relation-changed _invoke(other_handlers)
2021-02-15 21:34:53 WARNING identity-service-relation-changed File "/var/lib/juju/agents/unit-placement-2/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
2021-02-15 21:34:53 WARNING identity-service-relation-changed handler.invoke()
2021-02-15 21:34:53 WARNING identity-service-relation-changed File "/var/lib/juju/agents/unit-placement-2/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
2021-02-15 21:34:53 WARNING identity-service-relation-changed self._action(*args)
2021-02-15 21:34:53 WARNING identity-service-relation-changed File "/var/lib/juju/agents/unit-placement-2/charm/reactive/placement_handlers.py", line 72, in init_db
2021-02-15 21:34:53 WARNING identity-service-relation-changed placement_charm.db_migrate()
2021-02-15 21:34:53 WARNING identity-service-relation-changed File "lib/charm/openstack/placement.py", line 92, in db_migrate
2021-02-15 21:34:53 WARNING identity-service-relation-changed subprocess.check_call(self.migrate_cmd)
2021-02-15 21:34:53 WARNING identity-service-relation-changed File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
2021-02-15 21:34:53 WARNING identity-service-relation-changed raise CalledProcessError(retcode, cmd)
2021-02-15 21:34:53 WARNING identity-service-relation-changed subprocess.CalledProcessError: Command '['/usr/share/placement/mysql-migrate-db.sh', '--migrate', '--skip-locks', '/etc/placement/migrate-db.rc']' returned non-zero exit status 5.

Tags: cdo-qa
Revision history for this message
Marian Gasparovic (marosg) wrote :
Revision history for this message
David Ames (thedac) wrote :

Analysis:

The mysql-router charm's change [0], which handles mysql-router 8.0.23's change in behavior, introduced a service restart when the charm receives TLS information over the relation from mysql-innodb-cluster. This restart appears to be causing the fault.

      placement/2:
        workload-status:
          current: error
          message: 'hook failed: "identity-service-relation-changed" for keystone:identity-service'
          since: 15 Feb 2021 21:34:53Z
        juju-status:
          current: idle
          since: 15 Feb 2021 21:34:53Z
          version: 2.8.8
        leader: true
        machine: 4/lxd/11
        open-ports:
        - 8778/tcp
        public-address: 10.244.8.141
        subordinates:

        <snip>

          placement-mysql-router/0:
            workload-status:
              current: blocked
              message: Failed to connect to MySQL
              since: 15 Feb 2021 21:21:47Z

At 21:21:46 the mysql-router service restarts

Feb 15 21:21:46 juju-3585b1-4-lxd-11 systemd[1]: placement-mysql-router.service: Succeeded.
Feb 15 21:21:46 juju-3585b1-4-lxd-11 start.sh[36704]: PID 36704 written to '/var/lib/mysql/placement-mysql-router/mysqlrouter.pid'

At which point it is unable to access the database cluster, and it never seems to recover.

2021-02-15 21:21:47 DEBUG juju-log db-router:182: Opening db connection for mysqlrouteruser@127.0.0.1
2021-02-15 21:21:47 ERROR juju-log db-router:182: Failed to connect to database due to '(2003, "Can't connect to MySQL server on '127.0.0.1' (111)")'
2021-02-15 21:21:47 DEBUG juju-log db-router:182: Could not connect to db

Placement believes it has all of the information required and attempts the db migration, which fails.

It is as yet unclear to me why the mysql-router lost the ability to reach the database cluster after the restart. I would need to see the contents of /var/lib/mysql/placement-mysql-router to determine that; if a live system is available, the contents of /var/lib/mysql/<name>-mysql-router would be helpful.

Next steps:

1) Attempt to reproduce and inspect /var/lib/mysql/<name>-mysql-router (example inspection commands below)
2) Consider informing the "client" service via the shared-db relation when the mysql-router cannot communicate with the database cluster.

[0] https://github.com/openstack/charm-mysql-router/commit/5245a1592a0d67c03ad070ff0d634951a3bdd070
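
For reference, a hedged sketch of the inspection in step 1, using the unit and directory names from this deployment; the mysql probe, port and placeholder credentials are illustrative assumptions, not taken from the crashdump:

juju ssh placement/2 'sudo ls -laR /var/lib/mysql/placement-mysql-router'
# probe the local router's RW port directly, with whatever credentials the
# shared-db relation provided (placeholder shown here)
juju ssh placement/2 "mysql -h 127.0.0.1 -P 3306 -u <db-user> -p -e 'SELECT 1'"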

Revision history for this message
Frode Nordahl (fnordahl) wrote :
Changed in charm-placement:
status: New → Confirmed
Revision history for this message
Michael Skalka (mskalka) wrote :

Another instance of this here: https://solutions.qa.canonical.com/testruns/testRun/7e7fb519-5b66-4557-a230-02b58280d370

@David How large is that /var/lib/mysql/placement-mysql-router file? We can add it to our log collection if it's not massive.

Revision history for this message
David Ames (thedac) wrote :

@Michael,

It is a directory. In a test deploy it is only 68K. It should be fine to add to the juju-crashdump.

I am running tests in a loop trying to capture this bug.

tags: added: cdo-qa
Revision history for this message
Michael Skalka (mskalka) wrote :

@David

Ack, we added it to juju-crashdump. As soon as we can rebuild the snap it should get picked up in future runs.

Revision history for this message
David Ames (thedac) wrote :

I believe I have caught an example of this. The router is stuck after a metadata change from the cluster; in this case the change happens when we update the TLS certificates to use Vault certificates and the cluster nodes do a rolling restart.

2021-02-18 22:41:04 metadata_cache INFO [7fb9c9d7c700] Potential changes detected in cluster 'jujuCluster' after metadata refresh
2021-02-18 22:41:04 metadata_cache INFO [7fb9c9d7c700] Metadata for cluster 'jujuCluster' has 1 replicasets:
2021-02-18 22:41:04 metadata_cache INFO [7fb9c9d7c700] 'default' (3 members, single-primary)
2021-02-18 22:41:04 metadata_cache INFO [7fb9c9d7c700] 10.5.0.4:3306 / 33060 - mode=RW
2021-02-18 22:41:04 metadata_cache INFO [7fb9c9d7c700] 10.5.0.26:3306 / 33060 - mode=RO
2021-02-18 22:41:04 metadata_cache INFO [7fb9c9d7c700] 10.5.0.35:3306 / 33060 - mode=RO
2021-02-18 22:41:04 routing INFO [7fb9c9d7c700] Routing routing:jujuCluster_x_ro listening on 3309 and named socket /var/lib/mysql/vault-mysql-router/mysqlxro.sock got request to disconnect invalid connections: metadata change
2021-02-18 22:41:04 routing INFO [7fb9c9d7c700] Routing routing:jujuCluster_x_rw listening on 3308 and named socket /var/lib/mysql/vault-mysql-router/mysqlx.sock got request to disconnect invalid connections: metadata change
2021-02-18 22:41:04 routing INFO [7fb9c9d7c700] Routing routing:jujuCluster_rw listening on 3306 and named socket /var/lib/mysql/vault-mysql-router/mysql.sock got request to disconnect invalid connections: metadata change
2021-02-18 22:41:04 routing INFO [7fb9c9d7c700] Routing routing:jujuCluster_ro listening on 3307 and named socket /var/lib/mysql/vault-mysql-router/mysqlro.sock got request to disconnect invalid connections: metadata change
2021-02-18 22:41:04 metadata_cache WARNING [7fb9cbd80700] Instance '10.5.0.26:3306' [d43c3cfb-7235-11eb-9682-fa163e85e1ed] of replicaset 'default' is unreachable. Increasing metadata cache refresh frequency.

The mysql-router charm gets the following error when it tries to check its connectivity to the cluster:

MySQLdb._exceptions.OperationalError: (2013, "Lost connection to MySQL server at 'waiting for initial communication packet', system error: 110")

It will fail in this way indefinitely, until the service is restarted. Once restarted it returns to normal operation.

WORK AROUND

Restart the mysql-router service
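
For example (a hedged illustration using the failed unit from this run; the systemd service name matches the journal excerpt above):

juju ssh placement/2 'sudo systemctl restart placement-mysql-router'
# then retry the failed hook on the principal unit
juju resolved placement/2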

TRIAGE:

1) MySQL Router has several metadata configuration options [0], notably ttl, whose default changed in the most recent version. Investigate how the following interact and whether they help in this case (an illustrative excerpt of the relevant config section follows after this list):

auth_cache_refresh_interval Time between auth-cache refresh attempts
auth_cache_ttl Time until the cache becomes invalid if not refreshed
ttl Time To Live, in seconds

2) If those do not resolve the issue, the next step is to make the restart-on-change handler retry until a validated connection to the cluster is established.

3) The mysql-router charm should inform its shared-db client when it cannot access the DB. Care must be taken not to create thrash.

[0] https://dev.mysql.com/doc/mysql-router/8.0/en/mysql-router-conf-options.html
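
For orientation, these options live in the router's metadata_cache section. A hedged, illustrative excerpt (cluster name and user are taken from the logs in this bug, the values are examples, and only the option names come from the MySQL Router documentation [0]):

[metadata_cache:jujuCluster]
user=mysqlrouteruser
metadata_cluster=jujuCluster
# refresh the cluster metadata every 5 seconds (the previous default)
ttl=5
# keep auth-cache entries until explicitly refreshed
auth_cache_ttl=-1
# refresh the auth cache every 2 seconds
auth_cache_refresh_interval=2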

Changed in charm-placement:
status: Confirmed → Triaged
importance: Undecided → Critical
assignee: nobody → David Ames (thedac)
milestone: none → 21.04
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :
Revision history for this message
David Ames (thedac) wrote :

WIP PR [0] starts the process of allowing the Metadata Cache Options to be updated. The next step is to test if changes to these values do, in fact, help fix this bug.

[0] https://review.opendev.org/c/openstack/charm-mysql-router/+/776764
[1] https://dev.mysql.com/doc/mysql-router/8.0/en/mysql-router-conf-options.html

Changed in charm-placement:
status: Triaged → In Progress
Revision history for this message
Michael Skalka (mskalka) wrote :

Marking this crit as it's impacting SQA stable releases as well as the FCE release gate with no known workaround for our CI.

Revision history for this message
David Ames (thedac) wrote :

Taking a two-pronged attack at this problem. The first [0] is the ability to change time-to-live values. Setting ttl=5 seconds (which was the previous default) significantly reduced the number of times we hit the problem.

The second [1] [2] is to retry on restart when we get the telltale MySQLdb._exceptions.OperationalError (2013, "Lost connection to MySQL server at 'waiting for initial communication packet'"); a sketch of this idea follows below.

[0] https://review.opendev.org/c/openstack/charm-mysql-router/+/776764
[1] https://review.opendev.org/c/openstack/charms.openstack/+/778332
[2] https://review.opendev.org/c/openstack/charm-mysql-router/+/778333
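
A minimal sketch of that retry-on-restart idea, assuming hypothetical helpers restart_service() and check_connection() rather than the real charms.openstack API:

# Hedged sketch only; restart_service/check_connection are placeholders.
import time

import MySQLdb

LOST_CONNECTION = 2013  # "Lost connection to MySQL server ..."

def restart_and_verify(restart_service, check_connection, attempts=10, delay=5):
    """Restart mysql-router, then retry a connection check until it passes."""
    restart_service()
    for _ in range(attempts):
        try:
            check_connection()  # e.g. SELECT 1 through the local router
            return True
        except MySQLdb.OperationalError as err:
            if err.args[0] != LOST_CONNECTION:
                raise  # unrelated failure, surface it
            time.sleep(delay)  # router not routing yet, wait and retry
    return False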

Revision history for this message
David Ames (thedac) wrote :

The collapsed solution [0] has landed; the other PRs were abandoned in its favour.

Please validate with cs:~openstack-charmers-next/mysql-router

[0] https://review.opendev.org/c/openstack/charm-mysql-router/+/776764
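
For an existing deployment, that validation could look something like this (hedged example; the application name placement-mysql-router is the one from the status output above):

juju upgrade-charm placement-mysql-router --switch cs:~openstack-charmers-next/mysql-router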

Changed in charm-placement:
status: In Progress → Fix Committed
Revision history for this message
Billy Olsen (billy-olsen) wrote :

With the work-around provided to use the mysql-router charm in the openstack-charmers-next branch, I am moving this from field-critical to field-high.

Revision history for this message
Alexander Balderson (asbalderson) wrote :

I feel like I've bumped into this issue again on a heat unit when using the charmers-next mysql-router.

heat unit 1 went into error while trying to migrate the database:

2021-03-28 06:21:23 INFO juju-log shared-db:146: Migrating the heat database.
2021-03-28 06:21:26 DEBUG shared-db-relation-changed 2021-03-28 06:21:26.752 87238 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 10 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))")^[[00m
2021-03-28 06:21:36 DEBUG shared-db-relation-changed 2021-03-28 06:21:36.770 87238 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 9 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))")^[[00m

2021-03-28 06:23:09 WARNING shared-db-relation-changed ERROR: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))")
2021-03-28 06:23:09 WARNING shared-db-relation-changed (Background on this error at: http://sqlalche.me/e/e3q8)
2021-03-28 06:23:09 DEBUG shared-db-relation-changed OpenStack Release: {}
2021-03-28 06:23:09 WARNING shared-db-relation-changed Traceback (most recent call last):
2021-03-28 06:23:09 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-heat-1/charm/hooks/shared-db-relation-changed", line 470, in <module>
2021-03-28 06:23:09 WARNING shared-db-relation-changed main()
2021-03-28 06:23:09 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-heat-1/charm/hooks/shared-db-relation-changed", line 463, in main
2021-03-28 06:23:09 WARNING shared-db-relation-changed hooks.execute(sys.argv)
2021-03-28 06:23:09 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-heat-1/charm/hooks/charmhelpers/core/hookenv.py", line 945, in execute
2021-03-28 06:23:09 WARNING shared-db-relation-changed self._hooks[hook_name]()
2021-03-28 06:23:09 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-heat-1/charm/hooks/charmhelpers/core/host.py", line 725, in wrapped_f
2021-03-28 06:23:09 WARNING shared-db-relation-changed return restart_on_change_helper(
2021-03-28 06:23:09 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-heat-1/charm/hooks/charmhelpers/core/host.py", line 749, in restart_on_change_helper
2021-03-28 06:23:09 WARNING shared-db-relation-changed r = lambda_f()
2021-03-28 06:23:09 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-heat-1/charm/hooks/charmhelpers/core/host.py", line 726, in <lambda>
2021-03-28 06:23:09 WARNING shared-db-relation-changed (lambda: f(*args, **kwargs)), restart_map, stopstart,
2021-03-28 06:23:09 WARNING shared-db-relation-changed File "/var/lib/juju/agents/unit-heat-1/charm...


Revision history for this message
David Ames (thedac) wrote :

Alexander,

Hi, landing the changes to mysql-router for this bug revealed the race condition LP Bug #1918953 [0], which had a fix land on 2021-04-02.

Can you please test with both the next version of mysql-router and the next version of mysql-innodb-cluster? I believe with both solutions in place the race condition has been resolved.

[0] https://bugs.launchpad.net/charm-mysql-router/+bug/1918953

Changed in charm-placement:
status: Fix Committed → Fix Released