[focal] charm becomes blocked with workload-status "Failed to connect to MySQL"

Bug #1907250 reported by Alex Kavanagh
This bug affects 6 people
Affects             Status     Importance  Assigned to  Milestone
MySQL Router Charm  Triaged    High        Unassigned
mysql-8.0 (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

In a fully deployed focal-ussuri model, the update-status hook on the mysql-router subordinate can hang, meaning that the hook never ends. Because update-status is not reported in the textual output of "juju status", you can't actually see the problem.

However, any attempt to use "juju run", actions, or any other interaction with the *machine* that the unit is on will fail, thus blocking actions for all of the units on that machine. The payloads continue running normally.

I'm still trying to determine the nature of the hang.

summary: - update-status hook hangs (but you can't see it!)
+ [focal-ussuri] update-status hook hangs (but you can't see it!)
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote : Re: [focal-ussuri] update-status hook hangs (but you can't see it!)

Note: eventually the hook times out and errors. Running "juju resolved <unit>/0" for the mysql-router unit then completes fine. So it's probably a race somewhere that deadlocks in the router unit; after inspecting the code, it looks like it might be around setting data on the relation.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

So the problem is really in charmhelpers, in the mysql.py module. The connect method - according to https://github.com/PyMySQL/mysqlclient/blob/401ca8297439d8e34fff0ebda19bf6121de5d2ed/MySQLdb/connections.py#L71 - needs a connect_timeout named parameter, otherwise it will hang forever.

The issue is that the local instance that the mysql_router is meant to connect to isn't responding, but no error is being generated. There are two steps to solving the problem:

1. Make it obvious that it is happening - done by adding a connect_timeout to the connection call and allowing this to raise an exception which can be caught, so that the charm correctly goes into the error state (see the sketch after this list).

2. Work out why it's not working and do something about it. It could be due to upgrading the mysql units (during openstack-upgrade) causing a connection failure/timeout, or some other issue. Once the failure is visible, a fix becomes possible.
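
For illustration, here is a minimal sketch of the change in step 1 (hypothetical host and credentials, not the charm's actual code); without connect_timeout, MySQLdb.connect() can block indefinitely when the local router endpoint never responds:

    import MySQLdb

    # Hypothetical values for illustration; the real host/credentials come from
    # the charm's relation data.
    connection = MySQLdb.connect(
        host="127.0.0.1",
        user="mysqlrouteruser",
        passwd="example-password",
        port=3306,
        connect_timeout=30,  # seconds; raises OperationalError instead of hanging
    )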

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :
summary: - [focal-ussuri] update-status hook hangs (but you can't see it!)
+ [focal] charm becomes blocked with workload-status "Failed to connect to
+ MySQL"
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

For future travelers: the mentioned charm-helpers fix [0] adds a timeout so that the hook errors out instead of hanging forever. So the bug isn't fixed per se, but we now get a hook error instead of a hang.

[0]: https://github.com/juju/charm-helpers/pull/550

Changed in charm-mysql-router:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Zhanglei Mao (zhanglei-mao) wrote :

I got this issue too (20.04 + Ussuri + 20.10 charms) - it seems one of the 3 cluster services would get into this state.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Hi Zhanglei Mao, do you have any further details about what might be triggering it on your system? If you retry the hook, it will (almost certainly) clear, so we are still not sure what is causing the problem.

Which version of Juju are you using? Anything unusual about the networking?

Revision history for this message
Marian Gasparovic (marosg) wrote :

Alex, we have been seeing this in QA for some time and we put all occurrences (16 of them) under this
https://bugs.launchpad.net/charm-placement/+bug/1915842

It was always during deployment and it always affected one of the three units, which was blocked with "Failed to connect to MySQL"; the other two were active and ready.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Hi Marian; the linked bug is marked fixed-committed. Are you still seeing this bug post 2021-03-08 with openstack-charmers-next charms? If so, then we either have something else going on, or the original bug wasn't fixed.

Thanks!

Changed in charm-mysql-router:
importance: Medium → High
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

I can consistently reproduce this in my lab when doing an upgrade from charmstore rev6 to charmhub rev15 or rev22. The error message is the same as described in this LP, but I don't think the cause is the lack of a timeout.

Revision history for this message
Felipe Reyes (freyes) wrote :

I'm seeing this error:

2022-05-03 22:04:39 io INFO [7f91c2e53e00] starting 8 io-threads, using backend 'linux_epoll'
2022-05-03 22:04:39 main ERROR [7f91c2e53e00] Error: option 'DEFAULT.name' is not supported

The error can be reproduced using this bundle http://paste.ubuntu.com/p/mdY9dJhktH/ + the patch recently merged at https://review.opendev.org/c/openstack/charm-mysql-router/+/834359 (probably at this point in time the change should be available in the latest/edge channel).

When I generate the bootstrap files that mysqlrouter creates, we can see that the `name` key is added by mysqlrouter itself and not by the charm; see below:

root@juju-b70e35-0-lxd-6:/var/lib/mysql/vault-mysql-router# /usr/bin/mysqlrouter --user mysql --name keystone-mysql-router --bootstrap mysqlrouteruser:3f4m6w6r2HFjGkfXnbHP3Mr6mphcpxys@10.246.114.60 --directory /var/lib/mysql/keystone-mysql-router --conf-use-sockets --conf-bind-address 127.0.0.1 --conf-base-port 3306 --disable-rest --force
# Bootstrapping MySQL Router instance at '/var/lib/mysql/keystone-mysql-router'...

- Creating account(s) (only those that are needed, if any)
- Verifying account (using it to run SQL queries that would be run by Router)
- Storing account in keyring
- Adjusting permissions of generated files
- Creating configuration /var/lib/mysql/keystone-mysql-router/mysqlrouter.conf

# MySQL Router 'keystone-mysql-router' configured for the InnoDB Cluster 'jujuCluster'

After this MySQL Router has been started with the generated configuration

    $ /usr/bin/mysqlrouter -c /var/lib/mysql/keystone-mysql-router/mysqlrouter.conf

InnoDB Cluster 'jujuCluster' can be reached by connecting to:

## MySQL Classic protocol

- Read/Write Connections: localhost:3306, /var/lib/mysql/keystone-mysql-router/mysql.sock
- Read/Only Connections: localhost:3307, /var/lib/mysql/keystone-mysql-router/mysqlro.sock

## MySQL X protocol

- Read/Write Connections: localhost:3308, /var/lib/mysql/keystone-mysql-router/mysqlx.sock
- Read/Only Connections: localhost:3309, /var/lib/mysql/keystone-mysql-router/mysqlxro.sock

root@juju-b70e35-0-lxd-6:/var/lib/mysql/vault-mysql-router# less /var/lib/mysql/keystone-mysql-router/mysqlrouter.conf
root@juju-b70e35-0-lxd-6:/var/lib/mysql/vault-mysql-router# grep -B1 name /var/lib/mysql/keystone-mysql-router/mysqlrouter.conf
[DEFAULT]
name=keystone-mysql-router

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mysql-8.0 (Ubuntu):
status: New → Confirmed
Revision history for this message
Liam Young (gnuoy) wrote :

One of the causes of a charm going into a "Failed to connect to MySQL" state is that a connection to the database failed when the db-router charm attempted to restart the db-router service. Currently the charm will only retry the connection in response to a single return code from MySQL: 2013, which is "Lost connection to MySQL server during query" *1. However, if the connection fails to be established in the first place, the error returned is 2003, "Can't connect to MySQL server on...", which is not retried.

*1 https://dev.mysql.com/doc/mysql-errors/8.0/en/client-error-reference.html
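
To illustrate the gap, here is a hypothetical sketch (not the charm's actual code) of a retry loop that treats both client error codes as retryable:

    import time

    import MySQLdb

    # 2013 = "Lost connection to MySQL server during query"
    # 2003 = "Can't connect to MySQL server on ..."
    RETRYABLE_ERRORS = (2003, 2013)

    def connect_with_retry(attempts=5, delay=5, **conn_args):
        # conn_args (host, user, passwd, ...) are whatever the charm would pass.
        for attempt in range(1, attempts + 1):
            try:
                return MySQLdb.connect(connect_timeout=30, **conn_args)
            except MySQLdb.OperationalError as exc:
                if exc.args[0] not in RETRYABLE_ERRORS or attempt == attempts:
                    raise
                time.sleep(delay)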

Revision history for this message
Liam Young (gnuoy) wrote :
tags: added: cdo-qa foundations-engine