mysql unit goes to error with upgrade-version-a-relation-changed

Bug #2049857 reported by Marian Gasparovic
This bug affects 4 people
Affects: OpenStack Snap
Status: Triaged
Importance: Critical
Assigned to: Unassigned
Milestone: none

Bug Description

During a sunbeam cluster resize, mysql goes into an error state, reporting
"upgrade-version-a-relation-changed".

One of several test runs where we saw this:

https://oil-jenkins.canonical.com/artifacts/51644b5d-71f0-4524-8b63-1d6a76b9a56c/index.html

Carl Csaposs (carlcsaposs) wrote :

Regarding mysql-router-k8s: searching the logs for "Uncaught exception while in charm code" shows that the `upgrade-version-a-relation-changed` error is:

`mysqlsh.DBError: MySQL Error (2003): Shell.connect: Can\'t connect to MySQL server on \'cinder-mysql-primary.openstack.svc.cluster.local:3306\' (111)`
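The log search described above can be scripted. The following is only a minimal sketch: the marker string comes from this comment, while the function name, the `context` parameter, and the sample log lines are hypothetical and do not reflect the actual layout of the collected logs.

```python
from __future__ import annotations

from typing import Iterable, Iterator

# Marker quoted in the comment above.
MARKER = "Uncaught exception while in charm code"


def find_charm_exceptions(lines: Iterable[str], context: int = 5) -> Iterator[list[str]]:
    """Yield each line containing MARKER plus up to `context` following lines."""
    buffered = list(lines)
    for i, line in enumerate(buffered):
        if MARKER in line:
            yield [entry.rstrip("\n") for entry in buffered[i : i + 1 + context]]


# Hypothetical sample input, only to show the shape of the result.
sample = [
    "unit.cinder-mysql-router/0 INFO something unrelated\n",
    "unit.cinder-mysql-router/0 ERROR Uncaught exception while in charm code:\n",
    "Traceback (most recent call last):\n",
    "mysqlsh.DBError: MySQL Error (2003): Shell.connect: ...\n",
]

for block in find_charm_exceptions(sample, context=2):
    print("\n".join(block))
    print("-" * 40)
```

For a real log file, passing the open file object (for example `open(path, errors="replace")`) as `lines` should work the same way.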

mysql-router-k8s rev 86+ improves the UX for this situation (https://github.com/canonical/mysql-router-k8s-operator/pull/190)

It appears that the issue here is with MySQL Server, since even at the end of the logs MySQL Router is still unable to connect to it.
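The `(111)` at the end of the MySQL 2003 error is the operating system errno, and on Linux 111 is ECONNREFUSED, i.e. nothing was accepting connections on that port. A rough way to check this independently of mysqlsh is a plain TCP probe. This is a sketch only: `probe_tcp` is a hypothetical helper, and it would have to run from somewhere that can resolve the cluster-internal service name (such as a pod in the same cluster).

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Return "open" if a TCP connection succeeds, otherwise a short reason."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Corresponds to errno 111 (ECONNREFUSED) on Linux.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Covers DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


# Example call (host and port taken from the error message above):
# print(probe_tcp("cinder-mysql-primary.openstack.svc.cluster.local", 3306))
```

A result of "refused" would be consistent with the error above; "open" would instead point to a problem above the TCP layer.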

James Page (james-page) wrote :

At the point the deployment was declared a failure, that does indeed seem to be the case:

cinder-mysql/0*  maintenance  idle  10.1.239.109  offline
cinder-mysql/1   active       idle  10.1.172.219
cinder-mysql/2   waiting      idle  10.1.109.155  waiting to get cluster primary from peers
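When scanning many such status dumps, the units that are not active can be picked out programmatically. The sketch below assumes the simple five-field layout quoted here (unit, workload, agent, address, message); real `juju status` output can carry extra columns, so `juju status --format json` is probably a more robust source. `UnitStatus` and `parse_unit_line` are hypothetical names.

```python
from typing import NamedTuple


class UnitStatus(NamedTuple):
    unit: str
    leader: bool
    workload: str
    agent: str
    address: str
    message: str


def parse_unit_line(line: str) -> UnitStatus:
    """Parse one unit line of the form: unit workload agent address [message]."""
    parts = line.split(maxsplit=4)
    unit, workload, agent, address = parts[:4]
    message = parts[4] if len(parts) > 4 else ""
    # A trailing "*" on the unit name marks the leader.
    return UnitStatus(unit.rstrip("*"), unit.endswith("*"), workload, agent, address, message)


# Status lines copied from the comment above.
STATUS = """\
cinder-mysql/0* maintenance idle 10.1.239.109 offline
cinder-mysql/1 active idle 10.1.172.219
cinder-mysql/2 waiting idle 10.1.109.155 waiting to get cluster primary from peers
"""

units = [parse_unit_line(line) for line in STATUS.splitlines() if line.strip()]
unhealthy = [u for u in units if u.workload != "active"]

for u in unhealthy:
    role = "leader" if u.leader else "non-leader"
    print(f"{u.unit} ({role}): {u.workload} - {u.message}")
```

On the lines above this flags cinder-mysql/0 (the leader, offline) and cinder-mysql/2 (waiting for the cluster primary).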

James Page (james-page)
Changed in snap-openstack:
status: New → Triaged
importance: Undecided → Critical
Andre Ruiz (andre-ruiz) wrote :

Every time I hit this issue, mysql itself appears as offline (but not all routers go into error).

heat-mysql-router/0*      error        idle  10.1.3.232  hook failed: "update-status"
keystone-mysql-router/0*  error        idle  10.1.3.211  hook failed: "update-status"
mysql/0*                  maintenance  idle  10.1.3.224  offline

Andre Ruiz (andre-ruiz) wrote :

Is this being worked on? I ask because the "Assigned to:" is still "Unassigned". This is very critical and AFAIK is preventing any kind of installation at the moment.

Paulo Machado (paulomachado) wrote :

Hi, yes. I need to test a fix with a full sunbeam deployment, but I'm not sure how to do it.
I'll DM you, Andre.

Paulo Machado (paulomachado) wrote :

Update on this: the juju bug https://bugs.launchpad.net/juju/+bug/2052517 describes the issue we are seeing.

That said, there are some improvements to be made on the charm side to make it more robust, and I'll be working on those.

tags: added: canonical-data-platform-eng
Andre Ruiz (andre-ruiz) wrote :

What is the official status of this work? I know changes to relax the operator aliveness check were discussed, but I'm not sure whether they were actually implemented and, if so, actually released. If they were, in which release?

Anyway, I haven't seen those errors for a while now (probably because it was fixed and the fix is working, but I wanted to check on that).

Andre Ruiz (andre-ruiz) wrote :

Just a ping to keep this alive. I'm actually not seeing this on a regular basis anymore, but I'm still not sure what the resolution was, or what the current state is in case it is still in progress.

James Page (james-page) wrote :

@andre-ruiz - as a bit of an update: some deadlock issues in pebble seem to be the cause of this problem.

Fixes have landed in pebble, and initial testing by the data platform team indicates this has much improved the situation. I need to find out when those fixes will be present in the stable channel of the Juju tracks we're using (3.2 and 3.4 currently).
