Cluster stays blocked after upgrade to jammy

Bug #2011605 reported by Bas de Bruijne
This bug affects 1 person
Affects: MySQL InnoDB Cluster Charm
Status: Triaged
Importance: Undecided
Assigned to: Unassigned

Bug Description

In test run https://solutions.qa.canonical.com/v2/testruns/46610432-1ec4-4e06-b43b-3b73e000d31a, we are upgrading yoga-focal to yoga-jammy. The upgrade of the mysql-innodb-cluster charms is successful, but the cluster stays in the blocked state afterwards with the message: "Cluster is inaccessible from this instance. Please check logs for details."

In the logs we see:
--------------
2023-03-14 13:14:11 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
mysqlsh.Error: Shell Error (51314): Dba.get_cluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active)
--------------

and

--------------
2023-03-14T09:44:18.161421Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.246.168.71:33061 when joining a group. My local port is: 33061.'
2023-03-14T09:44:18.161556Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.246.168.152:33061 when joining a group. My local port is: 33061.'
2023-03-14T09:44:18.161623Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.246.168.71:33061 when joining a group. My local port is: 33061.'
2023-03-14T09:44:18.161632Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 33061'
2023-03-14T09:44:18.239838Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 33061'
--------------
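For reference, this is roughly how group replication state can be confirmed from one of the units (a minimal diagnostic sketch, assuming it is run inside `mysqlsh --py` on a mysql-innodb-cluster unit; the `clusteruser` account name is hypothetical):

```python
# Run inside `mysqlsh --py` on a mysql-innodb-cluster unit.
# The account name is hypothetical; use whatever admin account the charm set up.
shell.connect('clusteruser@localhost:3306')  # prompts for the password

# List the members group replication currently knows about and their state.
# An empty result (or only OFFLINE rows) matches the "GR is not active" error above.
result = session.run_sql(
    "SELECT member_host, member_port, member_state, member_role "
    "FROM performance_schema.replication_group_members")
for row in result.fetch_all():
    print(row[0], row[1], row[2], row[3])
```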

More details can be found in the crashdump: https://oil-jenkins.canonical.com/artifacts/46610432-1ec4-4e06-b43b-3b73e000d31a/generated/generated/openstack/juju-crashdump-openstack-2023-03-14-13.10.13.tar.gz

tags: added: cdo-qa foundations-engine
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

mysql8 clustering is a bit sensitive to networking and delays in responses from other nodes. My guess is that everything just took too long and the clustering code (in mysql8, not the charm) just gave up. In that case it can be quite tricky to recover: basically you have to pick a node, force it to be the lead in the cluster, and then force the other two back into the cluster.
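For what it's worth, a minimal sketch of that manual recovery using the MySQL Shell AdminAPI (run inside `mysqlsh --py` on the unit chosen as the new lead; the `clusteruser` account is hypothetical, the peer address is taken from the logs above, and this is not a charm-provided procedure):

```python
# Run inside `mysqlsh --py` on the node picked to seed the cluster.
# Account name is hypothetical; adjust to the cluster admin user on the unit.
shell.connect('clusteruser@localhost:3306')

# Rebuild the cluster from this instance when group replication is down on all
# members; mysqlsh uses this instance as the seed and tries to rejoin the others.
# It will refuse if it thinks another member has newer transactions, so check
# which node to seed from first.
cluster = dba.reboot_cluster_from_complete_outage()

# Any member that did not rejoin automatically can be brought back explicitly
# (address from the log excerpt above; use the server port, not the GR port 33061).
cluster.rejoin_instance('clusteruser@10.246.168.71:3306')

print(cluster.status())
```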

The logs are full of "Error on opening a connection to ..." repeated over and over, indicating that the other node(s) are simply "not there".

Was this being tested on a resource-constrained system?

This is essentially very similar to https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1917332, for example, except that in that case the lead node was still running.

If all the units were upgraded at the same time (series-upgrade) with no settling between them, then this bug is very similar to https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1907202.

Otherwise it may be something else - mysql8 is tricky!

If you could provide some more context, please, on how the units are being series upgraded, that would be great. Thanks.

Changed in charm-mysql-innodb-cluster:
status: New → Incomplete
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

We use the zaza automation for the series upgrade, specifically zaza.openstack.charm_tests.series_upgrade.parallel_tests.ParallelSeriesUpgradeTest. I can look up the exact steps it takes, but I think you know that better than I do.

This lab is indeed resource constrained, but by the time we do the OpenStack upgrade, no other service should be running resource-intensive tasks.

Changed in charm-mysql-innodb-cluster:
status: Incomplete → New
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I'm beginning to wonder if you *can* series upgrade mysql8 automatically. If the minor version (e.g. 8.0.x) changes when the first unit is series upgraded, then it won't rejoin the cluster, and so it may have to wait until all the units have been updated before the cluster can be reformed. And I don't think the mysql charm code can handle reforming the cluster from 3 separated units after upgrade, although obviously it can form the cluster in the first place.
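If that theory is right, the server versions should disagree across units mid-upgrade. A small check sketch (again assuming `mysqlsh --py` on each unit in turn, with a hypothetical account name):

```python
# Run inside `mysqlsh --py` on each mysql-innodb-cluster unit in turn.
# Account name is hypothetical.
shell.connect('clusteruser@localhost:3306')

# A mismatch in the reported versions between units mid-way through the series
# upgrade would line up with the "won't rejoin until all units are updated" theory.
print(session.run_sql("SELECT VERSION()").fetch_one()[0])
```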

I think that series upgrading mysql8 should be removed from automatic testing for the moment, until we can work out how it should be handled. I'm setting the bug to Triaged, but I don't think anything can be done to 'fix' it until we've investigated it much further.

Changed in charm-mysql-innodb-cluster:
status: New → Triaged
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

This upgrade run is also interesting: https://oil-jenkins.canonical.com/job/fce_openstackupgrade/20019//consoleFull

The innodb-cluster successfully upgrades and the charms look healthy, but then ceph-mon/0 throws this error when it tries to upgrade:

```
Traceback (most recent call last):
  File "./src/charm.py", line 282, in <module>
    main(CephMonCharm)
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 85, in on_pre_series_upgrade
    hooks.pre_series_upgrade()
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/src/ceph_hooks.py", line 1218, in pre_series_upgrade
    set_unit_paused()
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/venv/charmhelpers/contrib/openstack/utils.py", line 1561, in set_unit_paused
    with unitdata.HookData()() as t:
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/venv/charmhelpers/core/unitdata.py", line 464, in __init__
    self.kv = kv()
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/venv/charmhelpers/core/unitdata.py", line 525, in kv
    _KV = Storage()
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/venv/charmhelpers/core/unitdata.py", line 190, in __init__
    self._init()
  File "/var/lib/juju/agents/unit-ceph-mon-0/charm/venv/charmhelpers/core/unitdata.py", line 385, in _init
    self.cursor.execute('''
sqlite3.OperationalError: database is locked
```

The crashdumps for this run are here:
https://oil-jenkins.canonical.com/artifacts/287f98ed-f8fa-4868-a664-0e9dc1b27f44/index.html

Revision history for this message
Billy Olsen (billy-olsen) wrote :

The ceph-mon issue identified is being tracked in bug #2005137
