remove-instance force=true only works if mysql service is not running

Bug #2006759 reported by Rodrigo Barbieri
This bug affects 1 person
Affects                     Status   Importance  Assigned to  Milestone
MySQL InnoDB Cluster Charm  Triaged  High        Unassigned

Bug Description

On a fresh 3-unit jammy deployment using charm revision 39 from the 8.0/stable channel, I attempted to remove an instance (hitting bug LP#1954306) and then tried to add it back. The removed instance eventually got stuck in the following state:

{"address": "10.5.3.85:3306",
      "instanceErrors": ["ERROR: GR Recovery channel receiver stopped with an error:
      Fatal error: Invalid (empty) username when attempting to connect to the master
      server. Connection attempt terminated. (13117) at 2023-02-09 15:42:58.656640"],
      "mode": "R/O", "readReplicas": {}, "recovery": {"receiverError": "Fatal error:
      Invalid (empty) username when attempting to connect to the master server. Connection
      attempt terminated.", "receiverErrorNumber": 13117, "state": "CONNECTION_ERROR"}
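
For scripted checks, a member stuck like this can be recognised from the fields visible in the fragment above. The following is a minimal Python sketch, assuming the per-member status entry has already been loaded into a dict. The helper name `recovery_is_broken` and the sample dict are hypothetical and not part of the charm or mysql-shell.

```python
def recovery_is_broken(member: dict) -> bool:
    """Return True if a member's status entry reports a failed GR recovery channel.

    Only keys visible in the status fragment are used:
    "instanceErrors" and "recovery" -> "state".
    """
    recovery = member.get("recovery") or {}
    if recovery.get("state") == "CONNECTION_ERROR":
        return True
    # Fall back to the free-text error list for the same condition.
    errors = member.get("instanceErrors") or []
    return any("GR Recovery channel" in err for err in errors)


# Sample entry trimmed from the status shown in this report.
member = {
    "address": "10.5.3.85:3306",
    "instanceErrors": ["ERROR: GR Recovery channel receiver stopped with an error"],
    "mode": "R/O",
    "recovery": {"receiverErrorNumber": 13117, "state": "CONNECTION_ERROR"},
}
print(recovery_is_broken(member))
```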

Attempting to remove this instance from the cluster now fails with:

output: "Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\n\e[31mERROR:
      \e[0m10.5.3.85:3306 is reachable but has state ERROR\nTo safely remove it from
      the cluster, it must be brought back ONLINE. If not possible, use the 'force'
      option to remove it anyway.\nTraceback (most recent call last):\n File \"<string>\",
      line 3, in <module>\nmysqlsh.Error: Shell Error (51004): Cluster.remove_instance:
      Instance is not ONLINE and cannot be safely removed\n"

I retried this on a fresh deployment where the instance was not in an error state. After trying to remove the instance and hitting bug LP#1954306 again, the instance was OFFLINE, and the error message with force=true is:

    output: "Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\n\e[36mNOTE:
      \e[0m10.5.2.0:3306 is reachable but has state OFFLINE\nThe instance will be
      removed from the InnoDB cluster. Depending on the instance\nbeing the Seed or
      not, the Metadata session might become invalid. If so, please\nstart a new session
      to the Metadata Storage R/W instance.\n\nmysqlsh: /build/mysql-shell/parts/mysql-shell/src/modules/adminapi/cluster/cluster_impl.cc:1831:
      std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>
      >, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>
      >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>
      > > >, bool> mysqlsh::dba::Cluster_impl::get_replication_user(const mysqlshdk::mysql::IInstance&)
      const: Assertion `!recovery_user.empty()' failed.\n"
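
Both outputs above report the instance state in the same "is reachable but has state ..." line, so a pre-check could read it before deciding how to proceed. The following is a rough Python sketch, assuming the action output has already been decoded so that the `\e` and `\n` escapes are real characters. The function name `reported_state` is hypothetical.

```python
import re

# Strip ANSI colour sequences such as ESC[31m before matching.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
STATE_LINE = re.compile(
    r"(\d{1,3}(?:\.\d{1,3}){3}:\d+) is reachable but has state ([A-Z_]+)"
)


def reported_state(output: str):
    """Return (address, state) parsed from mysqlsh output, or None if absent."""
    match = STATE_LINE.search(ANSI_ESCAPE.sub("", output))
    return (match.group(1), match.group(2)) if match else None


sample = (
    "\x1b[36mNOTE: \x1b[0m10.5.2.0:3306 is reachable but has state OFFLINE\n"
    "The instance will be removed from the InnoDB cluster."
)
print(reported_state(sample))
```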

The only way I found to finally remove the instance is to SSH into it, stop the mysql service, and then retry with force=true.
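
The workaround can be sketched as the two commands below, built here in Python without executing anything. Several details are assumptions rather than facts from this report: the Juju 2.9 style `juju run-action --wait` syntax, the systemd unit name `mysql`, the action parameter name `address`, and the unit names. The helper `workaround_commands` is hypothetical.

```python
import shlex
from typing import List


def workaround_commands(stuck_unit: str, action_unit: str, address: str) -> List[List[str]]:
    """Build (but do not run) the commands for the workaround.

    stuck_unit  -- unit whose instance is stuck (assumed name)
    action_unit -- a healthy unit on which to run the action (assumed name)
    address     -- address of the instance to remove
    """
    return [
        # 1. Stop mysql on the stuck unit (systemd unit name is an assumption).
        ["juju", "ssh", stuck_unit, "sudo", "systemctl", "stop", "mysql"],
        # 2. Retry the removal with force=true from another unit.
        ["juju", "run-action", "--wait", action_unit,
         "remove-instance", f"address={address}", "force=true"],
    ]


for cmd in workaround_commands(
    "mysql-innodb-cluster/2", "mysql-innodb-cluster/0", "10.5.3.85"
):
    print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch safe to try; on newer Juju releases the action invocation syntax differs.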

Also, the only situation where force=true works is when the instance is healthy (ONLINE status). In that case force=true should not be needed, and using it hits bug LP#1983158.

I would expect force=true to force the removal regardless of the instance's state, without failing and without requiring the mysql service to be stopped. If the current behaviour is expected, it should be documented in the action description for the force parameter.

Tags: sts
tags: added: sts
description: updated
Alex Kavanagh (ajkavanagh) wrote :

Triaging to High; the charm action ought to do what it says it does, even if that requires the action to perform additional checks beforehand. The problem is that the action runs on a unit other than the one being removed, so it is at the mercy of what mysql-shell can do with respect to the mysql instances running on the other units. That is, if the other unit's mysql isn't running and the mysql-shell command fails, then there's not a lot we can do (I suspect?).

However, perhaps we could do better at documenting what to do in each scenario, and ensure that we provide actions to resolve those situations.

Changed in charm-mysql-innodb-cluster:
importance: Undecided → High
status: New → Triaged
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

@Alex: Thanks for triaging this. This is one of the usability issues I am also trying to address in [1] (see the actions.yaml change). I found that it is better to always use force=true, which prevents a snowball of errors that are much harder to sort out; it then only causes the SUPER_READ_ONLY issue, which is easier to address.

[1] https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/875041
