kolla-ansible

Bug #2002465
Comment #3

Boris Lukashev (rageltman) wrote on 2023-05-07 (last edit on 2023-05-07):

I am seeing the same thing on 22.04: mariadbd listens on the node's IP, port 3306, when bootstrapped on the initial node, but on subsequent nodes the only listener on 3306 is haproxy on the shared VIP. mariadbd never reaches a listening state there, even after restarting the container.
Path MTU checks out in my cluster, there are no network errors, and I have an identical Xena setup on 20.04 running next to it, so I'm pretty sure it's not the network.
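
For anyone comparing nodes, a minimal sketch of a listener check follows. It only attempts TCP connections, so it shows whether something accepts on a port, not which process owns it. The port list is an assumption based on the usual Galera defaults (3306 client, 4444 SST, 4567 replication, 4568 IST), and `probe_cluster` is a hypothetical helper name, not part of kolla-ansible.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def probe_cluster(nodes, ports, timeout: float = 3.0):
    """Map each (node, port) pair to whether something is accepting connections."""
    return {
        (node, port): port_open(node, port, timeout)
        for node in nodes
        for port in ports
    }
```

Calling `probe_cluster(["10.217.122.11"], [3306, 4444, 4567, 4568])` from another controller would probe the joiner address that appears in the log; the remaining node addresses would have to be filled in by hand. Port 4444 is normally only open on a joiner while an SST is in progress, so a closed 4444 outside that window is expected.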

The slaves don't complete their SST pull:
```
230507 13:41:37 mysqld_safe WSREP: Running position recovery with --disable-log-error --pid-file='/var/lib/mysql//svsc-osm01-recover.pid'
2023-05-07 13:41:38 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.217.122.11:4444' --datadir '/var/lib/mysql/' --parent 231 --progress 0 --binlog 'mysql-bin' --mysqld-args --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/usr/lib/mysql/plugin --wsrep_provider=/usr/lib/galera/libgalera_smm.so --wsrep_on=ON --log-error=/var/log/kolla/mariadb/mariadb.log --pid-file=/var/lib/mysql/mariadb.pid --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1'
WSREP_SST: [ERROR] Possible timeout in receiving first data from donor in gtid stage: exit codes: 124 0 (20230507 13:46:39.211)
WSREP_SST: [ERROR] Cleanup after exit with status: 32 (20230507 13:46:39.214)

```
but no error is reported on the first node bootstrapped (the supposed donor). Once the two slaves fail, ansible restarts the master, and it spits out:
```
2023-05-07 14:05:41 0 [Warning] WSREP: No re-merged primary component found.
2023-05-07 14:05:41 0 [Warning] WSREP: No bootstrapped primary component found.
2023-05-07 14:05:41 0 [ERROR] WSREP: ./gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum.
```
resulting in a dead deployment. Attempting to recover mariadb has no effect.
It almost looks like there _isn't any data to replicate_ to begin with on the bootstrapping node...
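
Going back to the joiner log above, the gap between the `wsrep_sst_mariabackup` start line and the first `WSREP_SST: [ERROR]` line can be worked out directly from the two timestamps. The sketch below does only that arithmetic; the function name is illustrative. Exit status 124 is what GNU `timeout` returns when it has to kill a command, so reading the failure as "the joiner waited out a timer without receiving anything" is an inference from the log, not something confirmed from the SST script.

```python
from datetime import datetime


def elapsed_seconds(start: str, error: str) -> float:
    """Seconds between the SST start line and the SST error line.

    `start` uses the server log format, `error` the WSREP_SST suffix format.
    """
    t_start = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
    t_error = datetime.strptime(error, "%Y%m%d %H:%M:%S.%f")
    return (t_error - t_start).total_seconds()


# Timestamps copied verbatim from the joiner log in this comment.
waited = elapsed_seconds("2023-05-07 13:41:38", "20230507 13:46:39.211")
print(f"joiner waited {waited:.3f} s before giving up")
```

The result is just over five minutes, which looks like a fixed timer expiring rather than a transfer failing partway. Whether that matches the SST initial timeout configured in this image is an assumption worth checking against the container's Galera config.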

On Xena/20.04, we had to set enable_mariadb_clustercheck: "yes" for Ubuntu to work properly; here, it has no effect.

The failure occurs within mariadb's cluster setup, in master->slave replication, even though the bare-metal hosts running these containers can `ping -M do -s 9072` each other (their NIC MTUs are between 9000 and 9100). No firewalls are running on the hosts that would block communication between the mariadb containers.
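
As a cross-check on the ping size used above: with the don't-fragment bit set, the largest ICMP payload that fits is the MTU minus the IPv4 header (20 bytes without options) and the ICMP header (8 bytes). A minimal sketch, assuming IPv4 and iputils `ping`; the helper names are illustrative.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest `ping -s` value that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(host: str, mtu: int, count: int = 3) -> list[str]:
    """Build an iputils ping invocation that forbids fragmentation (-M do)."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(max_ping_payload(mtu)), host]


for mtu in (9000, 9100):
    print(mtu, "->", max_ping_payload(mtu))
```

A payload of 9072 corresponds to an MTU of 9100, so a successful `ping -M do -s 9072` implies every hop on that path carries at least 9100 bytes. An interface at 9000 would top out at a payload of 8972.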
