Deployment fails with MariaDB on kolla-ansible 15.0.1.dev1

Bug #2002465 reported by Jose Gaitan
This bug affects 2 people
Affects: kolla-ansible
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Deployment of kolla-ansible 15.0.1.dev1 fails on the following task:

RUNNING HANDLER [mariadb : Wait for MariaDB service port liveness] **********************************************************************************************************************
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (10 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (9 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (8 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (7 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (6 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (5 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (4 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (3 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (2 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (1 retries left).
fatal: [inf2-mia]: FAILED! => {"attempts": 10, "changed": false, "elapsed": 60, "msg": "Timeout when waiting for search string MariaDB in 10.10.36.12:3306"}
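
For reference, the check this handler performs can be approximated manually from the deployment host; the IP and port come from the failure message above, while nc and the 5-second limit are illustrative assumptions:
```
# Minimal sketch of what the failing handler waits for: does whatever answers
# on 10.10.36.12:3306 send a handshake that contains the string "MariaDB"?
# nc availability and the 5-second limit are assumptions; IP/port come from the error above.
if timeout 5 nc 10.10.36.12 3306 | grep -aq MariaDB; then
    echo "MariaDB handshake received on 10.10.36.12:3306"
else
    echo "no MariaDB handshake within 5 seconds"
fi
```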

===
Environment:
- multinode
- Target Hosts:
  - Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-57-generic x86_64)
  - 3x controllers
  - 2x compute
  - Docker version 20.10.22, build 3a2c30b

Deployment host:
- venv: pip install git+https://opendev.org/openstack/kolla-ansible@stable/zed

===
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8440b4be0d6b quay.io/openstack.kolla/mariadb-server:zed-ubuntu-jammy "dumb-init -- kolla_…" 27 minutes ago Up 32 seconds (health: starting) mariadb
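
The container above is stuck in "health: starting"; one way to see what the health check itself is reporting, sketched here assuming the standard Docker CLI (the container name is taken from the listing above):
```
# Dump the health-check state of the mariadb container shown above.
docker inspect --format '{{json .State.Health}}' mariadb | python3 -m json.tool
```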

Revision history for this message
Michal Nasiadka (mnasiadka) wrote:

A lot of connection timeouts in MariaDB logs - please check your network.

Changed in kolla-ansible:
status: New → Invalid
Revision history for this message
Boris Lukashev (rageltman) wrote (last edit):

I am seeing the same thing on 22.04: mariadbd listens on the node's IP on port 3306 when bootstrapped on the initial node, but on the subsequent nodes the only listener on 3306 is haproxy on the shared VIP; mariadbd never reaches a listening state, even after restarting the container.
Path MTU checks out in my cluster, there are no network errors, and I have an identical Xena setup on 20.04 running next to it, so I'm fairly sure it's not the network.
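
A quick way to confirm the listening-state difference described above is to check the sockets on each controller; the node names match the Galera member names that show up later in the logs, and ssh/ss availability on the hosts is assumed:
```
# On each controller, show what (if anything) is bound to 3306 in the host
# network namespace (the kolla mariadb container runs with host networking).
for host in mgr00 mgr01 mgr02; do
    echo "== ${host} =="
    ssh "${host}" "ss -lntp | grep ':3306 ' || echo 'nothing listening on 3306'"
done
```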

The slaves don't complete their SST pull:
```
230507 13:41:37 mysqld_safe WSREP: Running position recovery with --disable-log-error --pid-file='/var/lib/mysql//svsc-osm01-recover.pid'
2023-05-07 13:41:38 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.217.122.11:4444' --datadir '/var/lib/mysql/' --parent 231 --progress 0 --binlog 'mysql-bin' --mysqld-args --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/usr/lib/mysql/plugin --wsrep_provider=/usr/lib/galera/libgalera_smm.so --wsrep_on=ON --log-error=/var/log/kolla/mariadb/mariadb.log --pid-file=/var/lib/mysql/mariadb.pid --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1'
WSREP_SST: [ERROR] Possible timeout in receiving first data from donor in gtid stage: exit codes: 124 0 (20230507 13:46:39.211)
WSREP_SST: [ERROR] Cleanup after exit with status: 32 (20230507 13:46:39.214)

```
but no error is reported on the first node bootstrapped (the supposed donor). Once the two slaves fail, Ansible restarts the master, and it spits out:
```
2023-05-07 14:05:41 0 [Warning] WSREP: No re-merged primary component found.
2023-05-07 14:05:41 0 [Warning] WSREP: No bootstrapped primary component found.
2023-05-07 14:05:41 0 [ERROR] WSREP: ./gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum.
```
resulting in a dead deployment; attempting to recover mariadb has no effect.
It almost looks like there _isn't any data to replicate_ to begin with on the bootstrapping node...
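
(The recovery attempt referred to above presumably means kolla-ansible's own Galera recovery playbook; a minimal sketch, with the inventory path as an assumption:)
```
# kolla-ansible's built-in Galera recovery run; the inventory path is an assumption.
kolla-ansible -i ./multinode mariadb_recovery
```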

On Xena/20.04, we had to set enable_mariadb_clustercheck: "yes" for Ubuntu to work properly; here, it has no effect.

The failure occurs within mariadb's cluster setup, in master->slave replication, even though the metal these containers run on can `ping -M do -s 9072` each other (NIC MTUs are between 9000 and 9100). There are no firewalls running on the hosts that could block communication between the mariadb containers.
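
The jumbo-frame check mentioned above, spelled out as a sketch (controller IPs are taken from the Galera logs in this report; payload size from the command quoted above):
```
# Verify that full-size, non-fragmented jumbo frames pass between controllers.
for peer in 10.217.122.10 10.217.122.11 10.217.122.12; do
    ping -M do -s 9072 -c 3 "${peer}"
done
```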

Revision history for this message
Boris Lukashev (rageltman) wrote (last edit):

Digging further, it does appear that the initially bootstrapped master is broken somehow:
```
2023-05-07 15:06:46 0 [Warning] WSREP: Quorum: No node with complete state:

 Version : 6
 Flags : 0x1
 Protocols : 2 / 10 / 4
 State : NON-PRIMARY
 Desync count : 0
 Prim state : NON-PRIMARY
 Prim UUID : 00000000-0000-0000-0000-000000000000
 Prim seqno : -1
 First seqno : -1
 Last seqno : -1
 Commit cut : -1
 Last vote : -1.0
 Vote policy : 0
 Prim JOINED : 0
 State UUID : c91ea93c-ece8-11ed-bcbd-d37542ae8d5e
 Group UUID : 00000000-0000-0000-0000-000000000000
 Name : 'mgr02'
 Incoming addr: '10.217.122.12:3306'

 Version : 6
 Flags : 0x2
 Protocols : 2 / 10 / 4
 State : NON-PRIMARY
 Desync count : 0
 Prim state : SYNCED
 Prim UUID : 5d5b432a-ece8-11ed-9285-be6bf8d043bf
 Prim seqno : 4
 First seqno : 1
 Last seqno : 4
 Commit cut : 0
 Last vote : -1.0
 Vote policy : 0
 Prim JOINED : 1
 State UUID : c91ea93c-ece8-11ed-bcbd-d37542ae8d5e
 Group UUID : 78a25580-ece7-11ed-863d-5af2d3394150
 Name : 'mgr00'
 Incoming addr: '10.217.122.10:3306'
```
^^ this is what happens after the replication error on the initial bootstrap of a slave, once that slave is restarted.

The donor, being restarted after the two slaves fail to come up, is aware of them but can't figure out which one is primary:
```
2023-05-07 20:47:51 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6f57006e-ed18-11ed-9165-5f46e7399b53 from 1 (mgr00)
2023-05-07 20:47:51 0 [Warning] WSREP: Quorum: No node with complete state:

 Version : 6
 Flags : 0x1
 Protocols : 2 / 10 / 4
 State : NON-PRIMARY
 Desync count : 0
 Prim state : PRIMARY
 Prim UUID : cc9462f9-ed16-11ed-8544-d3da5d985617
 Prim seqno : 5
 First seqno : -1
 Last seqno : 5
 Commit cut : 0
 Last vote : -1.0
 Vote policy : 0
 Prim JOINED : 1
 State UUID : 6f57006e-ed18-11ed-9165-5f46e7399b53
 Group UUID : 2d143fa4-ed15-11ed-b1b9-03a39ef3932a
 Name : 'mgr01'
 Incoming addr: '10.217.122.11:3306'

 Version : 6
 Flags : 00
 Protocols : 2 / 10 / 4
 State : NON-PRIMARY
 Desync count : 0
 Prim state : NON-PRIMARY
 Prim UUID : 00000000-0000-0000-0000-000000000000
 Prim seqno : -1
 First seqno : 1
 Last seqno : 5
 Commit cut : 5
 Last vote : -1.0
 Vote policy : 0
 Prim JOINED : 0
 State UUID : 6f57006e-ed18-11ed-9165-5f46e7399b53
 Group UUID : 2d143fa4-ed15-11ed-b1b9-03a39ef3932a
 Name : 'mgr00'
 Incoming addr: '10.217.122.10:3306'

 Version : 6
 Flags : 00
 Protocols : 2 / 10 / 4
 State : NON-PRIMARY
 Desync count : 0
 Prim state : PRIMARY
 Prim UUID : cc9462f9-ed16-11ed-8544-d3da5d985617
 Prim seqno : 5
 First seqno : -1
 Last seqno : 5
 Commit cut : 0
 Last vote : -1.0
 Vote policy : 0
 Prim JOINED : 1
 State UUID : 6f57006e-ed18-11ed-9165-5f46e7399b53
 Group UUID : 2d143fa4-ed15-11ed-b1b9-03a39ef3932a
 Name : 'mgr02'
 Incoming addr: '10.217.122.12:3306'

2023-05-07 20:47:51 0 [Warning] WSREP: No re-merged primary component found.
2023-05-07 20:47:51 0 [Warning] WSREP: No bootstrapped primary component found....


Revision history for this message
Boris Lukashev (rageltman) wrote (last edit):

Further down the rabbit hole, it seems that the container image doesn't matter: ubuntu or rocky, same effect.
Forcing {% set sst_method = 'rsync' %} in galera.conf.j2 doesn't help either.
Manual netcat connection checks on all of the involved ports succeed (a sketch of those checks follows the tcpdump output below).
Dropping the MTU on the IP interface to 1500 does not help, nor does setting the MTU in docker's daemon.json.
Iperf runs between the nodes show full bandwidth, and there are no error counters on the NICs or switch ports (bonds).
tcpdump observation of the SST ports shows _no traffic at all_ during the bootstrap:
```
# tcpdump -vvv -n -i br-bond0-222 port 4444 or port 4568
tcpdump: listening on br-bond0-222, link-type EN10MB (Ethernet), snapshot length 262144 bytes

```
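
For completeness, the netcat checks mentioned above looked roughly like this; port 4567 (Galera group communication) is assumed from the standard defaults, while the other ports appear in the logs in this report:
```
# Probe the MariaDB/Galera ports on every controller. 3306 = MySQL,
# 4444 = SST, 4567 = group communication (assumed default), 4568 = IST.
for host in 10.217.122.10 10.217.122.11 10.217.122.12; do
    for port in 3306 4444 4567 4568; do
        nc -zv -w 3 "${host}" "${port}"
    done
done
```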

MAAS did deploy the nodes with multiple routing tables in the netplan, but they all match up and the connectivity tests pass with flying colors.

Something about 22.04, or about how kolla-ansible configures it, seems to prevent seeding of the mariadb slaves, which breaks the whole stack. I'm having trouble coming up with a rational explanation for why no data at all is seen between the donor and the joiner.

Revision history for this message
Sven Kieske (s-kieske) wrote:

Can you check whether you have the backport of this patch? https://review.opendev.org/c/openstack/kolla-ansible/+/839715

It changed the way the OVS healthcheck works, and a user reported on IRC that they hit exactly the same issue as you without the above patch.
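
(One simple way to rule this out is to reinstall the deployment venv from the current tip of stable/zed, which will include the backport if it has merged there; this mirrors the install command from the report:)
```
# Refresh the venv from the tip of stable/zed so any merged backports are picked up.
pip install --upgrade --force-reinstall \
    git+https://opendev.org/openstack/kolla-ansible@stable/zed
```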
