Deployment fails with MariaDB on kolla-ansible 15.0.1.dev1

Bug #2002465 reported by Jose Gaitan
This bug affects 2 people
Affects: kolla-ansible
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Deployment of kolla-ansible 15.0.1.dev1 fails on the following task:

RUNNING HANDLER [mariadb : Wait for MariaDB service port liveness] **********************************************************************************************************************
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (10 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (9 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (8 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (7 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (6 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (5 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (4 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (3 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (2 retries left).
FAILED - RETRYING: [inf2-mia]: Wait for MariaDB service port liveness (1 retries left).
fatal: [inf2-mia]: FAILED! => {"attempts": 10, "changed": false, "elapsed": 60, "msg": "Timeout when waiting for search string MariaDB in 10.10.36.12:3306"}
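
For reference, the check this handler performs can be approximated manually from the deployment host; the IP and port come from the failure message above, while nc and the 5-second limit are illustrative assumptions:
```
# Minimal sketch of what the failing handler waits for: does whatever answers
# on 10.10.36.12:3306 send a handshake that contains the string "MariaDB"?
# nc availability and the 5-second limit are assumptions; IP/port come from the error above.
if timeout 5 nc 10.10.36.12 3306 | grep -aq MariaDB; then
    echo "MariaDB handshake received on 10.10.36.12:3306"
else
    echo "no MariaDB handshake within 5 seconds"
fi
```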

===
Environment:
- multinode
- Target Hosts:
  - Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-57-generic x86_64)
  - 3x controllers
  - 2x compute
  - Docker version 20.10.22, build 3a2c30b

Deployment host:
- venv: pip install git+https://opendev.org/openstack/kolla-ansible@stable/zed

===
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8440b4be0d6b quay.io/openstack.kolla/mariadb-server:zed-ubuntu-jammy "dumb-init -- kolla_…" 27 minutes ago Up 32 seconds (health: starting) mariadb
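
The container above is stuck in "health: starting"; one way to see what the health check itself is reporting, sketched here assuming the standard Docker CLI (the container name is taken from the listing above):
```
# Dump the health-check state of the mariadb container shown above.
docker inspect --format '{{json .State.Health}}' mariadb | python3 -m json.tool
```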

Revision history for this message
Michal Nasiadka (mnasiadka) wrote:

A lot of connection timeouts in MariaDB logs - please check your network.

Changed in kolla-ansible:
status: New → Invalid
Revision history for this message
Boris Lukashev (rageltman) wrote (last edit):

I am seeing the same thing on 22.04: mariadbd listens on the node's IP on port 3306 when bootstrapped on the initial node, but on the subsequent nodes the only listener on 3306 is haproxy on the shared VIP; mariadbd never reaches a listening state, even after restarting the container.
Path MTU checks out in my cluster, there are no network errors, and I have an identical Xena setup on 20.04 running next to it, so I'm fairly sure it's not the network.
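
A quick way to confirm the listening-state difference described above is to check the sockets on each controller; the node names match the Galera member names that show up later in the logs, and ssh/ss availability on the hosts is assumed:
```
# On each controller, show what (if anything) is bound to 3306 in the host
# network namespace (the kolla mariadb container runs with host networking).
for host in mgr00 mgr01 mgr02; do
    echo "== ${host} =="
    ssh "${host}" "ss -lntp | grep ':3306 ' || echo 'nothing listening on 3306'"
done
```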

The slaves don't complete their SST pull:
```
230507 13:41:37 mysqld_safe WSREP: Running position recovery with --disable-log-error --pid-file='/var/lib/mysql//svsc-osm01-recover.pid'
2023-05-07 13:41:38 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.217.122.11:4444' --datadir '/var/lib/mysql/' --parent 231 --progress 0 --binlog 'mysql-bin' --mysqld-args --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/usr/lib/mysql/plugin --wsrep_provider=/usr/lib/galera/libgalera_smm.so --wsrep_on=ON --log-error=/var/log/kolla/mariadb/mariadb.log --pid-file=/var/lib/mysql/mariadb.pid --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1'
WSREP_SST: [ERROR] Possible timeout in receiving first data from donor in gtid stage: exit codes: 124 0 (20230507 13:46:39.211)
WSREP_SST: [ERROR] Cleanup after exit with status: 32 (20230507 13:46:39.214)

```
but no error is reported on the first node bootstrapped (the supposed donor). Once the two slaves fail, Ansible restarts the master, and it spits out:
```
2023-05-07 14:05:41 0 [Warning] WSREP: No re-merged primary component found.
2023-05-07 14:05:41 0 [Warning] WSREP: No bootstrapped primary component found.
2023-05-07 14:05:41 0 [ERROR] WSREP: ./gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum.
```
resulting in a dead deployment; attempting to recover mariadb has no effect.
It almost looks like there _isn't any data to replicate_ to begin with on the bootstrapping node...
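
(The recovery attempt referred to above presumably means kolla-ansible's own Galera recovery playbook; a minimal sketch, with the inventory path as an assumption:)
```
# kolla-ansible's built-in Galera recovery run; the inventory path is an assumption.
kolla-ansible -i ./multinode mariadb_recovery
```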

On Xena/20.04, we had to set enable_mariadb_clustercheck: "yes" for Ubuntu to work properly; here, it has no effect.

The failure occurs within mariadb's cluster setup, in master->slave replication, even though the metal these containers run on can `ping -M do -s 9072` each other (NIC MTUs are between 9000 and 9100). There are no firewalls running on the hosts that could block communication between the mariadb containers.
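
The jumbo-frame check mentioned above, spelled out as a sketch (controller IPs are taken from the Galera logs in this report; payload size from the command quoted above):
```
# Verify that full-size, non-fragmented jumbo frames pass between controllers.
for peer in 10.217.122.10 10.217.122.11 10.217.122.12; do
    ping -M do -s 9072 -c 3 "${peer}"
done
```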

Revision history for this message
Boris Lukashev (rageltman) wrote (last edit):

Digging further, it does appear that the initially bootstrapped master is broken somehow:
```
2023-05-07 15:06:46 0 [Warning] WSREP: Quorum: No node with complete state:

 Version : 6
 Flags : 0x1
 Protocols : 2 / 10 / 4
 State : NON-PRIMARY
 Desync count : 0
 Prim state : NON-PRIMARY
 Prim UUID : 00000000-0000-0000-0000-000000000000
 Prim seqno : -1
 First seqno : -1
 Last seqno : -1
 Commit cut : -1
 Last vote : -1.0
 Vote policy : 0
 Prim JOINED : 0
 State UUID : c91ea93c-ece8-11ed-bcbd-d37542ae8d5e
 Group UUID : 00000000-0000-0000-0000-000000000000
 Name : 'mgr02'
 Incoming addr: '10.217.122.12:3306'

 Version : 6
 Flags : 0x2
 Protocols : 2 / 10 / 4
 State : NON-PRIMARY
 Desync count : 0
 Prim state : SYNCED
 Prim UUID : 5d5b432a-ece8-11ed-9285-be6bf8d043bf
 Prim seqno : 4
 First seqno : 1
 Last seqno : 4
 Commit cut : 0
 Last vote : -1.0
 Vote policy : 0
 Prim JOINED : 1
 State UUID : c91ea93c-ece8-11ed-bcbd-d37542ae8d5e
 Group UUID : 78a25580-ece7-11ed-863d-5af2d3394150
 Name : 'mgr00'
 Incoming addr: '10.217.122.10:3306'
```
^^ this is what happens after the replication error on the initial bootstrap of a slave, once that slave is restarted.

The donor, being restarted after the two slaves fail to come up, is aware of them but can't figure out which one is primary:
```
2023-05-07 20:47:51 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6f57006e-ed18-11ed-9165-5f46e7399b53 from 1 (mgr00)
2023-05-07 20:47:51 0 [Warning] WSREP: Quorum: No node with complete state:

 Version : 6
 Flags : 0x1
 Protocols : 2 / 10 / 4
 State : NON-PRIMARY
 Desync count : 0
 Prim state : PRIMARY
 Prim UUID : cc9462f9-ed16-11ed-8544-d3da5d985617
 Prim seqno : 5
 First seqno : -1
 Last seqno : 5
 Commit cut : 0
 Last vote : -1.0
 Vote policy : 0
 Prim JOINED : 1
 State UUID : 6f57006e-ed18-11ed-9165-5f46e7399b53
 Group UUID : 2d143fa4-ed15-11ed-b1b9-03a39ef3932a
 Name : 'mgr01'
 Incoming addr: '10.217.122.11:3306'

 Version : 6
 Flags : 00
 Protocols : 2 / 10 / 4
 State : NON-PRIMARY
 Desync count : 0
 Prim state : NON-PRIMARY
 Prim UUID : 00000000-0000-0000-0000-000000000000
 Prim seqno : -1
 First seqno : 1
 Last seqno : 5
 Commit cut : 5
 Last vote : -1.0
 Vote policy : 0
 Prim JOINED : 0
 State UUID : 6f57006e-ed18-11ed-9165-5f46e7399b53
 Group UUID : 2d143fa4-ed15-11ed-b1b9-03a39ef3932a
 Name : 'mgr00'
 Incoming addr: '10.217.122.10:3306'

 Version : 6
 Flags : 00
 Protocols : 2 / 10 / 4
 State : NON-PRIMARY
 Desync count : 0
 Prim state : PRIMARY
 Prim UUID : cc9462f9-ed16-11ed-8544-d3da5d985617
 Prim seqno : 5
 First seqno : -1
 Last seqno : 5
 Commit cut : 0
 Last vote : -1.0
 Vote policy : 0
 Prim JOINED : 1
 State UUID : 6f57006e-ed18-11ed-9165-5f46e7399b53
 Group UUID : 2d143fa4-ed15-11ed-b1b9-03a39ef3932a
 Name : 'mgr02'
 Incoming addr: '10.217.122.12:3306'

2023-05-07 20:47:51 0 [Warning] WSREP: No re-merged primary component found.
2023-05-07 20:47:51 0 [Warning] WSREP: No bootstrapped primary component found....


Revision history for this message
Boris Lukashev (rageltman) wrote (last edit):

Further down the rabbit hole, it seems that the container image doesn't matter: ubuntu or rocky, same effect.
Forcing {% set sst_method = 'rsync' %} in galera.conf.j2 doesn't help either.
Manual netcat connection checks on all of the involved ports succeed (a sketch of those checks follows the tcpdump output below).
Dropping the MTU on the IP interface to 1500 does not help, nor does setting the MTU in docker's daemon.json.
Iperf runs between the nodes show full bandwidth, and there are no error counters on the NICs or switch ports (bonds).
tcpdump observation of the SST ports shows _no traffic at all_ during the bootstrap:
```
# tcpdump -vvv -n -i br-bond0-222 port 4444 or port 4568
tcpdump: listening on br-bond0-222, link-type EN10MB (Ethernet), snapshot length 262144 bytes

```
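
For completeness, the netcat checks mentioned above looked roughly like this; port 4567 (Galera group communication) is assumed from the standard defaults, while the other ports appear in the logs in this report:
```
# Probe the MariaDB/Galera ports on every controller. 3306 = MySQL,
# 4444 = SST, 4567 = group communication (assumed default), 4568 = IST.
for host in 10.217.122.10 10.217.122.11 10.217.122.12; do
    for port in 3306 4444 4567 4568; do
        nc -zv -w 3 "${host}" "${port}"
    done
done
```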

MAAS did deploy the nodes with multiple routing tables in the netplan, but they all match up and the connectivity tests pass with flying colors.

Something about 22.04, or about how kolla-ansible configures it, seems to prevent seeding of the mariadb slaves, which breaks the whole stack. I'm having trouble coming up with a rational explanation for why no data at all is seen between the donor and the joiner.

Revision history for this message
Sven Kieske (s-kieske) wrote:

Can you check whether you have the backport of this patch? https://review.opendev.org/c/openstack/kolla-ansible/+/839715

It changed the way the OVS healthcheck works, and a user reported on IRC that they hit exactly the same issue as you without the above patch.
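
(One simple way to rule this out is to reinstall the deployment venv from the current tip of stable/zed, which will include the backport if it has merged there; this mirrors the install command from the report:)
```
# Refresh the venv from the tip of stable/zed so any merged backports are picked up.
pip install --upgrade --force-reinstall \
    git+https://opendev.org/openstack/kolla-ansible@stable/zed
```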
