Comment 10 for bug 1712087

Dmitry Rachkov (dmitry-rachkov) wrote :

Hi Mark,

Thanks for replying and trying to help!

I will try to give more details, which will hopefully describe the setup and
symptoms better.

ENV:
multinode (tried with 4 nodes, 2 of which are controllers, and with 5 nodes,
2 of which are controllers)
kolla_base_distro: "centos"
kolla_install_type: "source"
openstack_release: "ussuri"
pip3 freeze | grep kolla
kolla==10.1.0
kolla-ansible==10.1.0

ISSUE AND SYMPTOMS:

Description: kolla-ansible fails at the MariaDB port liveness check and does
not bring up the DB cluster properly. I have tried multiple times from a clean
state (destroying everything and restarting docker before starting over), and
from time to time I get different errors, though always at the same step: the
MariaDB service port liveness check.
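
To make clear what I understand this check to be doing, here is a minimal Python sketch of it. It is my own approximation based on the error text in the log further down ("Timeout when waiting for search string MariaDB", "elapsed": 10), not the actual kolla-ansible task. The function name and defaults are hypothetical.

```python
import socket
import time


def port_banner_matches(host, port, needle=b"MariaDB", timeout=10.0):
    """Return True if host:port sends bytes containing `needle` within `timeout` seconds.

    Hypothetical helper: approximates a "wait for search string on a port"
    check. Any connection error or timeout counts as "not alive".
    """
    deadline = time.monotonic() + timeout
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = b""
            while needle not in data:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # ran out of time before the string appeared
                sock.settimeout(remaining)
                chunk = sock.recv(4096)
                if not chunk:
                    return False  # peer closed without sending the string
                data += chunk
            return True
    except OSError:
        # Covers refused connections, resets and socket timeouts.
        return False
```

Calling something like port_banner_matches("192.168.1.6", 3306) from the deploy host should return False whenever no mariadb container is listening, which matches what I see.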

1) The one I explained in my previous post, with python "[Errno 104] Connection
reset by peer\r\n", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact
error", "rc": 1}". Investigation shows that one node has its mariadb container
up and running, but the other does not even have the image pulled, so it is no
surprise that cluster quorum fails. What is not clear is why the second
controller node does not have the image pulled and the container started.

2) This one is again at the MariaDB port liveness check, but with a different
error and overall situation. It appears that none of the controller nodes have
the images pulled, hence kolla-ansible is not able to start the cluster and
gives:
"2020-10-20 08:11:52,517 p=151140 u=lineng n=ansible | fatal: [osce3]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
2020-10-20 08:11:52,551 p=151140 u=lineng n=ansible | fatal: [osce4]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}"

 - Continuing with "kolla-ansible -i multinode mariadb_recovery" gives similar
results:

"TASK [mariadb : Stop MariaDB containers] ***************************************
fatal: [osce3]: FAILED! => {"changed": false, "msg": "No such container: mariadb to stop"}"
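
For anyone who wants to repeat the "is the image pulled?" check on each controller, this is roughly how I compare the local image list against the images named in the ansible log below. It is only a sketch: the helper names are mine, and it assumes the docker CLI is on PATH and run by a user allowed to talk to the daemon.

```python
import subprocess

# Image names taken from the "Check mariadb containers" items in the log.
EXPECTED_IMAGES = [
    "kolla/centos-source-mariadb:ussuri",
    "kolla/centos-source-mariadb-clustercheck:ussuri",
]


def missing_images(listing, expected):
    """Return the expected images that do not appear in `listing`.

    `listing` is text with one "repository:tag" entry per line.
    """
    present = {line.strip() for line in listing.splitlines() if line.strip()}
    return [image for image in expected if image not in present]


def local_image_listing():
    """Return "repository:tag" lines for the images on this host.

    Assumption: the docker CLI is installed and usable by the current user.
    """
    result = subprocess.run(
        ["docker", "images", "--format", "{{.Repository}}:{{.Tag}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Running missing_images(local_image_listing(), EXPECTED_IMAGES) on a node returns an empty list when both images are present. On my controllers in this scenario I would expect it to return both names.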

Docker inspect shows on both nodes:
 docker inspect mariadb
[
    {
        "CreatedAt": "2020-10-20T08:11:42+03:00",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/mariadb/_data",
        "Name": "mariadb",
        "Options": null,
        "Scope": "local"
    }
]
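
Note that the object returned above has "Mountpoint", "Driver" and "Scope" fields, so it looks like the mariadb volume rather than a container. A small Python sketch to tell the two apart from inspect output follows. The classification rules are my own heuristic based on the fields docker normally returns, and the function names are hypothetical.

```python
import json


def classify_inspect_entry(entry):
    """Guess which kind of docker object an inspect entry describes.

    Heuristic (an assumption, not an official rule): volumes carry
    "Mountpoint", containers carry "State" and "Config", images carry
    "RepoTags".
    """
    if "Mountpoint" in entry and "State" not in entry:
        return "volume"
    if "State" in entry and "Config" in entry:
        return "container"
    if "RepoTags" in entry:
        return "image"
    return "unknown"


def summarize(inspect_json):
    """Return (name, kind) pairs for a `docker inspect` JSON array."""
    return [
        (entry.get("Name", "?"), classify_inspect_entry(entry))
        for entry in json.loads(inspect_json)
    ]


# The output pasted above, verbatim.
SAMPLE = """
[
    {
        "CreatedAt": "2020-10-20T08:11:42+03:00",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/mariadb/_data",
        "Name": "mariadb",
        "Options": null,
        "Scope": "local"
    }
]
"""

print(summarize(SAMPLE))  # [('mariadb', 'volume')]
```

If the role decides that a cluster "exists" from the presence of this volume (my guess from the task name "Divide hosts by their MariaDB volume availability"), then a volume without a running container would be consistent with the "exists but is stopped" message.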

I also attach below a timeline of the ansible log covering the mariadb role
execution for issue #2, which highlights the strange absence of a kolla_docker
pull action for the mariadb role.

ANSIBLE_LOG:

- Folders and configs only

2020-10-20 08:11:25,033 p=151140 u=lineng n=ansible | TASK [mariadb : Ensuring config directories exist] *****************************
2020-10-20 08:11:26,755 p=151140 u=lineng n=ansible | TASK [mariadb : Ensuring database backup config directory exists] **************
2020-10-20 08:11:26,941 p=151140 u=lineng n=ansible | TASK [mariadb : Copying over my.cnf for mariabackup] ***************************
2020-10-20 08:11:27,128 p=151140 u=lineng n=ansible | TASK [mariadb : Copying over config.json files for services] *******************
2020-10-20 08:11:29,754 p=151140 u=lineng n=ansible | TASK [mariadb : Copying over config.json files for mariabackup] ****************
2020-10-20 08:11:29,941 p=151140 u=lineng n=ansible | TASK [mariadb : Copying over galera.cnf] ***************************************
2020-10-20 08:11:32,342 p=151140 u=lineng n=ansible | TASK [mariadb : Copying over wsrep-notify.sh] **********************************
2020-10-20 08:11:35,117 p=151140 u=lineng n=ansible | TASK [mariadb : Copying over xinetd clustercheck.conf] *************************

- Check containers afterwards, but based on the source code this still does not do any pulling

2020-10-20 08:32:19,005 p=161054 u=lineng n=ansible | TASK [Check mariadb containers]
2020-10-20 08:11:38,589 p=151140 u=lineng n=ansible | changed: [osce3] => (item={'key': 'mariadb', 'value': {'container_name': 'mariadb', 'group': 'mariadb', 'enabled': True, 'image': 'kolla/centos-source-mariadb:ussuri', 'volumes': ['/etc/kolla/mariadb/:/var/lib/kolla/config_files/:ro', '/etc/localtime:/etc/localtime:ro', '', 'mariadb:/var/lib/mysql', 'kolla_logs:/var/log/kolla/'], 'dimensions': {}, 'haproxy': {'mariadb': {'enabled': True, 'mode': 'tcp', 'port': '3306', 'listen_port': '3306', 'frontend_tcp_extra': ['option clitcpka', 'timeout client 3600s'], 'backend_tcp_extra': ['option srvtcpka', 'timeout server 3600s', 'option httpchk'], 'custom_member_list': ['server osce3 192.168.1.6:3306 check port 4569 inter 2000 rise 2 fall 5', 'server osce4 192.168.1.7:3306 check port 4569 inter 2000 rise 2 fall 5 backup', '']}, 'mariadb_external_lb': {'enabled': False, 'mode': 'tcp', 'port': '3306', 'listen_port': '3306', 'frontend_tcp_extra': ['option clitcpka', 'timeout client 3600s'], 'backend_tcp_extra': ['option srvtcpka', 'timeout server 3600s'], 'custom_member_list': ['server osce3 osce3:3306 check port 4569 inter 2000 rise 2 fall 5', 'server osce4 osce4:3306 check port 4569 inter 2000 rise 2 fall 5 backup', '']}}}})
2020-10-20 08:11:38,759 p=151140 u=lineng n=ansible | changed: [osce4] => (item={'key': 'mariadb', 'value': {'container_name': 'mariadb', 'group': 'mariadb', 'enabled': True, 'image': 'kolla/centos-source-mariadb:ussuri', 'volumes': ['/etc/kolla/mariadb/:/var/lib/kolla/config_files/:ro', '/etc/localtime:/etc/localtime:ro', '', 'mariadb:/var/lib/mysql', 'kolla_logs:/var/log/kolla/'], 'dimensions': {}, 'haproxy': {'mariadb': {'enabled': True, 'mode': 'tcp', 'port': '3306', 'listen_port': '3306', 'frontend_tcp_extra': ['option clitcpka', 'timeout client 3600s'], 'backend_tcp_extra': ['option srvtcpka', 'timeout server 3600s', 'option httpchk'], 'custom_member_list': ['server osce3 192.168.1.6:3306 check port 4569 inter 2000 rise 2 fall 5', 'server osce4 192.168.1.7:3306 check port 4569 inter 2000 rise 2 fall 5 backup', '']}, 'mariadb_external_lb': {'enabled': False, 'mode': 'tcp', 'port': '3306', 'listen_port': '3306', 'frontend_tcp_extra': ['option clitcpka', 'timeout client 3600s'], 'backend_tcp_extra': ['option srvtcpka', 'timeout server 3600s'], 'custom_member_list': ['server osce3 osce3:3306 check port 4569 inter 2000 rise 2 fall 5', 'server osce4 osce4:3306 check port 4569 inter 2000 rise 2 fall 5 backup', '']}}}})
2020-10-20 08:11:39,205 p=151140 u=lineng n=ansible | changed: [osce3] => (item={'key': 'mariadb-clustercheck', 'value': {'container_name': 'mariadb_clustercheck', 'group': 'mariadb', 'enabled': True, 'image': 'kolla/centos-source-mariadb-clustercheck:ussuri', 'volumes': ['/etc/kolla/mariadb-clustercheck/:/var/lib/kolla/config_files/:ro', '/etc/localtime:/etc/localtime:ro', '', 'kolla_logs:/var/log/kolla/'], 'dimensions': {}, 'environment': {'MYSQL_USERNAME': 'haproxy', 'MYSQL_PASSWORD': '', 'MYSQL_HOST': '192.168.1.6', 'AVAILABLE_WHEN_DONOR': '1'}}})

- Finishing up with more docker tasks for volumes

2020-10-20 08:11:40,187 p=151140 u=lineng n=ansible | TASK [mariadb : Create MariaDB volume] *****************************************
2020-10-20 08:11:40,988 p=151140 u=lineng n=ansible | TASK [mariadb : Divide hosts by their MariaDB volume availability] *************
2020-10-20 08:11:41,181 p=151140 u=lineng n=ansible | TASK [mariadb : Establish whether the cluster has already existed] *************

- And service port liveness

2020-10-20 08:11:41,385 p=151140 u=lineng n=ansible | TASK [mariadb : Check MariaDB service port liveness] ***************************
2020-10-20 08:11:51,989 p=151140 u=lineng n=ansible | fatal: [osce3]: FAILED! => {"changed": false, "elapsed": 10, "msg": "Timeout when waiting for search string MariaDB in 192.168.1.6:3306"}
2020-10-20 08:11:51,989 p=151140 u=lineng n=ansible | ...ignoring
2020-10-20 08:11:52,145 p=151140 u=lineng n=ansible | fatal: [osce4]: FAILED! => {"changed": false, "elapsed": 10, "msg": "Timeout when waiting for search string MariaDB in 192.168.1.7:3306"}
2020-10-20 08:11:52,146 p=151140 u=lineng n=ansible | ...ignoring
2020-10-20 08:11:52,225 p=151140 u=lineng n=ansible | TASK [mariadb : Divide hosts by their MariaDB service port liveness] ***********
2020-10-20 08:11:52,319 p=151140 u=lineng n=ansible | changed: [osce3]
2020-10-20 08:11:52,344 p=151140 u=lineng n=ansible | changed: [osce4]
2020-10-20 08:11:52,424 p=151140 u=lineng n=ansible | TASK [mariadb : Fail on existing but stopped cluster] **************************
2020-10-20 08:11:52,517 p=151140 u=lineng n=ansible | fatal: [osce3]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
2020-10-20 08:11:52,551 p=151140 u=lineng n=ansible | fatal: [osce4]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
2020-10-20 08:11:52,552 p=151140 u=lineng n=ansible | RUNNING HANDLER [mariadb : Restart MariaDB on existing cluster members] ********
2020-10-20 08:11:52,552 p=151140 u=lineng n=ansible | RUNNING HANDLER [mariadb : Start MariaDB on new nodes] *************************
2020-10-20 08:11:52,553 p=151140 u=lineng n=ansible | RUNNING HANDLER [Restart mariadb-clustercheck container] ***********************
2020-10-20 08:11:52,554 p=151140 u=lineng n=ansible | PLAY RECAP *********************************************************************
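
In case it helps to reproduce how I looked for a pull step, this is a rough Python sketch that lists the task and handler names from an ansible log in this format and checks whether any of them mentions "pull". The regular expression and helper names are mine, and whether a dedicated pull task should appear in a deploy run at all is an assumption I have not confirmed.

```python
import re

# Matches the name inside "| TASK [...]" or "| RUNNING HANDLER [...]".
TASK_RE = re.compile(r"\| (?:TASK|RUNNING HANDLER) \[(?P<name>.+?)\]")


def task_names(log_text):
    """Return task and handler names in the order they appear in the log."""
    return [match.group("name") for match in TASK_RE.finditer(log_text)]


def mentions_pull(names):
    """Return True if any task or handler name contains the word "pull"."""
    return any("pull" in name.lower() for name in names)


# Three lines copied from the log above.
SAMPLE_LOG = """\
2020-10-20 08:11:40,187 p=151140 u=lineng n=ansible | TASK [mariadb : Create MariaDB volume] *****
2020-10-20 08:11:41,385 p=151140 u=lineng n=ansible | TASK [mariadb : Check MariaDB service port liveness] *****
2020-10-20 08:11:52,552 p=151140 u=lineng n=ansible | RUNNING HANDLER [mariadb : Start MariaDB on new nodes] *****
"""

names = task_names(SAMPLE_LOG)
for name in names:
    print(name)
print("pull task seen:", mentions_pull(names))
```

Feeding the full ansible.log to task_names() gives the same sequence as the excerpt above, with no name that mentions pulling.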

Thanks, and let me know if I should attach the inventory, globals.yml and all.yml for a complete picture.

Cheers