MariaDB in HA does not come up after kolla-ansible stop

Bug #1712087 reported by Sean Murphy
This bug affects 9 people
Affects: kolla-ansible
Status: Opinion
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

While trying to add new services to a kolla-ansible deployment, MariaDB containers on all controller nodes were stopped (with kolla-ansible stop) and will not come back up again in a sensible state.

OS: CentOS 7.3
Kolla-ansible ver: 4.0.0
Multinode environment with 3 controller nodes, 7 compute nodes and 3 ceph storage nodes

I was adding new services (Gnocchi and Aodh, though I guess that is not so important) using the following procedure:
- shelve VMs
- stop containers using kolla-ansible stop
- remove Virtual IPs from controller interfaces
- run prechecks - kolla-ansible prechecks
- run deploy - kolla-ansible deploy
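Roughly the command sequence, as a sketch (inventory name, VM IDs, VIP address/prefix and interface are placeholders and will differ per deployment):

    openstack server shelve <server-id>                 # for each running VM
    kolla-ansible -i multinode stop
    ip addr del <vip-address>/<prefix> dev <api-iface>  # on any controller still holding the VIP
    kolla-ansible -i multinode prechecks
    kolla-ansible -i multinode deploy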

The prechecks and all preceding steps worked fine.

The deployment ran into problems when deploying MariaDB - see the output below.

I observed on the controller nodes that the mariadb containers were present but stuck in a constant restart cycle. Logs from the mariadb container are below.

----- (Logs from container)

170821 16:12:23 [Note] WSREP: Start replication
170821 16:12:23 [Note] WSREP: Setting initial position to cb03500b-7ddb-11e7-8ace-fb6142a8546a:4016309
170821 16:12:23 [Note] WSREP: protonet asio version 0
170821 16:12:23 [Note] WSREP: Using CRC-32C for message checksums.
170821 16:12:23 [Note] WSREP: backend: asio
170821 16:12:23 [Note] WSREP: gcomm thread scheduling priority set to other:0
170821 16:12:23 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170821 16:12:23 [Note] WSREP: restore pc from disk failed
170821 16:12:23 [Note] WSREP: GMCast version 0
170821 16:12:23 [Note] WSREP: (c0ce933d, 'tcp://192.168.10.2:4567') listening at tcp://192.168.10.2:4567
170821 16:12:23 [Note] WSREP: (c0ce933d, 'tcp://192.168.10.2:4567') multicast: , ttl: 1
170821 16:12:23 [Note] WSREP: EVS version 0
170821 16:12:23 [Note] WSREP: gcomm: connecting to group 'openstack', peer '192.168.10.2:4567,192.168.10.3:4567,192.168.10.4:4567'
170821 16:12:23 [Note] WSREP: (c0ce933d, 'tcp://192.168.10.2:4567') connection established to c0990df6 tcp://192.168.10.3:4567
170821 16:12:23 [Note] WSREP: (c0ce933d, 'tcp://192.168.10.2:4567') turning message relay requesting on, nonlive peers:
170821 16:12:23 [Note] WSREP: declaring c0990df6 at tcp://192.168.10.3:4567 stable
170821 16:12:23 [Warning] WSREP: no nodes coming from prim view, prim not possible
170821 16:12:23 [Note] WSREP: view(view_id(NON_PRIM,c0990df6,1) memb {
        c0990df6,0
        c0ce933d,0
} joined {
} left {
} partitioned {
})
170821 16:12:27 [Note] WSREP: (c0ce933d, 'tcp://192.168.10.2:4567') turning message relay requesting off
170821 16:12:28 [Note] WSREP: (c0ce933d, 'tcp://192.168.10.2:4567') connection established to c3a45fd0 tcp://192.168.10.4:4567
170821 16:12:28 [Note] WSREP: (c0ce933d, 'tcp://192.168.10.2:4567') turning message relay requesting on, nonlive peers:
170821 16:12:28 [Note] WSREP: declaring c0990df6 at tcp://192.168.10.3:4567 stable
170821 16:12:28 [Note] WSREP: declaring c3a45fd0 at tcp://192.168.10.4:4567 stable
170821 16:12:28 [Warning] WSREP: no nodes coming from prim view, prim not possible
170821 16:12:28 [Note] WSREP: view(view_id(NON_PRIM,c0990df6,2) memb {
        c0990df6,0
        c0ce933d,0
        c3a45fd0,0
} joined {
} left {
} partitioned {
})
170821 16:12:31 [Note] WSREP: (c0ce933d, 'tcp://192.168.10.2:4567') turning message relay requesting off
170821 16:12:54 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():158
170821 16:12:54 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
170821 16:12:54 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1380: Failed to open channel 'openstack' at 'gcomm://192.168.10.2:4567,192.168.10.3:4567,192.168.10.4:4567': -110 (Connection timed out)
170821 16:12:54 [ERROR] WSREP: gcs connect failed: Connection timed out
170821 16:12:54 [ERROR] WSREP: wsrep::connect(gcomm://192.168.10.2:4567,192.168.10.3:4567,192.168.10.4:4567) failed: 7
170821 16:12:54 [ERROR] Aborting

170821 16:12:54 [Note] WSREP: Service disconnected.
170821 16:12:55 [Note] WSREP: Some threads may fail to exit.
170821 16:12:55 [Note] /usr/sbin/mysqld: Shutdown complete

----- (output from kolla-ansible deploy - just the mariadb section)

PLAY [Apply role mariadb] ******************************************************

TASK [setup] *******************************************************************
ok: [ned-controller-2]
ok: [ned-controller-1]
ok: [ned-controller-3]

TASK [common : include] ********************************************************
skipping: [ned-controller-1]
skipping: [ned-controller-2]
skipping: [ned-controller-3]

TASK [common : Registering common role has run] ********************************
skipping: [ned-controller-2]
skipping: [ned-controller-1]
skipping: [ned-controller-3]

TASK [mariadb : include] *******************************************************
included: /root/kolla-ocata/kolla-ansible/ansible/roles/mariadb/tasks/deploy.yml for ned-controller-1, ned-controller-2, ned-controller-3

TASK [mariadb : include] *******************************************************
included: /root/kolla-ocata/kolla-ansible/ansible/roles/mariadb/tasks/config.yml for ned-controller-1, ned-controller-2, ned-controller-3

TASK [mariadb : Ensuring config directories exist] *****************************
ok: [ned-controller-1] => (item=mariadb)
ok: [ned-controller-3] => (item=mariadb)
ok: [ned-controller-2] => (item=mariadb)

TASK [mariadb : Copying over config.json files for services] *******************
ok: [ned-controller-1] => (item=mariadb)
ok: [ned-controller-3] => (item=mariadb)
ok: [ned-controller-2] => (item=mariadb)

TASK [mariadb : Copying over galera.cnf] ***************************************
ok: [ned-controller-2] => (item=mariadb)
ok: [ned-controller-3] => (item=mariadb)
ok: [ned-controller-1] => (item=mariadb)

TASK [mariadb : Copying over wsrep-notify.sh] **********************************
ok: [ned-controller-2] => (item=mariadb)
ok: [ned-controller-3] => (item=mariadb)
ok: [ned-controller-1] => (item=mariadb)

TASK [mariadb : include] *******************************************************
included: /root/kolla-ocata/kolla-ansible/ansible/roles/mariadb/tasks/bootstrap.yml for ned-controller-1, ned-controller-2, ned-controller-3

TASK [mariadb : include] *******************************************************
included: /root/kolla-ocata/kolla-ansible/ansible/roles/mariadb/tasks/lookup_cluster.yml for ned-controller-1, ned-controller-2, ned-controller-3

TASK [mariadb : Cleaning up temp file on localhost] ****************************
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be
disabled by setting deprecation_warnings=False in ansible.cfg.
ok: [ned-controller-1 -> localhost]

TASK [mariadb : Creating temp file on localhost] *******************************
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be
disabled by setting deprecation_warnings=False in ansible.cfg.
ok: [ned-controller-1 -> localhost]

TASK [mariadb : Creating mariadb volume] ***************************************
ok: [ned-controller-1]
ok: [ned-controller-2]
ok: [ned-controller-3]

TASK [mariadb : Writing hostname of host with existing cluster files to temp file] ***
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be
disabled by setting deprecation_warnings=False in ansible.cfg.
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be
disabled by setting deprecation_warnings=False in ansible.cfg.
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be
disabled by setting deprecation_warnings=False in ansible.cfg.
ok: [ned-controller-1 -> localhost]
ok: [ned-controller-2 -> localhost]
ok: [ned-controller-3 -> localhost]

TASK [mariadb : Registering host from temp file] *******************************
ok: [ned-controller-1]
ok: [ned-controller-2]
ok: [ned-controller-3]

TASK [mariadb : Cleaning up temp file on localhost] ****************************
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be
disabled by setting deprecation_warnings=False in ansible.cfg.
ok: [ned-controller-1 -> localhost]

TASK [mariadb : include] *******************************************************
skipping: [ned-controller-1]
skipping: [ned-controller-2]
skipping: [ned-controller-3]

TASK [mariadb : include] *******************************************************
skipping: [ned-controller-1]
skipping: [ned-controller-2]
skipping: [ned-controller-3]

TASK [mariadb : include] *******************************************************
included: /root/kolla-ocata/kolla-ansible/ansible/roles/mariadb/tasks/start.yml for ned-controller-1, ned-controller-2, ned-controller-3

TASK [mariadb : Starting mariadb container] ************************************
ok: [ned-controller-3]
ok: [ned-controller-1]
ok: [ned-controller-2]

TASK [mariadb : Waiting for MariaDB service to be ready] ***********************
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (10 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (10 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (10 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (9 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (9 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (9 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (8 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (8 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (8 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (7 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (7 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (7 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (6 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (6 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (6 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (5 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (5 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (5 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (4 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (4 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (4 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (3 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (3 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (3 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (2 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (2 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (2 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (1 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (1 retries left).
FAILED - RETRYING: TASK: mariadb : Waiting for MariaDB service to be ready (1 retries left).
fatal: [ned-controller-1]: FAILED! => {"attempts": 10, "changed": false, "failed": true, "module_stderr": "Shared connection to ned-controller-1 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_oOJtcC/ansible_module_wait_for.py\", line 540, in <module>\r\n main()\r\n File \"/tmp/ansible_oOJtcC/ansible_module_wait_for.py\", line 481, in main\r\n response = s.recv(1024)\r\nsocket.error: [Errno 104] Connection reset by peer\r\n", "msg": "MODULE FAILURE"}
fatal: [ned-controller-2]: FAILED! => {"attempts": 10, "changed": false, "failed": true, "module_stderr": "Shared connection to ned-controller-2 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_mkPgXV/ansible_module_wait_for.py\", line 540, in <module>\r\n main()\r\n File \"/tmp/ansible_mkPgXV/ansible_module_wait_for.py\", line 481, in main\r\n response = s.recv(1024)\r\nsocket.error: [Errno 104] Connection reset by peer\r\n", "msg": "MODULE FAILURE"}
fatal: [ned-controller-3]: FAILED! => {"attempts": 10, "changed": false, "failed": true, "module_stderr": "Shared connection to ned-controller-3 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_YoaTtJ/ansible_module_wait_for.py\", line 540, in <module>\r\n main()\r\n File \"/tmp/ansible_YoaTtJ/ansible_module_wait_for.py\", line 481, in main\r\n response = s.recv(1024)\r\nsocket.error: [Errno 104] Connection reset by peer\r\n", "msg": "MODULE FAILURE"}
        to retry, use: --limit @/root/kolla-ocata/kolla-ansible/ansible/site.retry

Revision history for this message
Tudosoiu Marian (mtudosoiu) wrote :

I've noticed the same issue using kolla-ansible for the Pike release.
From my investigation, when the mariadb container is stopped, the file /var/lib/docker/volumes/mariadb/_data/gvwstate.dat is deleted.

As a workaround I created a copy of the gvwstate.dat file before the container was stopped.
Once the mariadb containers were started again, I used the backed-up gvwstate.dat to recover the MariaDB cluster.
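A sketch of that workaround (the backup location is arbitrary; run it on each controller and adjust the inventory name to your deployment):

    # before stopping the containers, save the Galera view state
    cp /var/lib/docker/volumes/mariadb/_data/gvwstate.dat /root/gvwstate.dat.bak
    kolla-ansible -i multinode stop
    # ...later, once the mariadb containers exist again, restore the file and restart
    cp /root/gvwstate.dat.bak /var/lib/docker/volumes/mariadb/_data/gvwstate.dat
    docker restart mariadb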

Revision history for this message
Alex Fortin (alexfortin17) wrote :

Seeing the same issue using Pike. Tested mtudosoiu's workaround with success.

Revision history for this message
Sean Murphy (murp) wrote :

We just use a manual procedure which is basically the same as the mariadb_recovery functionality of kolla-ansible (and the processes noted above), except we can see directly what is going on under the hood - we prefer this approach given that the db data is important to us.

Revision history for this message
Derek Yang (hswayne77) wrote :

We have the same issue in the Queens release on Ubuntu Xenial. Could anyone give us some suggestions? Thank you.

Revision history for this message
Sean Murphy (murp) wrote :

This is our process (which is basically what the kolla bootstrap mariadb function does):

- Check the following file on every controller node

root@controller-1:/var/lib/docker/volumes/mariadb/_data# cat grastate.dat && echo
# GALERA saved state
version: 2.1
uuid: fbf02533-f205-11e7-ad67-ee3d1299a58d
seqno: -1
safe_to_bootstrap: 0

- Choose the controller with the highest seqno value (a small helper for comparing seqno across nodes is sketched after this procedure).
- Change safe_to_bootstrap: 0 to safe_to_bootstrap: 1 on this node.
- On this node, run

   docker run --net host --name mariadbbootstrap -v /etc/localtime:/etc/localtime:ro -v kolla_logs:/var/log/kolla/ -v mariadb:/var/lib/mysql -v /etc/kolla/mariadb/:/var/lib/kolla/config_files/:ro --restart on-failure:10 --env KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env BOOTSTRAP_ARGS='--wsrep-new-cluster' 172.28.0.1:5000/kolla/ubuntu-binary-mariadb:ocata

(You will have to modify the IP address, and perhaps the image name/tag of the container, depending on your situation.) Wait until the container comes up in a healthy state - this usually takes about a minute.

- Restart the MariaDB Docker containers on the two other controller nodes - do it one by one. They should come up now. Look at the logs to see if they sync with the other mariadb instances in the cluster.
- Stop and remove the bootstrap container -> docker rm -f mariadbbootstrap
- Restart mariadb container on the node on which the rescue container was run. -> docker restart mariadb.
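A small helper for the seqno comparison step above, as a sketch (hostnames are examples; adjust to your inventory):

    for host in controller-1 controller-2 controller-3; do
        echo "== $host"
        ssh "$host" "grep -E 'seqno|safe_to_bootstrap' /var/lib/docker/volumes/mariadb/_data/grastate.dat"
    done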

Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :

DB cluster shutdown should be done in order, and start-up should follow the same order. If you do not do this, you will need to use mariadb_recovery or manually recover the master state of the cluster.

Changed in kolla-ansible:
status: New → Opinion
Revision history for this message
Mark Goddard (mgoddard) wrote :

You can use the 'kolla-ansible mariadb_recovery' command to bring up the Galera cluster if it has stopped.
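For example, with the usual multinode inventory (the inventory path is deployment-specific):

    kolla-ansible -i multinode mariadb_recovery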

Revision history for this message
Dmitry Rachkov (dmitry-rachkov) wrote :

Hi all.

I'm experiencing the same bug on Ussuri, in a 5-node cluster where the second controller node is stuck with Galera being restarted and Ansible stuck on the "mariadb : Wait for MariaDB service port liveness" task.

I also see in the source code the comment "# NOTE(yoctozepto): We have to loop this to avoid breaking on connection resets" on top of the handler task for port liveness.

Can somebody confirm that this is an existing bug that is still not fixed, and what the recommended solution is?

Thank you

2020-10-19 07:44:08,242 p=601358 u=lineng n=ansible | RUNNING HANDLER [mariadb : Wait for MariaDB service port liveness] *************
2020-10-19 07:50:21,448 p=601358 u=lineng n=ansible | An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ConnectionResetError: [Errno 104] Connection reset by peer
2020-10-19 07:50:21,449 p=601358 u=lineng n=ansible | fatal: [osce5]: FAILED! => {"attempts": 10, "changed": false, "module_stderr": "Shared connection to 192.168.1.8 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/home/lineng/.ansible/tmp/ansible-tmp-1603082990.52845-603995-148496658340252/AnsiballZ_wait_for.py\", line 102, in <module>\r\n _ansiballz_main()\r\n File \"/home/lineng/.ansible/tmp/ansible-tmp-1603082990.52845-603995-148496658340252/AnsiballZ_wait_for.py\", line 94, in _ansiballz_main\r\n invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\r\n File \"/home/lineng/.ansible/tmp/ansible-tmp-1603082990.52845-603995-148496658340252/AnsiballZ_wait_for.py\", line 40, in invoke_module\r\n runpy.run_module(mod_name='ansible.modules.utilities.logic.wait_for', init_globals=None, run_name='__main__', alter_sys=True)\r\n File \"/usr/lib64/python3.6/runpy.py\", line 205, in run_module\r\n return _run_module_code(code, init_globals, run_name, mod_spec)\r\n File \"/usr/lib64/python3.6/runpy.py\", line 96, in _run_module_code\r\n mod_name, mod_spec, pkg_name, script_name)\r\n File \"/usr/lib64/python3.6/runpy.py\", line 85, in _run_code\r\n exec(code, run_globals)\r\n File \"/tmp/ansible_wait_for_payload_akcbsjk8/ansible_wait_for_payload.zip/ansible/modules/utilities/logic/wait_for.py\", line 687, in <module>\r\n File \"/tmp/ansible_wait_for_payload_akcbsjk8/ansible_wait_for_payload.zip/ansible/modules/utilities/logic/wait_for.py\", line 615, in main\r\nConnectionResetError: [Errno 104] Connection reset by peer\r\n", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}

Revision history for this message
Mark Goddard (mgoddard) wrote :

Dmitry, you should check the mariadb logs on the host that failed to come up, to see why it did not join the cluster.
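For example (the host-side path of the kolla_logs volume is an assumption and may differ):

    docker logs --tail 100 mariadb
    tail -n 100 /var/lib/docker/volumes/kolla_logs/_data/mariadb/mariadb.log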

Revision history for this message
Dmitry Rachkov (dmitry-rachkov) wrote :

Hi Mark,

Thanks for replying and trying to help!

I will try to give more details, hopefully picturing the setup and symptoms better.

ENV:
multinode (tried with 4 nodes, 2 of which are controllers, and with 5 nodes, 2 of which are controllers)
kolla_base_distro: "centos"
kolla_install_type: "source"
openstack_release: "ussuri"
pip3 freeze | grep kolla
kolla==10.1.0
kolla-ansible==10.1.0

ISSUE AND SYMPTOMS:

Description: kolla-ansible fails on the mariadb port liveness verification and does not bring up the DB cluster properly. I tried multiple times (the clean way: destroying everything and restarting Docker before starting over) and I see that from time to time I get different errors, though always on the same step, the mariadb port liveness handler.

1) The one I explained in my previous post, with the Python error "[Errno 104] Connection reset by peer", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1. Investigation shows that one node has its mariadb container up and running but the other does not even have the image pulled, so it is no surprise that cluster quorum fails. What is not clear is why the second controller node does not have the container pulled and started.
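A quick way to check that on each controller, as a sketch:

    docker images | grep mariadb          # is the mariadb image present at all?
    docker ps -a --filter name=mariadb    # does a mariadb container exist?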

2) Again on the mariadb port liveness verification, but with a different exit code and overall situation: it appears that none of the controller nodes have the images pulled, hence kolla-ansible is not able to start the cluster, giving:

"2020-10-20 08:11:52,517 p=151140 u=lineng n=ansible | fatal: [osce3]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
2020-10-20 08:11:52,551 p=151140 u=lineng n=ansible | fatal: [osce4]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}"

- Continuing with "kolla-ansible -i multinode mariadb_recovery" gives similar results:

"TASK [mariadb : Stop MariaDB containers] ************************************************************
fatal: [osce3]: FAILED! => {"changed": false, "msg": "No such container: mariadb to stop"}"

Docker inspect shows on both nodes:
 docker inspect mariadb
[
    {
        "CreatedAt": "2020-10-20T08:11:42+03:00",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/mariadb/_data",
        "Name": "mariadb",
        "Options": null,
        "Scope": "local"
    }
]

I also attach below the history of the Ansible log covering the mariadb logic execution for issue #2, which highlights the strange absence of a kolla_docker pull action for the mariadb role.

ANSIBLE_LOG:

-FOLDERS and CONFIGS only

2020-10-20 08:11:25,033 p=151140 u=lineng n=ansible | TASK [mariadb :Ensuring config directories exist] *****************************
2020-10-20 08:11:26,755 p=151140 u=lineng n=ansible | TASK [mariadb : Ensuring database backup config directory exists] **************
2020-10-20 08:11:26,941 p=151140 u=lineng n=ansible | TASK [mariadb : Copying over my.cnf for mariabackup] ***************************
2020-10-20 08:11:27,128 p=15114...


Revision history for this message
Sean Murphy (murp) wrote :

I reported this issue originally and we have our approach (which is basically what the recovery process does, but we like to do it manually so we can see what is going on). Hence I have no specific inputs on whether this issue should be closed right now.

We did have an issue with Ussuri upgrade on CentOS - not sure if this is related to your case. We were using CentOS 7 as the host OS but the Ussuri containers assume CentOS 8 - CentOS 8 assumes nftables but this is not supported in CentOS 7. We ended up doing an ugly rollback to Train and now we're looking at a way to get out of the CentOS world (as we're not sure that podman will be sufficiently compatible with docker such that kolla upgrades will be smooth...).

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

We are still using Docker on CentOS 8, even in Victoria. The "trick" with T(C7)->U(C8) migration is to migrate to CentOS 8 first using Train: https://docs.openstack.org/kolla-ansible/train/user/centos8.html

Revision history for this message
Dmitry Rachkov (dmitry-rachkov) wrote :

Hi. In my deployment I'm using the recommended CentOS 8.2 for all machines, deployment node included.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Dmitry,

Although this fails...

2020-10-20 08:11:41,385 p=151140 u=lineng n=ansible | TASK [mariadb : Check MariaDB service port liveness] ***************************
2020-10-20 08:11:51,989 p=151140 u=lineng n=ansible | fatal: [osce3]: FAILED! => {"changed": false, "elapsed": 10, "msg": "Timeout when waiting for search string MariaDB in 192.168.1.6:3306"}
2020-10-20 08:11:51,989 p=151140 u=lineng n=ansible | ...ignoring
2020-10-20 08:11:52,145 p=151140 u=lineng n=ansible | fatal: [osce4]: FAILED! => {"changed": false, "elapsed": 10, "msg": "Timeout when waiting for search string MariaDB in 192.168.1.7:3306"}
2020-10-20 08:11:52,146 p=151140 u=lineng n=ansible | ...ignoring

... the failure is ignored.

2020-10-20 08:11:52,225 p=151140 u=lineng n=ansible | TASK [mariadb : Divide hosts by their MariaDB service port liveness] ***********
2020-10-20 08:11:52,319 p=151140 u=lineng n=ansible | changed: [osce3]
2020-10-20 08:11:52,344 p=151140 u=lineng n=ansible | changed: [osce4]

Here is where it actually fails:

2020-10-20 08:11:52,424 p=151140 u=lineng n=ansible | TASK [mariadb : Fail on existing but stopped cluster] **************************
2020-10-20 08:11:52,517 p=151140 u=lineng n=ansible | fatal: [osce3]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
2020-10-20 08:11:52,551 p=151140 u=lineng n=ansible | fatal: [osce4]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}

This check fails when at least one node has a mariadb docker volume, but no node has a mariadb container running.

If you are using the kolla-ansible destroy command to clean up, there is an issue where it only removes volumes that are mounted by existing containers. If you have a stale mariadb volume, this could be causing the issue. Try removing it.
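For example, on each affected node (note that removing the volume deletes that node's database data, so only do this when cleaning up a deployment):

    docker ps -a --filter name=mariadb      # confirm no mariadb container is left
    docker volume ls --filter name=mariadb  # the stale volume
    docker volume rm mariadb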

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Maybe destroy should look for known volumes to remove as well?

Revision history for this message
Dmitry Rachkov (dmitry-rachkov) wrote :

Mark,

Thank you very much, you totally made my day :) - that was exactly it: these `fireworks` were caused by dangling volumes! It is resolved now.

Guys, can I propose adding cleaning of dangling volumes as part of the destroy scenario? If so, is there already a feature request, or somebody working on that? If not, I can do a pull request; I already checked the source code and this should be quite easy.

Thanks

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Please propose it (it can be a patch here, as Gerrit is down).

Revision history for this message
Mark Goddard (mgoddard) wrote :

Great. The fiddly part will be destroying only volumes created by kolla. For containers we can filter by label, but this is not possible for volumes. That is why we look for volumes mounted by kolla containers.
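A sketch of that distinction (the label name is an assumption about the labels baked into kolla images):

    # containers built from kolla images can be filtered by an image label:
    docker ps -a --filter label=kolla_version --format '{{.Names}}'
    # volumes carry no such label, so they are found via the mounts of kolla containers:
    docker inspect --type container --format '{{range .Mounts}}{{.Name}}{{"\n"}}{{end}}' mariadb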

A better solution would be to implement destroy via ansible, on a per-service basis. There is an old patch for this, but it will need some love: https://review.opendev.org/504592

Revision history for this message
Greg Dulin (gregory-dulin) wrote :

I ran into this issue too while using Kayobe (train) to redeploy on previously used hosts. Using the `--wipe-disks` flag (as recommended in the Kayobe deployment documentation) fixed it for me:

kayobe overcloud host configure --wipe-disks
https://docs.openstack.org/kayobe/train/deployment.html

Thanks!
