Stein Debian MariaDB cluster failed to deploy on multi-node

Bug #1867750 reported by xinliang
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
kolla
Fix Released
Medium
Unassigned

Bug Description

kolla-ansible deploy log:

RUNNING HANDLER [mariadb : wait for slave mariadb] ***********************************************************************************************************************************
skipping: [10.101.16.18]
FAILED - RETRYING: wait for slave mariadb (10 retries left).
FAILED - RETRYING: wait for slave mariadb (9 retries left).
FAILED - RETRYING: wait for slave mariadb (10 retries left).
FAILED - RETRYING: wait for slave mariadb (9 retries left).
FAILED - RETRYING: wait for slave mariadb (8 retries left).
FAILED - RETRYING: wait for slave mariadb (7 retries left).
FAILED - RETRYING: wait for slave mariadb (8 retries left).
FAILED - RETRYING: wait for slave mariadb (7 retries left).
FAILED - RETRYING: wait for slave mariadb (6 retries left).
FAILED - RETRYING: wait for slave mariadb (6 retries left).
FAILED - RETRYING: wait for slave mariadb (5 retries left).
FAILED - RETRYING: wait for slave mariadb (5 retries left).
FAILED - RETRYING: wait for slave mariadb (4 retries left).
FAILED - RETRYING: wait for slave mariadb (4 retries left).
FAILED - RETRYING: wait for slave mariadb (3 retries left).
FAILED - RETRYING: wait for slave mariadb (3 retries left).
FAILED - RETRYING: wait for slave mariadb (2 retries left).
FAILED - RETRYING: wait for slave mariadb (1 retries left).
FAILED - RETRYING: wait for slave mariadb (2 retries left).
fatal: [10.101.16.24]: FAILED! => {"attempts": 10, "changed": false, "module_stderr": "Shared connection to 10.101.16.24 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/home/linaro/.ansible/tmp/ansible-tmp-1584434030.53-96925213156540/AnsiballZ_wait_for.py\", line 113, in <module>\r\n _ansiballz_main()\r\n File \"/home/linaro/.ansible/tmp/ansible-tmp-1584434030.53-96925213156540/AnsiballZ_wait_for.py\", line 105, in _ansiballz_main\r\n invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\r\n File \"/home/linaro/.ansible/tmp/ansible-tmp-1584434030.53-96925213156540/AnsiballZ_wait_for.py\", line 48, in invoke_module\r\n imp.load_module('__main__', mod, module, MOD_DESC)\r\n File \"/tmp/ansible_wait_for_payload_eEGcKI/__main__.py\", line 629, in <module>\r\n File \"/tmp/ansible_wait_for_payload_eEGcKI/__main__.py\", line 558, in main\r\nsocket.error: [Errno 104] Connection reset by peer\r\n", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
FAILED - RETRYING: wait for slave mariadb (1 retries left).
fatal: [10.101.16.120]: FAILED! => {"attempts": 10, "changed": false, "module_stderr": "Shared connection to 10.101.16.120 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/home/linaro/.ansible/tmp/ansible-tmp-1584434060.67-28335586015662/AnsiballZ_wait_for.py\", line 113, in <module>\r\n _ansiballz_main()\r\n File \"/home/linaro/.ansible/tmp/ansible-tmp-1584434060.67-28335586015662/AnsiballZ_wait_for.py\", line 105, in _ansiballz_main\r\n invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\r\n File \"/home/linaro/.ansible/tmp/ansible-tmp-1584434060.67-28335586015662/AnsiballZ_wait_for.py\", line 48, in invoke_module\r\n imp.load_module('__main__', mod, module, MOD_DESC)\r\n File \"/tmp/ansible_wait_for_payload_jGfmU0/__main__.py\", line 629, in <module>\r\n File \"/tmp/ansible_wait_for_payload_jGfmU0/__main__.py\", line 558, in main\r\nsocket.error: [Errno 104] Connection reset by peer\r\n", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}RUNNING HANDLER [mariadb : run upgrade on slave] *************************************************************************************************************************************
skipping: [10.101.16.18]RUNNING HANDLER [mariadb : remove restart policy from master mariadb] ****************************************************************************************************************
changed: [10.101.16.18]RUNNING HANDLER [mariadb : shutdown master mariadb] **********************************************************************************************************************************
changed: [10.101.16.18]RUNNING HANDLER [mariadb : wait for master mariadb shutdown] *************************************************************************************************************************
FAILED - RETRYING: wait for master mariadb shutdown (30 retries left).
ok: [10.101.16.18]RUNNING HANDLER [mariadb : stop mariadb master] **************************************************************************************************************************************
skipping: [10.101.16.18]RUNNING HANDLER [mariadb : replace restart policy on master mariadb] *****************************************************************************************************************
changed: [10.101.16.18]RUNNING HANDLER [mariadb : restart master mariadb] ***********************************************************************************************************************************
changed: [10.101.16.18]RUNNING HANDLER [mariadb : Waiting for master mariadb] *******************************************************************************************************************************
FAILED - RETRYING: Waiting for master mariadb (10 retries left).
FAILED - RETRYING: Waiting for master mariadb (9 retries left).
FAILED - RETRYING: Waiting for master mariadb (8 retries left).
FAILED - RETRYING: Waiting for master mariadb (7 retries left).
FAILED - RETRYING: Waiting for master mariadb (6 retries left).
FAILED - RETRYING: Waiting for master mariadb (5 retries left).
FAILED - RETRYING: Waiting for master mariadb (4 retries left).
FAILED - RETRYING: Waiting for master mariadb (3 retries left).
FAILED - RETRYING: Waiting for master mariadb (2 retries left).
FAILED - RETRYING: Waiting for master mariadb (1 retries left).
fatal: [10.101.16.18]: FAILED! => {"attempts": 10, "changed": false, "module_stderr": "Shared connection to 10.101.16.18 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/home/linaro/.ansible/tmp/ansible-tmp-1584434448.57-23221020123368/AnsiballZ_wait_for.py\", line 113, in <module>\r\n _ansiballz_main()\r\n File \"/home/linaro/.ansible/tmp/ansible-tmp-1584434448.57-23221020123368/AnsiballZ_wait_for.py\", line 105, in _ansiballz_main\r\n invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\r\n File \"/home/linaro/.ansible/tmp/ansible-tmp-1584434448.57-23221020123368/AnsiballZ_wait_for.py\", line 48, in invoke_module\r\n imp.load_module('__main__', mod, module, MOD_DESC)\r\n File \"/tmp/ansible_wait_for_payload_FG3u2r/__main__.py\", line 629, in <module>\r\n File \"/tmp/ansible_wait_for_payload_FG3u2r/__main__.py\", line 558, in main\r\nsocket.error: [Errno 104] Connection reset by peer\r\n", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}RUNNING HANDLER [mariadb : run upgrade on master] ************************************************************************************************************************************NO MORE HOSTS LEFT *******************************************************************************************************************************************************************
        to retry, use: --limit @/home/linaro/liuxl/kolla-ansible/ansible/site.retryPLAY RECAP ***************************************************************************************************************************************************************************
10.101.16.120 : ok=49 changed=24 unreachable=0 failed=1
10.101.16.18 : ok=102 changed=54 unreachable=0 failed=1
10.101.16.24 : ok=83 changed=49 unreachable=0 failed=1
localhost : ok=5 changed=0 unreachable=0 failed=0Command failed ansible-playbook -i stein_multinode -e @/etc/kolla/globals.yml -e @/etc/kolla/passwords.yml -e CONFIG_DIR=/etc/kolla -e kolla_action=deploy /home/linaro/liuxl/kolla-ansible/ansible/site.yml

------------------------------------------------------------------------------
mariadb.log:

Error in my_thread_global_end(): 1 threads didn't exit
2020-03-17 16:33:02 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2020-03-17 16:33:02 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2020-03-17 16:33:02 0 [Note] WSREP: wsrep_load(): Galera 3.25(rddf9876) by Codership Oy <email address hidden> loaded successfully.
2020-03-17 16:33:02 0 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2020-03-17 16:33:02 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 1
2020-03-17 16:33:02 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 10.101.16.24; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period =
PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M;
 gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://10.101.16.24:4567; gmcast.segment = 0; g
2020-03-17 16:33:02 0 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 00000000-0000-0000-0000-000000000000:-1
2020-03-17 16:33:02 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2020-03-17 16:33:02 0 [Note] WSREP: wsrep_sst_grab()
2020-03-17 16:33:02 0 [Note] WSREP: Start replication
2020-03-17 16:33:02 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2020-03-17 16:33:02 0 [Note] WSREP: protonet asio version 0
2020-03-17 16:33:02 0 [Note] WSREP: Using CRC-32C for message checksums.
2020-03-17 16:33:02 0 [Note] WSREP: backend: asio
2020-03-17 16:33:02 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2020-03-17 16:33:02 0 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2020-03-17 16:33:02 0 [Note] WSREP: restore pc from disk failed
2020-03-17 16:33:02 0 [Note] WSREP: GMCast version 0
2020-03-17 16:33:02 0 [Note] WSREP: (eac4fba7, 'tcp://10.101.16.24:4567') listening at tcp://10.101.16.24:4567
2020-03-17 16:33:02 0 [Note] WSREP: (eac4fba7, 'tcp://10.101.16.24:4567') multicast: , ttl: 1
2020-03-17 16:33:02 0 [Note] WSREP: EVS version 0
2020-03-17 16:33:02 0 [Note] WSREP: gcomm: connecting to group 'openstack', peer '10.101.16.18:4567,10.101.16.24:4567,10.101.16.120:4567'
2020-03-17 16:33:02 0 [Note] WSREP: (eac4fba7, 'tcp://10.101.16.24:4567') connection established to 6f169a34 tcp://10.101.16.18:4567
2020-03-17 16:33:02 0 [Note] WSREP: (eac4fba7, 'tcp://10.101.16.24:4567') turning message relay requesting on, nonlive peers:
2020-03-17 16:33:02 0 [Note] WSREP: (eac4fba7, 'tcp://10.101.16.24:4567') connection established to e5e3aa99 tcp://10.101.16.120:4567
2020-03-17 16:33:03 0 [Note] WSREP: declaring 6f169a34 at tcp://10.101.16.18:4567 stable 2020-03-17 16:33:03 0 [Note] WSREP: Node 6f169a34 state prim
2020-03-17 16:33:03 0 [Note] WSREP: view(view_id(PRIM,6f169a34,31) memb {
        6f169a34,0 eac4fba7,0
} joined {
} left { } partitioned {
})
2020-03-17 16:33:03 0 [Note] WSREP: save pc into disk
2020-03-17 16:33:03 0 [Note] WSREP: forgetting e5e3aa99 (tcp://10.101.16.120:4567)
2020-03-17 16:33:03 0 [Note] WSREP: gcomm: connected
2020-03-17 16:33:03 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2020-03-17 16:33:03 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2020-03-17 16:33:03 0 [Note] WSREP: Opened channel 'openstack'
2020-03-17 16:33:03 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2020-03-17 16:33:03 0 [Note] WSREP: Waiting for SST to complete.
2020-03-17 16:33:03 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2020-03-17 16:33:03 0 [Note] WSREP: STATE EXCHANGE: sent state msg: eb14e9b1-6829-11ea-b7d0-e7c5e997c4c1
2020-03-17 16:33:03 0 [Note] WSREP: STATE EXCHANGE: got state msg: eb14e9b1-6829-11ea-b7d0-e7c5e997c4c1 from 0 (j12-d05-06)
2020-03-17 16:33:03 0 [Note] WSREP: STATE EXCHANGE: got state msg: eb14e9b1-6829-11ea-b7d0-e7c5e997c4c1 from 1 (j12-d05)
2020-03-17 16:33:03 0 [Note] WSREP: Quorum results:
        version = 4,
        component = PRIMARY,
        conf_id = 30,
        members = 1/2 (joined/total),
        act_id = 0,
        last_appl. = -1,
        protocols = 0/9/3 (gcs/repl/appl),
        group UUID = 6f191617-6829-11ea-9b39-46288999ce82
2020-03-17 16:33:03 0 [Note] WSREP: Flow-control interval: [23, 23]
2020-03-17 16:33:03 0 [Note] WSREP: Trying to continue unpaused monitor
2020-03-17 16:33:03 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 0)
2020-03-17 16:33:03 2 [Note] WSREP: State transfer required:
        Group state: 6f191617-6829-11ea-9b39-46288999ce82:0
        Local state: 00000000-0000-0000-0000-000000000000:-1
2020-03-17 16:33:03 2 [Note] WSREP: New cluster view: global state: 6f191617-6829-11ea-9b39-46288999ce82:0, view# 31: Primary, number of nodes: 2, my index: 1, protocol version 3
2020-03-17 16:33:03 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2020-03-17 16:33:03 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.101.16.24:4444' --datadir '/var/lib/mysql/' --parent '229' --binlog 'mysql-bin' --mysqld-args --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/usr/lib/aarch64-linux-gnu/mariadb19/plugin --wsrep_provider=/usr/lib/galera/libgalera_smm.so --wsrep_on=ON --log-erro
r=/var/log/kolla/mariadb/mariadb.log --pid-file=/var/lib/mysql/mariadb.pid --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1'
WSREP_SST: [ERROR] mariabackup binary not found in $PATH (20200317 16:33:04.021)
2020-03-17 16:33:04 0 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_mariabackup --role 'joiner' --address '10.101.16.24:4444' --datadir '/var/lib/mysql/' --parent '229' --binlog 'mysql-bin' --mysqld-args --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/usr/lib/aarch64-linux-gnu/mariadb19/plugin --wsrep_provider=/usr/lib/galera/libgalera_smm.
so --wsrep_on=ON --log-error=/var/log/kolla/mariadb/mariadb.log --pid-file=/var/lib/mysql/mariadb.pid --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1
        Read: '(null)'
2020-03-17 16:33:04 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address '10.101.16.24:4444' --datadir '/var/lib/mysql/' --parent '229' --binlog 'mysql-bin' --mysqld-args --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/usr/lib/aarch64-linux-gnu/mariadb19/plugin --wsrep_provider=/usr/lib/galera/libgalera_smm.so --w
srep_on=ON --log-error=/var/log/kolla/mariadb/mariadb.log --pid-file=/var/lib/mysql/mariadb.pid --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1: 42 (No message of desired type)
2020-03-17 16:33:04 2 [ERROR] WSREP: Failed to prepare for 'mariabackup' SST. Unrecoverable.
2020-03-17 16:33:04 2 [ERROR] Aborting

xinliang (xin3liang)
affects: kolla → kolla-ansible
affects: kolla-ansible → kolla
Changed in kolla:
status: New → Confirmed
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

You could try this for more resilient deploy: https://review.opendev.org/706078

However, this looks like issue with images. @Marcin for aarch64.

Changed in kolla:
assignee: nobody → Marcin Juszkiewicz (hrw)
Revision history for this message
xinliang (xin3liang) wrote :

From mariadb.log "[ERROR] mariabackup binary not found in $PATH", it seems it is lacking of mariabackup which from mariadb-backup package.

Revision history for this message
Marcin Juszkiewicz (hrw) wrote :

Looks like we need to backport https://review.opendev.org/#/c/681694/ where we moved to use mariadb-backup instead of percona xtrabackup

Revision history for this message
Marcin Juszkiewicz (hrw) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/713647

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (stable/stein)

Reviewed: https://review.opendev.org/713647
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=1a0886f84fa1fbee91dfeff628f4e2e77c9ef7a7
Submitter: Zuul
Branch: stable/stein

commit 1a0886f84fa1fbee91dfeff628f4e2e77c9ef7a7
Author: Marcin Juszkiewicz <email address hidden>
Date: Wed Mar 18 13:52:14 2020 +0100

    mariadb: install mariabackup on Debian

    On AArch64 systems we are unable to deploy Stein due to lack of
    mariabackup command.

    Change-Id: I00b8c945470a0f63f23f2d0da74eed3189f5d9e5
    Closes-bug: #1867750

tags: added: in-stable-stein
Marcin Juszkiewicz (hrw)
Changed in kolla:
milestone: none → 8.0.3
status: Confirmed → Fix Committed
importance: Undecided → Medium
Mark Goddard (mgoddard)
Changed in kolla:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.