mariadb multinode upgrade broken ocata-pike

Bug #1692507 reported by Eduardo Gonzalez
Affects          Status         Importance   Assigned to   Milestone
kolla-ansible    Fix Released   Critical     Unassigned
  Ocata          Triaged        Critical     Unassigned
  Pike           Fix Released   Critical     Unassigned

Bug Description

While doing an upgrade from ocata to master (pike) in a multinode environment, the mariadb containers keep restarting.

170522 12:33:48 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql/
170522 12:33:48 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql//wsrep_recovery.c6LmRh' --pid-file='/var/lib/mysql//controller2-recover.pid'
nohup: ignoring input
170522 12:33:48 [Note] /usr/sbin/mysqld (mysqld 10.0.30-MariaDB-wsrep) starting as process 180 ...
170522 12:33:51 mysqld_safe WSREP: Recovered position 7ca85504-3ee1-11e7-8b89-d2201d697ea9:7
170522 12:33:51 [Note] WSREP: Read nil XID from storage engines, skipping position init
170522 12:33:51 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
170522 12:33:51 [Note] /usr/sbin/mysqld (mysqld 10.0.30-MariaDB-wsrep) starting as process 219 ...
170522 12:33:51 [Note] WSREP: wsrep_load(): Galera 25.3.19(r3667) by Codership Oy <email address hidden> loaded successfully.
170522 12:33:51 [Note] WSREP: CRC-32C: using hardware acceleration.
170522 12:33:51 [Note] WSREP: Found saved state: 7ca85504-3ee1-11e7-8b89-d2201d697ea9:-1, safe_to_bootsrap: 0
170522 12:33:51 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.100.186; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://192.168.100.186:4567; gmcast.segment = 0; gmcast.version = 0; i
170522 12:33:51 [Note] WSREP: GCache history reset: old(7ca85504-3ee1-11e7-8b89-d2201d697ea9:0) -> new(7ca85504-3ee1-11e7-8b89-d2201d697ea9:7)
170522 12:33:51 [Note] WSREP: Assign initial position for certification: 7, protocol version: -1
170522 12:33:51 [Note] WSREP: wsrep_sst_grab()
170522 12:33:51 [Note] WSREP: Start replication
170522 12:33:51 [Note] WSREP: Setting initial position to 7ca85504-3ee1-11e7-8b89-d2201d697ea9:7
170522 12:33:51 [Note] WSREP: protonet asio version 0
170522 12:33:51 [Note] WSREP: Using CRC-32C for message checksums.
170522 12:33:51 [Note] WSREP: backend: asio
170522 12:33:51 [Note] WSREP: gcomm thread scheduling priority set to other:0
170522 12:33:51 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170522 12:33:51 [Note] WSREP: restore pc from disk failed
170522 12:33:51 [Note] WSREP: GMCast version 0
170522 12:33:51 [Note] WSREP: (8783922d, 'tcp://192.168.100.186:4567') listening at tcp://192.168.100.186:4567
170522 12:33:51 [Note] WSREP: (8783922d, 'tcp://192.168.100.186:4567') multicast: , ttl: 1
170522 12:33:51 [Note] WSREP: EVS version 0
170522 12:33:51 [Note] WSREP: gcomm: connecting to group 'openstack', peer '192.168.100.244:4567,192.168.100.186:4567'
170522 12:33:51 [Note] WSREP: (8783922d, 'tcp://192.168.100.186:4567') connection established to 87cf69fd tcp://192.168.100.244:4567
170522 12:33:51 [Note] WSREP: (8783922d, 'tcp://192.168.100.186:4567') turning message relay requesting on, nonlive peers:
170522 12:33:51 [Note] WSREP: declaring 87cf69fd at tcp://192.168.100.244:4567 stable
170522 12:33:51 [Warning] WSREP: no nodes coming from prim view, prim not possible
170522 12:33:51 [Note] WSREP: view(view_id(NON_PRIM,8783922d,1) memb {
 8783922d,0
 87cf69fd,0
} joined {
} left {
} partitioned {
})
170522 12:33:55 [Note] WSREP: (8783922d, 'tcp://192.168.100.186:4567') turning message relay requesting off
170522 12:34:22 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
  at gcomm/src/pc.cpp:connect():158
170522 12:34:22 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
170522 12:34:22 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1380: Failed to open channel 'openstack' at 'gcomm://192.168.100.244:4567,192.168.100.186:4567': -110 (Connection timed out)
170522 12:34:22 [ERROR] WSREP: gcs connect failed: Connection timed out
170522 12:34:22 [ERROR] WSREP: wsrep::connect(gcomm://192.168.100.244:4567,192.168.100.186:4567) failed: 7
170522 12:34:22 [ERROR] Aborting

170522 12:34:22 [Note] WSREP: Service disconnected.
170522 12:34:23 [Note] WSREP: Some threads may fail to exit.
170522 12:34:23 [Note] /usr/sbin/mysqld: Shutdown complete

170522 12:34:23 mysqld_safe mysqld from pid file /var/lib/mysql/mariadb.pid ended

Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :

Tried several times today with Ansible 2.3 and Ansible 2.1.0; cannot find the root cause of this issue.

Debug logs for deployment: http://paste.openstack.org/show/610628/

MariaDB package versions are the same in both the Ocata and master images (mariadb 10.0.30).

Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :

All-in-one upgrade worked.

Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote : Re: mariadb multinode fail to upgrade ocata->pike

Upgrade from newton to ocata worked.
Upgrade from ocata to pike (master) fails.

I don't really know the reason for this; nothing has changed in the mariadb role since then, and the mariadb version is the same.

description: updated
summary: - mariadb fail to upgrade ocata->master
+ mariadb multinode fail to upgrade ocata->pike
Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :

Enabling serial for the mariadb play fixes the issue. The problem is that restarting all mariadb containers at the same time loses cluster quorum (see the illustrative play after the patch below).

Using this:
export ANSIBLE_SERIAL=1

And applying this patch:

diff --git a/ansible/site.yml b/ansible/site.yml
index 9b705a3..8368e9d 100644
--- a/ansible/site.yml
+++ b/ansible/site.yml
@@ -133,6 +133,7 @@
 - name: Apply role mariadb
   gather_facts: false
   hosts: mariadb
+ serial: '{{ serial|default("0") }}'
   roles:
     - { role: mariadb,
         tags: mariadb,
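
For illustration, here is a minimal sketch of what the serial keyword changes. This is a hypothetical standalone play, not the actual kolla-ansible mariadb role: with serial left at its default of 0, every host in the mariadb group is restarted in the same batch, while serial: 1 runs the play to completion on one host at a time, so the remaining Galera nodes keep quorum.

- name: Rolling restart of mariadb containers
  hosts: mariadb
  gather_facts: false
  # serial: 0 (the default) restarts every node in parallel and drops quorum;
  # serial: 1 keeps at most one Galera node down at any time.
  serial: "{{ serial | default('0') }}"
  tasks:
    - name: Restart the mariadb container (illustrative task, not the real role)
      command: docker restart mariadb

Running such a play with "ansible-playbook -e serial=1" (or via the ANSIBLE_SERIAL workaround above) gives the rolling behaviour.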

Changed in kolla-ansible:
importance: Undecided → Critical
summary: - mariadb multinode fail to upgrade ocata->pike
+ mariadb multinode fail to upgrade ocata->pike due missing serial
summary: - mariadb multinode fail to upgrade ocata->pike due missing serial
+ mariadb multinode fail to upgrade ocata->pike
Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote : Re: optimize reconfiguration breaks mariadb upgrade

Looking into the mariadb role changes, it seems that the optimized reconfiguration breaks the mariadb cluster lookup order while upgrading and restarts all mariadb containers at the same time, causing cluster quorum issues: with both nodes restarting at once, neither can rejoin an existing primary component, which matches the "failed to reach primary view" error above.

TASK [mariadb : include] ***********************************************************************************************************************
included: /root/kolla-ansible/ansible/roles/mariadb/tasks/lookup_cluster.yml for 192.168.100.244, 192.168.100.186

TASK [mariadb : Cleaning up temp file on localhost] ********************************************************************************************
ok: [192.168.100.244 -> localhost]

TASK [mariadb : Creating temp file on localhost] ***********************************************************************************************
ok: [192.168.100.244 -> localhost]

TASK [mariadb : Creating mariadb volume] *******************************************************************************************************
ok: [192.168.100.244]
ok: [192.168.100.186]

TASK [mariadb : Writing hostname of host with existing cluster files to temp file] *************************************************************
ok: [192.168.100.244 -> localhost]
ok: [192.168.100.186 -> localhost]

TASK [mariadb : Registering host from temp file] ***********************************************************************************************
ok: [192.168.100.244]
ok: [192.168.100.186]

TASK [mariadb : Cleaning up temp file on localhost] ********************************************************************************************
ok: [192.168.100.244 -> localhost]

TASK [mariadb : include] ***********************************************************************************************************************
included: /root/kolla-ansible/ansible/roles/mariadb/tasks/start.yml for 192.168.100.244, 192.168.100.186

TASK [mariadb : Starting mariadb container] ****************************************************************************************************
changed: [192.168.100.244]
changed: [192.168.100.186]

summary: - mariadb multinode fail to upgrade ocata->pike
+ optimize reconfiguration breaks mariadb upgrade
summary: - optimize reconfiguration breaks mariadb upgrade
+ optimize reconfiguration breaks multinode mariadb upgrade
Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote : Re: optimize reconfiguration breaks multinode mariadb upgrade

The mariadb optimize reconfiguration patch was not merged yet, so the issue is with the serial removal.
https://review.openstack.org/#/c/433480/

summary: - optimize reconfiguration breaks multinode mariadb upgrade
+ mariadb multinode upgrade broken ocata-pike
Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :
Changed in kolla-ansible:
status: New → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by Eduardo Gonzalez (<email address hidden>) on branch: master
Review: https://review.openstack.org/485217
Reason: Fixed with https://review.openstack.org/#/c/433480/

Revision history for this message
Serge Radinovich (srad015) wrote :

I wiped /etc/ansible and re-installed kolla-ansible to fix this issue.
