mariadb multinode upgrade broken ocata-pike

Bug #1692507 reported by Eduardo Gonzalez
Affects          Status         Importance   Assigned to   Milestone
kolla-ansible    Fix Released   Critical     Unassigned
  Ocata          Triaged        Critical     Unassigned
  Pike           Fix Released   Critical     Unassigned

Bug Description

While doing an upgrade from ocata to master (pike) in a multinode environment, the mariadb containers keep restarting.

170522 12:33:48 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql/
170522 12:33:48 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql//wsrep_recovery.c6LmRh' --pid-file='/var/lib/mysql//controller2-recover.pid'
nohup: ignoring input
170522 12:33:48 [Note] /usr/sbin/mysqld (mysqld 10.0.30-MariaDB-wsrep) starting as process 180 ...
170522 12:33:51 mysqld_safe WSREP: Recovered position 7ca85504-3ee1-11e7-8b89-d2201d697ea9:7
170522 12:33:51 [Note] WSREP: Read nil XID from storage engines, skipping position init
170522 12:33:51 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
170522 12:33:51 [Note] /usr/sbin/mysqld (mysqld 10.0.30-MariaDB-wsrep) starting as process 219 ...
170522 12:33:51 [Note] WSREP: wsrep_load(): Galera 25.3.19(r3667) by Codership Oy <email address hidden> loaded successfully.
170522 12:33:51 [Note] WSREP: CRC-32C: using hardware acceleration.
170522 12:33:51 [Note] WSREP: Found saved state: 7ca85504-3ee1-11e7-8b89-d2201d697ea9:-1, safe_to_bootsrap: 0
170522 12:33:51 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.100.186; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://192.168.100.186:4567; gmcast.segment = 0; gmcast.version = 0; i
170522 12:33:51 [Note] WSREP: GCache history reset: old(7ca85504-3ee1-11e7-8b89-d2201d697ea9:0) -> new(7ca85504-3ee1-11e7-8b89-d2201d697ea9:7)
170522 12:33:51 [Note] WSREP: Assign initial position for certification: 7, protocol version: -1
170522 12:33:51 [Note] WSREP: wsrep_sst_grab()
170522 12:33:51 [Note] WSREP: Start replication
170522 12:33:51 [Note] WSREP: Setting initial position to 7ca85504-3ee1-11e7-8b89-d2201d697ea9:7
170522 12:33:51 [Note] WSREP: protonet asio version 0
170522 12:33:51 [Note] WSREP: Using CRC-32C for message checksums.
170522 12:33:51 [Note] WSREP: backend: asio
170522 12:33:51 [Note] WSREP: gcomm thread scheduling priority set to other:0
170522 12:33:51 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170522 12:33:51 [Note] WSREP: restore pc from disk failed
170522 12:33:51 [Note] WSREP: GMCast version 0
170522 12:33:51 [Note] WSREP: (8783922d, 'tcp://192.168.100.186:4567') listening at tcp://192.168.100.186:4567
170522 12:33:51 [Note] WSREP: (8783922d, 'tcp://192.168.100.186:4567') multicast: , ttl: 1
170522 12:33:51 [Note] WSREP: EVS version 0
170522 12:33:51 [Note] WSREP: gcomm: connecting to group 'openstack', peer '192.168.100.244:4567,192.168.100.186:4567'
170522 12:33:51 [Note] WSREP: (8783922d, 'tcp://192.168.100.186:4567') connection established to 87cf69fd tcp://192.168.100.244:4567
170522 12:33:51 [Note] WSREP: (8783922d, 'tcp://192.168.100.186:4567') turning message relay requesting on, nonlive peers:
170522 12:33:51 [Note] WSREP: declaring 87cf69fd at tcp://192.168.100.244:4567 stable
170522 12:33:51 [Warning] WSREP: no nodes coming from prim view, prim not possible
170522 12:33:51 [Note] WSREP: view(view_id(NON_PRIM,8783922d,1) memb {
 8783922d,0
 87cf69fd,0
} joined {
} left {
} partitioned {
})
170522 12:33:55 [Note] WSREP: (8783922d, 'tcp://192.168.100.186:4567') turning message relay requesting off
170522 12:34:22 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
  at gcomm/src/pc.cpp:connect():158
170522 12:34:22 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
170522 12:34:22 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1380: Failed to open channel 'openstack' at 'gcomm://192.168.100.244:4567,192.168.100.186:4567': -110 (Connection timed out)
170522 12:34:22 [ERROR] WSREP: gcs connect failed: Connection timed out
170522 12:34:22 [ERROR] WSREP: wsrep::connect(gcomm://192.168.100.244:4567,192.168.100.186:4567) failed: 7
170522 12:34:22 [ERROR] Aborting

170522 12:34:22 [Note] WSREP: Service disconnected.
170522 12:34:23 [Note] WSREP: Some threads may fail to exit.
170522 12:34:23 [Note] /usr/sbin/mysqld: Shutdown complete

170522 12:34:23 mysqld_safe mysqld from pid file /var/lib/mysql/mariadb.pid ended

Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :

Tried several times today with Ansible 2.3 and Ansible 2.1.0; cannot find the root cause of this issue.

Debug logs for deployment: http://paste.openstack.org/show/610628/

MariaDB package versions are the same in both the Ocata and master images (mariadb 10.0.30).

Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :

All-in-one upgrade worked.

Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote : Re: mariadb multinode fail to upgrade ocata->pike

Upgrade from newton to ocata worked.
Upgrade from ocata to pike (master) fails.

I don't really know the reason for this; nothing has changed in the mariadb role since then, and the mariadb version is the same.

description: updated
summary: - mariadb fail to upgrade ocata->master
+ mariadb multinode fail to upgrade ocata->pike
Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :

Enabling serial for the mariadb play fixes the issue. The problem is that restarting all mariadb containers at the same time loses cluster quorum (see the illustrative play after the patch below).

Using this:
export ANSIBLE_SERIAL=1

And applying this patch:

diff --git a/ansible/site.yml b/ansible/site.yml
index 9b705a3..8368e9d 100644
--- a/ansible/site.yml
+++ b/ansible/site.yml
@@ -133,6 +133,7 @@
 - name: Apply role mariadb
   gather_facts: false
   hosts: mariadb
+ serial: '{{ serial|default("0") }}'
   roles:
     - { role: mariadb,
         tags: mariadb,
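
For illustration, here is a minimal sketch of what the serial keyword changes. This is a hypothetical standalone play, not the actual kolla-ansible mariadb role: with serial left at its default of 0, every host in the mariadb group is restarted in the same batch, while serial: 1 runs the play to completion on one host at a time, so the remaining Galera nodes keep quorum.

- name: Rolling restart of mariadb containers
  hosts: mariadb
  gather_facts: false
  # serial: 0 (the default) restarts every node in parallel and drops quorum;
  # serial: 1 keeps at most one Galera node down at any time.
  serial: "{{ serial | default('0') }}"
  tasks:
    - name: Restart the mariadb container (illustrative task, not the real role)
      command: docker restart mariadb

Running such a play with "ansible-playbook -e serial=1" (or via the ANSIBLE_SERIAL workaround above) gives the rolling behaviour.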

Changed in kolla-ansible:
importance: Undecided → Critical
summary: - mariadb multinode fail to upgrade ocata->pike
+ mariadb multinode fail to upgrade ocata->pike due missing serial
summary: - mariadb multinode fail to upgrade ocata->pike due missing serial
+ mariadb multinode fail to upgrade ocata->pike
Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote : Re: optimize reconfiguration breaks mariadb upgrade

Looking into the mariadb role changes, it seems that the optimized reconfiguration breaks the mariadb cluster lookup order while upgrading and restarts all mariadb containers at the same time, causing cluster quorum issues: with both nodes restarting at once, neither can rejoin an existing primary component, which matches the "failed to reach primary view" error above.

TASK [mariadb : include] ***********************************************************************************************************************
included: /root/kolla-ansible/ansible/roles/mariadb/tasks/lookup_cluster.yml for 192.168.100.244, 192.168.100.186

TASK [mariadb : Cleaning up temp file on localhost] ********************************************************************************************
ok: [192.168.100.244 -> localhost]

TASK [mariadb : Creating temp file on localhost] ***********************************************************************************************
ok: [192.168.100.244 -> localhost]

TASK [mariadb : Creating mariadb volume] *******************************************************************************************************
ok: [192.168.100.244]
ok: [192.168.100.186]

TASK [mariadb : Writing hostname of host with existing cluster files to temp file] *************************************************************
ok: [192.168.100.244 -> localhost]
ok: [192.168.100.186 -> localhost]

TASK [mariadb : Registering host from temp file] ***********************************************************************************************
ok: [192.168.100.244]
ok: [192.168.100.186]

TASK [mariadb : Cleaning up temp file on localhost] ********************************************************************************************
ok: [192.168.100.244 -> localhost]

TASK [mariadb : include] ***********************************************************************************************************************
included: /root/kolla-ansible/ansible/roles/mariadb/tasks/start.yml for 192.168.100.244, 192.168.100.186

TASK [mariadb : Starting mariadb container] ****************************************************************************************************
changed: [192.168.100.244]
changed: [192.168.100.186]

summary: - mariadb multinode fail to upgrade ocata->pike
+ optimize reconfiguration breaks mariadb upgrade
summary: - optimize reconfiguration breaks mariadb upgrade
+ optimize reconfiguration breaks multinode mariadb upgrade
Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote : Re: optimize reconfiguration breaks multinode mariadb upgrade

The mariadb optimize reconfiguration patch was not merged yet, so the issue is with the serial removal.
https://review.openstack.org/#/c/433480/

summary: - optimize reconfiguration breaks multinode mariadb upgrade
+ mariadb multinode upgrade broken ocata-pike
Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :
Changed in kolla-ansible:
status: New → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by Eduardo Gonzalez (<email address hidden>) on branch: master
Review: https://review.openstack.org/485217
Reason: Fixed with https://review.openstack.org/#/c/433480/

Revision history for this message
Serge Radinovich (srad015) wrote :

I wiped /etc/ansible and re-installed kolla-ansible to fix this issue.
