mysql does not start on boot post-upgrade

Bug #1336110 reported by Adam Gandelman on 2014-07-01
Affects: tripleo · Status: Fix Released · Importance: High · Assigned to: Gregory Haynes

Bug Description

Attempting to do image-based overcloud upgrades. The Heat stack is updated and nodes are rebuilt with the REBUILD_PRESERVE_EPHEMERAL rebuild policy. After booting the new image with the existing persistent data, mysql fails to start, breaking the entire upgrade.

140701 00:43:42 mysqld_safe Starting mysqld daemon with databases from /mnt/state/var/lib/mysql/
140701 00:43:42 mysqld_safe WSREP: Running position recovery with --log_error='/mnt/state/var/lib/mysql//wsrep_recovery.BIlzcZ' --pid-file='/mnt/state/var/lib/mysql//overcloud-controller0-7ainlyzwh2up-recover.pid'
140701 00:43:45 mysqld_safe WSREP: Recovered position 9baa9bfc-00b1-11e4-a51e-ffdbb4ea96fd:3853
140701 0:43:45 [Note] WSREP: wsrep_start_position var submitted: '9baa9bfc-00b1-11e4-a51e-ffdbb4ea96fd:3853'
140701 0:43:45 [Note] WSREP: Read nil XID from storage engines, skipping position init
140701 0:43:45 [Note] WSREP: wsrep_load(): loading provider library '/usr/local/mysql/lib/libgalera_smm.so'
140701 0:43:45 [Note] WSREP: wsrep_load(): Galera 2.10(r175) by Codership Oy <email address hidden> loaded successfully.
140701 0:43:45 [Note] WSREP: Found saved state: 9baa9bfc-00b1-11e4-a51e-ffdbb4ea96fd:-1
140701 0:43:45 [Note] WSREP: Reusing existing '/mnt/state/var/lib/mysql//galera.cache'.
140701 0:43:45 [Note] WSREP: Passing config to GCS: base_host = 10.22.167.75; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 1; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /mnt/state/var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /mnt/state/var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.weight = 1; protone
140701 0:43:45 [Note] WSREP: Assign initial position for certification: 3853, protocol version: -1
140701 0:43:45 [Note] WSREP: wsrep_sst_grab()
140701 0:43:45 [Note] WSREP: Start replication
140701 0:43:45 [Note] WSREP: Setting initial position to 9baa9bfc-00b1-11e4-a51e-ffdbb4ea96fd:3853
140701 0:43:45 [Note] WSREP: protonet asio version 0
140701 0:43:45 [Note] WSREP: backend: asio
140701 0:43:45 [Note] WSREP: GMCast version 0
140701 0:43:45 [Note] WSREP: (c1f0d0f7-00b8-11e4-999d-5f3c88f96e7e, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
140701 0:43:45 [Note] WSREP: (c1f0d0f7-00b8-11e4-999d-5f3c88f96e7e, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
140701 0:43:45 [Note] WSREP: EVS version 0
140701 0:43:45 [Note] WSREP: PC version 0
140701 0:43:45 [Note] WSREP: gcomm: connecting to group 'tripleo-tripleo-3zJwBWkT8Q', peer '10.22.167.75:'
140701 0:43:45 [Warning] WSREP: (c1f0d0f7-00b8-11e4-999d-5f3c88f96e7e, 'tcp://0.0.0.0:4567') address 'tcp://10.22.167.75:4567' points to own listening address, blacklisting
140701 0:43:48 [Warning] WSREP: no nodes coming from prim view, prim not possible
140701 0:43:48 [Note] WSREP: view(view_id(NON_PRIM,c1f0d0f7-00b8-11e4-999d-5f3c88f96e7e,1) memb {
        c1f0d0f7-00b8-11e4-999d-5f3c88f96e7e,
} joined {
} left {
} partitioned {
})
140701 0:43:49 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50629S), skipping check
140701 0:44:18 [Note] WSREP: view((empty))
140701 0:44:18 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():141
140701 0:44:18 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -110 (Connection timed out)
140701 0:44:18 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1292: Failed to open channel 'tripleo-tripleo-3zJwBWkT8Q' at 'gcomm://10.22.167.75,': -110 (Connection timed out)
140701 0:44:18 [ERROR] WSREP: gcs connect failed: Connection timed out
140701 0:44:18 [ERROR] WSREP: wsrep::connect() failed: 7
140701 0:44:18 [ERROR] Aborting

140701 0:44:18 [Note] WSREP: Service disconnected.
140701 0:44:19 [Note] WSREP: Some threads may fail to exit.
140701 0:44:19 [Note] /usr/local/mysql/bin/mysqld: Shutdown complete

140701 00:44:19 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

As Greg explained, this is Galera clustering attempting to bootstrap the cluster off of itself. This is in place to enable HA database clustering by default; however, it does nothing for environments with single-node controllers. Updating /etc/mysql/conf.d/cluster.cnf to replace wsrep_cluster_address=gcomm://10.22.167.75, with wsrep_cluster_address=gcomm:// gets the server started once again. This should probably be the default setting when OVERCLOUD_CONTROLSCALE == 1.
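For reference, the workaround described above amounts to the following edit in /etc/mysql/conf.d/cluster.cnf (a minimal sketch; only the wsrep_cluster_address line changes, and any other options in the file are left as-is):

```ini
[mysqld]
# Before: the node tries to join a cluster at its own address and
# times out, since no other node is listening there.
#wsrep_cluster_address=gcomm://10.22.167.75,

# After: an empty gcomm:// address tells Galera to bootstrap a new
# cluster from this node alone -- appropriate for a single-node
# controller (OVERCLOUD_CONTROLSCALE == 1).
wsrep_cluster_address=gcomm://
```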

Changed in tripleo:
importance: Undecided → High
assignee: nobody → Gregory Haynes (greghaynes)
Changed in tripleo:
status: New → Confirmed

Reviewed: https://review.openstack.org/104455
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=75519a3bc8d26afb3f8ce0068d769a0095c66672
Submitter: Jenkins
Branch: master

commit 75519a3bc8d26afb3f8ce0068d769a0095c66672
Author: Gregory Haynes <email address hidden>
Date: Wed Jul 2 22:55:04 2014 -0700

    Allow single node mysql clusters to restart

    We do not allow automatic restarting for mysql clusters due to possible
    data loss issues. This is not a problem for single node clusters.

    Related-Bug: #1336110
    Change-Id: Id41c2fcf9602828b60692d866ee33737e33f87aa

Steven Hardy (shardy) wrote :

It looks like the patch marked Related-Bug may actually have fixed this. Adam/Gregory, can you confirm?

I'll try to test as well to independently validate the fix above.

Gregory Haynes (greghaynes) wrote :

Kind of. This works only for the single-node case; in a multi-node setup we still cannot automatically start Galera due to the lack of a leader-election system.
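To illustrate the gap Gregory describes: without leader election, an operator has to compare the last committed seqno on each node (e.g. from grastate.dat) and bootstrap the most advanced node first. A hedged sketch of that manual selection step; pick_bootstrap_node, the node names, and the seqno values are all hypothetical, not taken from this bug's logs:

```shell
# Hypothetical helper: given "<node> <seqno>" pairs collected from each
# node's grastate.dat, print the node with the highest seqno -- the only
# node it is safe to bootstrap the new cluster from.
pick_bootstrap_node() {
    sort -k2,2 -n | tail -n1 | cut -d' ' -f1
}

# Example input with illustrative seqnos:
printf 'controller0 3853\ncontroller1 3851\ncontroller2 3850\n' \
    | pick_bootstrap_node
# -> controller0
```

The remaining nodes would then join via a gcomm:// address pointing at the bootstrapped node, which is exactly the coordination step that cannot yet be automated here.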

Michele Baldessari (michele) wrote :

In multi-node setups we are most likely using Pacemaker anyway, which will handle the promotion/demotion of the nodes.

Okay to close this one?

Ben Nemec (bnemec) on 2017-06-13
Changed in tripleo:
status: Confirmed → Fix Released