galera server doesn't start with systemd
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack-Ansible | Fix Released | High | Andy McCrae |
Bug Description
When deploying multiple controller nodes using 16.04 and stable/newton, the galera servers beyond the first controller fail to start properly.
Chris Martin reported the issue in comment #4 in: https:/
This problem is repeatable and I've watched the galera container as galera is started on controller 2 and 3. The patch to handlers from bug 1633472 is applied during these test runs.
What appears to be happening is that galera is started but systemd does not recognize it as running. systemd then starts a second instance, which fails and causes the task to be run again. This continues until all retries are exhausted.
Here's a ps capture while the two mysql processes are running:
message+ 5801 0.0 0.0 42892 3364 ? Ss 19:07 0:00 /usr/bin/
root 10326 0.0 0.0 21248 3284 ? S 19:10 0:00 /bin/bash /usr/bin/
mysql 10847 0.0 1.7 1229256 71536 ? Dl 19:11 0:01 /usr/sbin/mysqld --basedir=/usr --datadir=
/lib/galera/
--wsrep_
root 11319 0.0 0.0 4508 756 ? S 19:11 0:00 /bin/sh -c LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 MYSQLD_
root 11320 0.0 0.2 32380 10548 ? S 19:11 0:00 /usr/bin/python
root 11321 0.0 0.3 37044 13076 ? S 19:11 0:00 /usr/bin/python /tmp/ansible_
root 11334 0.0 0.2 37044 9672 ? S 19:11 0:00 /usr/bin/python /tmp/ansible_
root 11335 0.0 0.0 26168 1404 ? S 19:11 0:00 /bin/systemctl start mysql
root 11336 0.0 0.0 21348 3620 ? Ss 19:11 0:00 /bin/bash /etc/init.d/mysql start
root 11362 0.0 0.0 21248 3452 ? S 19:11 0:00 /bin/bash /usr/bin/
mysql 11632 0.0 14.4 5166456 583384 ? Dl 19:11 0:01 /usr/sbin/mysqld --basedir=/usr --datadir=
/lib/galera/
--wsrep_recover --log_error=
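To make the duplicate-instance condition easier to spot, here is a minimal Python sketch that scans `ps aux`-style output for more than one mysqld process. The function name `find_mysqld_pids` and the `SAMPLE` text are hypothetical; the sample lines are copied from the capture above, including their truncated arguments. It assumes the standard 11-column `ps aux` layout.

```python
import re

# Lines copied from the capture above (arguments are truncated in the report).
SAMPLE = """\
mysql 10847 0.0 1.7 1229256 71536 ? Dl 19:11 0:01 /usr/sbin/mysqld --basedir=/usr --datadir=
root 11335 0.0 0.0 26168 1404 ? S 19:11 0:00 /bin/systemctl start mysql
root 11336 0.0 0.0 21348 3620 ? Ss 19:11 0:00 /bin/bash /etc/init.d/mysql start
mysql 11632 0.0 14.4 5166456 583384 ? Dl 19:11 0:01 /usr/sbin/mysqld --basedir=/usr --datadir=
"""


def find_mysqld_pids(ps_output):
    """Return the PIDs of processes whose command is a mysqld binary.

    Assumes `ps aux` layout: 10 whitespace-separated columns followed by
    the full command line as the 11th.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # header fragments or wrapped lines
        command = fields[10]
        # Match ".../mysqld" exactly, not wrappers such as mysqld_safe.
        if re.match(r"\S*/mysqld(\s|$)", command):
            pids.append(int(fields[1]))
    return pids


pids = find_mysqld_pids(SAMPLE)
if len(pids) > 1:
    print("more than one mysqld running:", pids)
```

On a live container the input would come from something like `subprocess.check_output(["ps", "aux"])` decoded to text, rather than the pasted sample.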
Changed in openstack-ansible: | |
assignee: | nobody → Alexandra Settle (alexandra-settle) |
assignee: | Alexandra Settle (alexandra-settle) → nobody |
assignee: | nobody → Andy McCrae (andrew-mccrae) |
importance: | Undecided → High |
I had recreated this issue a couple of times yesterday but couldn't really see what the actual issue was. Suspecting some kind of timing issue, I recreated it today and bumped up the sleep and delays in /etc/ansible/roles/galera_server/handlers/main.yml:
- name: Restart mysql
  service:
    name: mysql
    state: restarted
    sleep: 20
    pattern: mysql
    args: "{{ (not galera_existing_cluster | bool and inventory_hostname == galera_server_bootstrap_node) or (galera_cluster_members | length == 1) | ternary('--wsrep-new-cluster', '') }}"
  environment:
    MYSQLD_STARTUP_TIMEOUT: 180
  when:
    - not galera_running_and_bootstrapped | bool
  register: galera_restart
  until: galera_restart | success
  retries: 4
  delay: 30
  # notifies are only fired when status is "changed"
  changed_when: galera_restart | failed
  failed_when: false
  notify:
    - "Remove stale .sst"
    - "Restart mysql fall back"
That did the trick and galera started on both controllers (#2 and #3) that had previously always failed. It still fails on the first attempt but succeeds on the first retry.
Of course, now the question is whether it was the sleep between the stop and start that helped, or the delay between retries. I hate bumping up sleeps/delays to fix problems...
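One alternative to a fixed sleep is to poll a readiness check until it passes or a deadline expires. The Python sketch below is only an illustration of that idea, not what the role does: `wait_until` and `unit_is_active` are hypothetical helpers, and whether `systemctl is-active` is a reliable readiness signal for galera in this setup is an assumption, since the problem here is precisely that systemd does not seem to track the running process.

```python
import subprocess
import time


def unit_is_active(unit):
    """Return True when `systemctl is-active --quiet <unit>` exits with 0."""
    return subprocess.call(["systemctl", "is-active", "--quiet", unit]) == 0


def wait_until(check, timeout, interval, sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns True.

    Returns True as soon as the check passes, or False once `timeout`
    seconds have elapsed without success. `sleep` and `clock` are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example (not executed here): wait up to 180 s, checking every 5 s.
# ok = wait_until(lambda: unit_is_active("mysql"), timeout=180, interval=5)
```

Compared with a larger `sleep` or `delay`, a poll like this returns as soon as the service is ready and makes the upper bound on waiting explicit, which might also help separate the two timing effects.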