galera server doesn't start with systemd

Bug #1634562 reported by Kyle L. Henderson
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack-Ansible
Fix Released
High
Andy McCrae

Bug Description

When deploying multiple controller nodes using 16.04 and stable/newton, the galera servers beyond the first controller fail to start properly.

Chris Martin reported the issue in comment #4 in: https://bugs.launchpad.net/openstack-ansible/+bug/1633472

This problem is repeatable and I've watched the galera container as galera is started on controller 2 and 3. The patch to handlers from bug 1633472 is applied during these test runs.

What appears to be happening is that galera is started but it's not recognized by systemd. It then starts a second instance, which fails and causes the task to be run again. This continues until all retries are exhausted.

Here's a ps capture while the two mysql processes are running:

message+ 5801 0.0 0.0 42892 3364 ? Ss 19:07 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root 10326 0.0 0.0 21248 3284 ? S 19:10 0:00 /bin/bash /usr/bin/mysqld_safe
mysql 10847 0.0 1.7 1229256 71536 ? Dl 19:11 0:01 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --wsrep_provider=/usr
/lib/galera/libgalera_smm.so --log-error=/var/log/mysql_logs/galera_server_error.log --open-files-limit=65535 --pid-file=vm2-galera-container-c0d6ecab.pid --socket=/var/run/mysqld/mysqld.sock
 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1
root 11319 0.0 0.0 4508 756 ? S 19:11 0:00 /bin/sh -c LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 MYSQLD_STARTUP_TIMEOUT=180 LC_MESSAGES=en_US.UTF-8 /usr/bin/python && sleep 0
root 11320 0.0 0.2 32380 10548 ? S 19:11 0:00 /usr/bin/python
root 11321 0.0 0.3 37044 13076 ? S 19:11 0:00 /usr/bin/python /tmp/ansible_5j4Nc1/ansible_module_service.py
root 11334 0.0 0.2 37044 9672 ? S 19:11 0:00 /usr/bin/python /tmp/ansible_5j4Nc1/ansible_module_service.py
root 11335 0.0 0.0 26168 1404 ? S 19:11 0:00 /bin/systemctl start mysql
root 11336 0.0 0.0 21348 3620 ? Ss 19:11 0:00 /bin/bash /etc/init.d/mysql start
root 11362 0.0 0.0 21248 3452 ? S 19:11 0:00 /bin/bash /usr/bin/mysqld_safe
mysql 11632 0.0 14.4 5166456 583384 ? Dl 19:11 0:01 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --wsrep_provider=/usr
/lib/galera/libgalera_smm.so --log-error=/var/log/mysql_logs/galera_server_error.log --open-files-limit=65535 --pid-file=vm2-galera-container-c0d6ecab.pid --socket=/var/run/mysqld/mysqld.sock
 --wsrep_recover --log_error=/var/lib/mysql/wsrep_recovery.wSheQU --pid-file=/var/lib/mysql/vm2-galera-container-c0d6ecab-recover.pid

Changed in openstack-ansible:
assignee: nobody → Alexandra Settle (alexandra-settle)
assignee: Alexandra Settle (alexandra-settle) → nobody
assignee: nobody → Andy McCrae (andrew-mccrae)
importance: Undecided → High
Revision history for this message
Kyle L. Henderson (kyleh) wrote :

I had recreated this issue a couple of time yesterday but couldn't really see what the actual issue was. Suspecting it was some type of timing issue, I recreated it today and bumped up the sleep and delays in /etc/ansible/roles/galera_server/handlers/main.yml:

- name: Restart mysql
  service:
    name: mysql
    state: restarted
    sleep: 20
    pattern: mysql
    args: "{{ (not galera_existing_cluster | bool and inventory_hostname == galera_server_bootstrap_node) or (galera_cluster_members | length == 1) | ternary('--wsrep-new-cluster', '') }}"
  environment:
    MYSQLD_STARTUP_TIMEOUT: 180
  when:
     - not galera_running_and_bootstrapped | bool
  register: galera_restart
  until: galera_restart | success
  retries: 4
  delay: 30
  # notifies are only fired when status is "changed"
  changed_when: galera_restart | failed
  failed_when: false
  notify:
    - "Remove stale .sst"
    - "Restart mysql fall back"

That did the trick and galera started on both controller (#2 and #3) that previously always failed. It does fail the first time but succeeds on the first retry.

Of course, now the question is whether the sleep between the stop/start helped it or was it the delay. I hate bumping up sleeps/delays to fix problems...

Revision history for this message
Kyle L. Henderson (kyleh) wrote :

I reran the test last night. Since galera_server/handlers/main.yml has two sections of code that attempt to restart galera, I increased the sleep time to 20 seconds in the first section and increased the delay in the second section. Again galera started on all the controller successfully but on the 2nd and 3rd nodes it failed to restart in the first section and succeeded on the second attempt in the second section.

RUNNING HANDLER [galera_server : Restart mysql] ********************************
FAILED - RETRYING: HANDLER: galera_server : Restart mysql (3 retries left).
FAILED - RETRYING: HANDLER: galera_server : Restart mysql (2 retries left).
FAILED - RETRYING: HANDLER: galera_server : Restart mysql (1 retries left).
changed: [vm2_galera_container-b7de5808]

RUNNING HANDLER [galera_server : Remove stale .sst] ****************************
changed: [vm2_galera_container-b7de5808]

RUNNING HANDLER [galera_server : Restart mysql fall back] **********************
FAILED - RETRYING: HANDLER: galera_server : Restart mysql fall back (3 retries left).
changed: [vm2_galera_container-b7de5808]

So it seems that the increased sleep time has no affect and it's the delay time that is allowing galera to start successfully.

Revision history for this message
Andy McCrae (andrew-mccrae) wrote :

Hey Kyle,

I'm trying to recreate this. I haven't had an issue on an AIO with affinity: set to 3 for the galera_container.

I'll try a multi-node build to get the same issue, but to confirm you're using head of stable/newton on 16.04.

I used master, but head of stable/newton should be pretty similar.

Revision history for this message
Kyle L. Henderson (kyleh) wrote :

Correct. 16.04, stable/newton. All the controllers are VMs. Hmm...maybe the VMs are a little too lean on memory. They're currently running with 4G of memory. I'll boost that up and see if it makes a difference. CPUs are set to 4.

Revision history for this message
Andrew Garner (abg) wrote :

I was able to reproduce this in a multi-node environment, where the initial SST takes longer than in an AIO environment.

In my environment, initial SST for the 2nd and 3rd Galera instances takes about 50 seconds. The MariaDB initscript defaults to a 30 second timeout.

On Ubuntu 16.04, the initscript is started through an isolated systemd environment (via ansible service module -> systemctl start mysql). This ignores the MYSQLD_STARTUP_TIMEOUT environment variable set in the "Restart mysql" handler. Further, on systemd platforms /etc/default/mysql is not generated by the galera_server role, so the default 30s timeout ends up being used.

Changing "delay" helps avoid this because it gives enough time for the service to startup - so one of the following retries of "/etc/init.d/mysql" is effectively a noop and succeeds as mysqld is already online.

Just dropping the file /etc/default/mysql on ubuntu 16.04 seems to work for me to make startup reliable. That is (among other options), just deleting this one line seems to work:

https://github.com/openstack/openstack-ansible-galera_server/blob/2b610bac6559a30f31670b2330b7c65a721f9bbd/tasks/galera_post_install.yml#L93

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-galera_server (master)

Fix proposed to branch: master
Review: https://review.openstack.org/389066

Changed in openstack-ansible:
status: New → In Progress
Revision history for this message
Andy McCrae (andrew-mccrae) wrote :

@abg, thanks a for the analysis - that makes a lot of sense, and I can see the same change was made for 14.04, but was made to only apply to "upstart" and not "systemd" when 16.04 support was added.

While we were updating the init script correctly to source the /etc/default/mysql file if it existed, we were not actually dropping that file down on systemd systems.

I've posted a fix - Kyle could you test that and let me know if that works for you? We can backport it once confirmed.

Revision history for this message
Chris Martin (6-chris-z) wrote :

In my latest build the first run of setup-infrastructure.yml failed:
```
fatal: [infra03_galera_container-ade20eba]: FAILED! => {"changed": false, "failed": true, "msg": "Job for mysql.service failed because the control process exited with error code. See \"systemctl status mysql.service\" and \"journalctl -xe\" for details.\n"}
```

Applied change 389066 linked above, re-ran setup-infrastructure.yml, playbook completed successfully. So, works for me.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible-galera_server (master)

Reviewed: https://review.openstack.org/389066
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-galera_server/commit/?id=9a8fced0f214734f2de28e8178e06a04ce2235f7
Submitter: Jenkins
Branch: master

commit 9a8fced0f214734f2de28e8178e06a04ce2235f7
Author: Andy McCrae <email address hidden>
Date: Thu Oct 20 09:40:41 2016 +0100

    Fix MYSQLD_STARTUP_TIMEOUT for systemd

    The /etc/default/mysql file needs to be sourced for systemd to start
    mysql with the appropriate MYSQLD_STARTUP_TIMEOUT value.

    We are already adding the line to source the /etc/default/mysql file in
    the init.d/mysql script, but we were skipping dropping the defaults file
    when using "systemd".

    Additionally, this groups the init file change and the default file
    templating to be sequential.

    Change-Id: If5dc328e67bde9d84f40c2b192c217e82e1987eb
    Closes-Bug: #1634562

Changed in openstack-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-galera_server (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/390885

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible-galera_server (stable/newton)

Reviewed: https://review.openstack.org/390885
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-galera_server/commit/?id=ca396469cc48748905b729f408eff3350bc30f46
Submitter: Jenkins
Branch: stable/newton

commit ca396469cc48748905b729f408eff3350bc30f46
Author: Andy McCrae <email address hidden>
Date: Thu Oct 20 09:40:41 2016 +0100

    Fix MYSQLD_STARTUP_TIMEOUT for systemd

    The /etc/default/mysql file needs to be sourced for systemd to start
    mysql with the appropriate MYSQLD_STARTUP_TIMEOUT value.

    We are already adding the line to source the /etc/default/mysql file in
    the init.d/mysql script, but we were skipping dropping the defaults file
    when using "systemd".

    Additionally, this groups the init file change and the default file
    templating to be sequential.

    Change-Id: If5dc328e67bde9d84f40c2b192c217e82e1987eb
    Closes-Bug: #1634562
    (cherry picked from commit 9a8fced0f214734f2de28e8178e06a04ce2235f7)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible-galera_server 14.0.1

This issue was fixed in the openstack/openstack-ansible-galera_server 14.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible-galera_server 15.0.0.0b1

This issue was fixed in the openstack/openstack-ansible-galera_server 15.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.