juju agent upgrade causes mysqld to stop (part of same systemd cgroup)

Bug #1664025 reported by Nobuto Murata
This bug affects 3 people
Affects                                   Status        Importance  Assigned to  Milestone
Canonical Juju                            Invalid       Undecided   Unassigned
OpenStack Percona Cluster Charm           Fix Released  Critical    James Page
percona-cluster (Juju Charms Collection)  Invalid       Critical    Unassigned
percona-xtradb-cluster-5.6 (Ubuntu)       Confirmed     Undecided   Unassigned

Bug Description

A Juju agent upgrade brings the whole OpenStack deployment down, which is critical.

How to reproduce:
$ juju bootstrap --config agent-version=2.0.2

$ juju deploy ./bundle.yaml

$ juju run --unit mysql/0 'pgrep -af mysqld'
14743 /bin/sh /usr/bin/mysqld_safe --wsrep-new-cluster
15242 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/percona-xtradb-cluster --plugin-dir=/usr/lib/mysql/plugin --user=mysql --wsrep-provider=/usr/lib/libgalera_smm.so --wsrep-new-cluster --log-error=/var/log/mysql/error.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1

(controller model)
$ juju upgrade-juju -m controller --agent-version 2.0.3
(openstack model)
$ juju upgrade-juju --agent-version 2.0.3

$ juju run --unit mysql/0 'pgrep -af mysqld'
-> empty (no mysqld is running)

$ juju status mysql
Model    Controller           Cloud/Region         Version
default  localhost-localhost  localhost/localhost  2.0.3

App    Version      Status  Scale  Charm            Store       Rev  OS      Notes
mysql  5.6.21-25.8  error   1      percona-cluster  jujucharms  241  ubuntu

Unit      Workload  Agent  Machine  Public address  Ports  Message
mysql/0*  error     idle   2        10.0.8.104             hook failed: "config-changed"

Machine  State    DNS         Inst id        Series  AZ
2        started  10.0.8.104  juju-50b253-2  xenial
...

Nobuto Murata (nobuto) wrote :

Somehow a mysqld "Normal shutdown" was triggered twice around the time of the agent upgrade.

In the logs below, 00:37 is local time while 15:37 is UTC.

[/var/log/mysql/error.log]
...
2017-02-12 15:37:47 15242 [Note] /usr/sbin/mysqld: Normal shutdown
...
2017-02-12 15:37:52 15242 [Note] /usr/sbin/mysqld: Shutdown complete
...
2017-02-12 15:38:52 20201 [Note] /usr/sbin/mysqld: Normal shutdown
...
2017-02-12 15:38:56 20201 [Note] /usr/sbin/mysqld: Shutdown complete

[journalctl -u mysql]
Feb 12 15:37:47 juju-50b253-2 systemd[1]: Stopping LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon...
Feb 12 15:37:47 juju-50b253-2 mysql[19583]: * Stopping MySQL (Percona XtraDB Cluster) mysqld
Feb 12 15:37:52 juju-50b253-2 /etc/init.d/mysql[19644]: MySQL PID not found, pid_file detected/guessed: /var/run/mysqld/mysqld.pid
Feb 12 15:37:52 juju-50b253-2 /etc/init.d/mysql[19648]: MySQL PID not found, pid_file detected/guessed: /var/run/mysqld/mysqld.pid
Feb 12 15:37:52 juju-50b253-2 mysql[19583]: ...done.
Feb 12 15:37:52 juju-50b253-2 systemd[1]: Stopped LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon.
Feb 12 15:38:03 juju-50b253-2 systemd[1]: Starting LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon...
Feb 12 15:38:04 juju-50b253-2 mysql[20252]: * Starting MySQL (Percona XtraDB Cluster) database server mysqld
Feb 12 15:38:04 juju-50b253-2 mysql[20252]: ...done.
Feb 12 15:38:04 juju-50b253-2 systemd[1]: Started LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon.

[unit-mysql-0.log]
unit-mysql-0: 00:37:31 INFO juju.worker.upgrader desired tool version: 2.0.2
...
unit-mysql-0: 00:37:44 DEBUG unit.mysql/0.juju-log Leader unit - bootstrap required=True
unit-mysql-0: 00:37:47 INFO unit.mysql/0.juju-log Writing file /etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf root:root 444
unit-mysql-0: 00:37:52 INFO unit.mysql/0.config-changed Unknown operation bootstrap-pxc.
unit-mysql-0: 00:37:52 INFO unit.mysql/0.config-changed * Bootstrapping Percona XtraDB Cluster database server mysqld
unit-mysql-0: 00:38:00 INFO juju.worker.leadership mysql/0 will renew mysql leadership at 2017-02-12 15:38:30.142709216 +0000 UTC
unit-mysql-0: 00:38:03 INFO unit.mysql/0.config-changed ...done.
unit-mysql-0: 00:38:04 DEBUG unit.mysql/0.juju-log Bootstrap PXC Succeeded
...

unit-mysql-0: 00:38:51 INFO juju.worker.upgrader upgrade requested from 2.0.2 to 2.0.3
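
To line up the unit log with the mysqld and journal entries, the local timestamps can be converted to UTC. A small sketch, assuming GNU date and a +0900 offset (inferred from the nine-hour difference between 00:37 and 15:37 noted above; the local date 2017-02-13 is likewise an assumption, and to_utc is only an illustrative helper name):

```shell
# Convert a timestamp that carries an explicit UTC offset to UTC.
# Requires GNU date (-d is not portable to BSD date).
to_utc() {
    date -u -d "$1" '+%Y-%m-%d %H:%M:%S'
}

# Example: the "upgrade requested" line from unit-mysql-0.log
# (the +0900 offset and the date are assumptions, see above).
# to_utc '2017-02-13 00:38:51 +0900'
```

Under those assumptions, the upgrade request at 00:38:51 local time corresponds to 15:38:51 UTC, one second before the second "Normal shutdown" in error.log.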

Nobuto Murata (nobuto) wrote :

OK, this issue is reproducible with just a single percona-cluster unit. An OpenStack deployment is not necessary to reproduce it.

description: updated
Nobuto Murata (nobuto) wrote :

Well, I reproduced it once with a single percona-cluster unit, but not on the second attempt. It might be related to a race condition, so an OpenStack deployment with more relations may be necessary to reproduce it.

Nobuto Murata (nobuto) wrote :

It looks like the Juju unit agent is somehow tied to the mysqld process. That is why an agent upgrade (agent stop/start) causes a clean shutdown of mysql.

$ sudo systemctl status jujud-unit-mysql-0
● jujud-unit-mysql-0.service - juju unit agent for mysql/0
   Loaded: loaded (/var/lib/juju/init/jujud-unit-mysql-0/jujud-unit-mysql-0.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2017-02-13 09:57:30 UTC; 1min 47s ago
 Main PID: 16588 (bash)
    Tasks: 40
   Memory: 1.2G
      CPU: 4.353s
   CGroup: /system.slice/jujud-unit-mysql-0.service
           ├─16588 bash /var/lib/juju/init/jujud-unit-mysql-0/exec-start.sh
           ├─16592 /var/lib/juju/tools/unit-mysql-0/jujud unit --data-dir /var/lib/juju --unit-name mysql/0 --debug
           ├─16918 /bin/sh /usr/bin/mysqld_safe --wsrep-new-cluster
           └─17400 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/percona-xtradb-cluster --plugin-dir=/usr/lib/mysql/plugin --user=mysql --wsrep-provider=/usr/lib/libgalera_smm.so --wsrep

Feb 13 09:57:30 juju-cb5bdd-0 systemd[1]: Started juju unit agent for mysql/0.

$ sudo systemctl stop jujud-unit-mysql-0.service
$ pgrep -af mysqld
-> empty (no mysqld is running)
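
The same check can be made without reading the full systemctl status output, by asking the kernel which cgroup each mysqld process belongs to. A minimal sketch, assuming the /proc/PID/cgroup format of cgroup v1 (name=systemd hierarchy) or v2 (0:: line); cgroup_unit and mysqld_units are illustrative names, not part of Juju or the charm:

```shell
# Read /proc/<pid>/cgroup content on stdin and print the last path
# component of the systemd hierarchy, e.g. "jujud-unit-mysql-0.service".
cgroup_unit() {
    awk -F: '$2 == "name=systemd" || $1 == "0" {
        n = split($3, parts, "/")
        print parts[n]
        exit
    }'
}

# Print "<pid> <unit>" for every running mysqld process.
mysqld_units() {
    for pid in $(pgrep -x mysqld); do
        printf '%s %s\n' "$pid" "$(cgroup_unit < "/proc/$pid/cgroup")"
    done
}
```

On an affected leader, mysqld_units would be expected to show the jujud-unit-* service next to the mysqld PID rather than a unit of its own.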

James Page (james-page) wrote :

Erm, that does not look right to me: the mysql processes are part of the cgroup for the Juju unit agent, so systemd terminates them along with it.

James Page (james-page) wrote :

Confirmed:

$ sudo systemctl status jujud-unit-percona-cluster-0.service
● jujud-unit-percona-cluster-0.service - juju unit agent for percona-cluster/0
   Loaded: loaded (/var/lib/juju/init/jujud-unit-percona-cluster-0/jujud-unit-percona-cluster-0.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2017-02-13 10:44:10 UTC; 6min ago
 Main PID: 17665 (bash)
    Tasks: 39
   Memory: 505.5M
      CPU: 1min 6.129s
   CGroup: /system.slice/jujud-unit-percona-cluster-0.service
           ├─17665 bash /var/lib/juju/init/jujud-unit-percona-cluster-0/exec-start.sh
           ├─17671 /var/lib/juju/tools/unit-percona-cluster-0/jujud unit --data-dir /var/lib/juju --unit-name percona-cluster/0 --debug
           ├─28429 /bin/sh /usr/bin/mysqld_safe --wsrep-new-cluster
           └─28927 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/percona-xtradb-cluster --plugin-dir=/usr/lib/mysql/plugin --user=mysql --wsrep-provider=/usr/lib/libgalera_smm.so --wsrep-new-cluster -

James Page (james-page) wrote :

(with juju 2.1 beta5)

Changed in percona-cluster (Juju Charms Collection):
status: New → Confirmed
James Page (james-page) wrote :

I think this may be to do with the way that we have to bootstrap the PXC cluster.

James Page (james-page) wrote :

The charm has to init the local instance using:

  service mysql bootstrap-pxc

It would appear that the resulting mysqld ends up being tracked under the cgroup for the unit daemon, so it is stopped whenever the unit daemon restarts. As this can happen across all unit daemons in the PXC cluster during a juju upgrade-juju operation, this is probably the root cause.

However, I would expect the follower units to have been started normally, and not be part of the unit cgroup - so only the lead unit should see this effect.
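
For context on why cgroup membership matters: with systemd's default KillMode=control-group, stopping a unit signals every process remaining in that unit's cgroup, not only its main PID. A small sketch for inspecting this, assuming the cgroup v1 layout used on xenial and a unit in system.slice; CGROUP_ROOT and both function names are illustrative assumptions:

```shell
# Print the KillMode systemd applies to a unit (the default is control-group).
unit_kill_mode() {
    systemctl show -p KillMode "$1" | cut -d= -f2
}

# List the PIDs currently tracked in a unit's cgroup, i.e. the processes
# that would be signalled on stop when KillMode is control-group.
# CGROUP_ROOT defaults to the cgroup v1 systemd hierarchy.
unit_pids() {
    cat "${CGROUP_ROOT:-/sys/fs/cgroup/systemd}/system.slice/$1/cgroup.procs"
}
```

For example, unit_pids jujud-unit-mysql-0.service on the lead unit would be expected to include the mysqld_safe and mysqld PIDs shown in the systemctl status output earlier in this thread.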

summary: - juju agent upgrade causes mysqld to stop
+ juju agent upgrade causes mysqld to stop (part of same systemd cgroup)
James Page (james-page) wrote :

FWIW we might expect the service command to ensure that anything started under it is not part of the systemd cgroup for the calling daemon.

Changed in percona-cluster (Juju Charms Collection):
importance: Undecided → Critical
Anastasia (anastasia-macmood) wrote :

I am marking this bug as Incomplete for Juju while James investigates further.

Thank you for your advice and patience!

Changed in juju:
status: New → Incomplete
Nobuto Murata (nobuto) wrote :

> However, I would expect the follower units to have been started normally, and not be part of the unit cgroup - so only the lead unit should see this effect.

Right. A quick workaround for this issue is to reboot the leader unit once the cluster is up and running with multiple units, so that mysqld is spawned outside the process tree of the Juju unit daemon.

I saw this issue at a customer site, where one of three mysqld instances suddenly went down during an agent upgrade.

Sandor Zeestraten (szeestraten) wrote :

Hit the same issue when upgrading Juju from 2.0.3 to 2.1.0.1 with the percona-cluster charm rev 247.

James Page (james-page)
Changed in charm-percona-cluster:
importance: Undecided → Critical
status: New → Confirmed
Changed in percona-cluster (Juju Charms Collection):
status: Confirmed → Invalid
Anastasia (anastasia-macmood) wrote :

@Sandor Zeestraten (szeestraten),
Did the workaround mentioned in comment #17 (rebooting the leader unit after the cluster is up and running) help?

Sandor Zeestraten (szeestraten) wrote :

@anastasia-macmood

Yes, restarting the leader unit seemed to work.

James Page (james-page) wrote :

You can also just do a stop/start of the mysql service; the issue is that the lead unit is bootstrapped using the service command, not systemctl:

   service mysql bootstrap-pxc

As a result, the mysqld process ends up in the wrong cgroup; restarting it using systemctl will correct this.
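
A hedged sketch of that stop/start workaround, to be run on the lead unit once the other cluster members are up. The function name and the DRY_RUN switch are illustrative assumptions; the unit name mysql matches the journalctl -u mysql output earlier in this report:

```shell
# Stop and start a service through systemctl so that its processes are
# tracked in the service's own cgroup instead of the caller's.
# Set DRY_RUN=1 to print the commands instead of running them.
restart_via_systemd() {
    svc=$1
    for action in stop start; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "sudo systemctl $action $svc"
        else
            sudo systemctl "$action" "$svc" || return 1
        fi
    done
}
```

Calling restart_via_systemd mysql and then re-checking systemctl status for the jujud-unit-* service should show whether mysqld has left the agent's cgroup.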

James Page (james-page) wrote :

Really, the bootstrap startup needs to be executed by systemd as well; making that easier to consume requires a bit of a packaging change (the CentOS packages already provide appropriate systemd units for this).

Instead of that, I'm looking to see if we can persuade the service command to place the mysqld processes outside the scope of the jujud-unit-* cgroup.

James Page (james-page)
Changed in charm-percona-cluster:
status: Confirmed → Triaged
milestone: none → 17.05
James Page (james-page) wrote :

Marking the Juju task as invalid - the way that pxc is bootstrapped is the primary cause of this issue.

Changed in juju:
status: Incomplete → Invalid
James Page (james-page) wrote :

I can work around this problem in the charm by using systemd-run to ensure that the bootstrap-pxc mysqld gets its own cgroup, but this does need a broader fix in packaging as well (raised a distro task to cover this).
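
A minimal sketch of what running the bootstrap under systemd-run could look like. The exact command line used in the charm change is not quoted in this report, so the --service-type=forking option (chosen here because the init script daemonizes mysqld), the function name, and the DRY_RUN switch are all assumptions:

```shell
# Run the PXC bootstrap inside a transient systemd unit so that mysqld is
# tracked in that unit's cgroup rather than in the caller's (for example
# the jujud-unit-* service). Set DRY_RUN=1 to print the command only.
bootstrap_pxc_detached() {
    set -- systemd-run --service-type=forking service mysql bootstrap-pxc
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        sudo "$@"
    fi
}
```

systemd-run creates a transient run-*.service unit for the command, so a later stop or restart of the unit agent no longer signals the bootstrapped mysqld.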

Changed in charm-percona-cluster:
status: Triaged → In Progress
assignee: nobody → James Page (james-page)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.openstack.org/438917

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.openstack.org/438917
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=fddc1b78f4251db97129f5246fff92fefb18353f
Submitter: Jenkins
Branch: master

commit fddc1b78f4251db97129f5246fff92fefb18353f
Author: James Page <email address hidden>
Date: Tue Feb 28 11:57:43 2017 +0100

    Ensure bootstrap-pxc mysqld not in unit cgroup

    The bootstrap process for percona-xtradb-cluster requires execution
    of a non-standard init.d scrip target to start the mysqld in wsrep
    new cluster mode.

    The processes started by the operation where ending up in the cgroup
    associated with the Juju unit daemon, which on restart (as a result
    of a upgrade to juju for example) would result in the mysql daemon
    being killed and not restarted.

    Use systemd-run to ensure that the bootstrap-pxc operation ends up
    in a distinct cgroup so that this does not happen.

    Change-Id: Iff998c4c23fcad71cffe9bbee60df7f00d2c9893
    Closes-Bug: 1664025

Changed in charm-percona-cluster:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (stable/17.02)

Fix proposed to branch: stable/17.02
Review: https://review.openstack.org/440278

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (stable/17.02)

Reviewed: https://review.openstack.org/440278
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=189e5377e0b2917cb663eec3611071e5853707b5
Submitter: Jenkins
Branch: stable/17.02

commit 189e5377e0b2917cb663eec3611071e5853707b5
Author: James Page <email address hidden>
Date: Tue Feb 28 11:57:43 2017 +0100

    Ensure bootstrap-pxc mysqld not in unit cgroup

    The bootstrap process for percona-xtradb-cluster requires execution
    of a non-standard init.d scrip target to start the mysqld in wsrep
    new cluster mode.

    The processes started by the operation where ending up in the cgroup
    associated with the Juju unit daemon, which on restart (as a result
    of a upgrade to juju for example) would result in the mysql daemon
    being killed and not restarted.

    Use systemd-run to ensure that the bootstrap-pxc operation ends up
    in a distinct cgroup so that this does not happen.

    Drop capture of pty for bootstrap-pxc

    Use of the '-t' flag to capture the output of the pty results
    in a non-zero return code in later systemd/Ubuntu releases
    (specifically zesty).

    Drop use of this flag for broader compatibility.

    Change-Id: Iff998c4c23fcad71cffe9bbee60df7f00d2c9893
    Closes-Bug: 1664025
    Closes-Bug: 1668833
    (cherry picked from commit fddc1b78f4251db97129f5246fff92fefb18353f)
    (cherry picked from commit 1cae1942d451e0daf9681b2c823643c42565bc33)

James Page (james-page)
Changed in charm-percona-cluster:
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in percona-xtradb-cluster-5.6 (Ubuntu):
status: New → Confirmed