nrpe: "CRITICAL: mysql.service is not running" on leader/bootstrap unit (active/exited state)

Bug #1685696 reported by Nobuto Murata
This bug affects 4 people
Affects                          Status     Importance  Assigned to  Milestone
NRPE Charm                       Won't Fix  Medium      Unassigned
OpenStack Percona Cluster Charm  Won't Fix  Undecided   Unassigned

Bug Description

$ juju deploy -n 3 percona-cluster
$ juju deploy nagios
$ juju deploy nrpe

$ juju add-relation nagios nrpe
$ juju add-relation nrpe percona-cluster

Nagios reports "CRITICAL: mysql.service is not running" on the leader/bootstrap unit.

Nobuto Murata (nobuto) wrote :

I believe the background in Bug: #1664025 is related.

    ● mysql.service - LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon
       Loaded: loaded (/etc/init.d/mysql; bad; vendor preset: enabled)
       Active: active (exited) since Mon 2017-04-24 04:37:14 UTC; 28min ago
         Docs: man:systemd-sysv-generator(8)
        Tasks: 0
       Memory: 0B
          CPU: 0

    Apr 24 04:37:14 juju-195a01-0 systemd[1]: Starting LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon...
    Apr 24 04:37:14 juju-195a01-0 mysql[16095]: * Starting MySQL (Percona XtraDB Cluster) database server mysqld
    Apr 24 04:37:14 juju-195a01-0 mysql[16095]: ...done.
    Apr 24 04:37:14 juju-195a01-0 systemd[1]: Started LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon.
    Apr 24 04:37:25 juju-195a01-0 systemd[1]: Started LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon.
  UnitId: percona-cluster/0

Nobuto Murata (nobuto) wrote :

A simple workaround would be rebooting the leader unit after the cluster is up and running.

Sandor Zeestraten (szeestraten) wrote :

Ran into the same issue.

ubuntu:~$ juju status mysql
Model      Controller  Cloud/Region  Version
openstack  prodcont1   prodmaas      2.1.2

App         Version       Status   Scale  Charm            Store       Rev  OS      Notes
mysql       5.6.34-26.19  active   3      percona-cluster  jujucharms  250  ubuntu
mysql-nrpe                unknown  3      nrpe             jujucharms  21   ubuntu

tags: added: foundations-engine
James Page (james-page) wrote :

This happens because the lead unit does not get started using the normal start target provided by the init.d scripts (which are wrapped by systemd normally).

The lead unit is started using the bootstrap-pxc command directly initiated from the init.d script, rather than via the systemctl command.

James Page (james-page) wrote :

The systemd mysql unit records "active (exited)" rather than "active (running)" because systemd does not track the process when the bootstrap path is followed.

James Page (james-page) wrote :

I'd really rather avoid having to restart percona-cluster just to get the monitoring right; maybe looking at the 'ActiveState' rather than the 'SubState' might make sense here - the unit is active, but is in 'exited' SubState, which is what is causing the problem with the nrpe check.
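
A minimal sketch of the ActiveState idea, not the NRPE charm's actual plugin code: it parses the `KEY=VALUE` output of `systemctl show -p ActiveState -p SubState <unit>` and treats the unit as healthy whenever ActiveState is `active`. The function names and the use of Nagios-style return codes are assumptions made for illustration.

```python
import subprocess

# Nagios plugin exit codes (standard convention).
OK, CRITICAL = 0, 2


def parse_show_output(text):
    """Turn `systemctl show` KEY=VALUE lines into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def evaluate(unit, props):
    """Return (exit_code, message) based on ActiveState only."""
    active = props.get("ActiveState", "unknown")
    sub = props.get("SubState", "unknown")
    if active == "active":
        return OK, "OK: %s is %s (%s)" % (unit, active, sub)
    return CRITICAL, "CRITICAL: %s is %s (%s)" % (unit, active, sub)


def check_unit(unit):
    """Query systemd for the unit and evaluate it (hypothetical helper)."""
    out = subprocess.check_output(
        ["systemctl", "show", "-p", "ActiveState", "-p", "SubState", unit],
        universal_newlines=True,
    )
    return evaluate(unit, parse_show_output(out))
```

With this logic the bootstrap unit in `active (exited)` would report OK, while `failed` or `inactive` units would still be CRITICAL. The trade-off is that ActiveState alone no longer says whether systemd is tracking a live process.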

Changed in charm-percona-cluster:
status: New → Won't Fix
James Page (james-page) wrote :

Raised a task on the NRPE charm and marked the PXC task as won't fix; I don't think we can resolve this in the percona-cluster charm purely due to the way that pxc has to be bootstrapped.

summary: nrpe: "CRITICAL: mysql.service is not running" on leader/bootstrap unit
+ (active/exited state)
Nick Moffitt (nick-moffitt) wrote :

For future travellers, the fix is likely line 23 of https://git.launchpad.net/nrpe-charm/tree/files/plugins/check_systemd.py and the tests after it (string will be "active" rather than "running" when we use the ActiveState). This is worth testing on multiple installations with a hand-rolled copy to make sure it behaves as we expect in production.
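
Before touching the plugin, it may help to compare how the two strategies classify the states seen in this report. The sketch below is not taken from check_systemd.py; both check functions and the sample list are hypothetical stand-ins for "compare SubState to running" and "compare ActiveState to active".

```python
def substate_check(active_state, sub_state):
    """Behaviour described in this thread: require SubState 'running'."""
    return sub_state == "running"


def activestate_check(active_state, sub_state):
    """Proposed behaviour: require ActiveState 'active'."""
    return active_state == "active"


# (ActiveState, SubState) pairs; the first two appear in this report.
SAMPLES = [
    ("active", "running"),  # unit started normally via systemd
    ("active", "exited"),   # bootstrap unit started outside systemctl
    ("failed", "failed"),   # crashed unit
    ("inactive", "dead"),   # stopped unit
]


def compare(samples):
    """Return rows of (state pair, substate verdict, activestate verdict)."""
    return [
        ((a, s), substate_check(a, s), activestate_check(a, s))
        for a, s in samples
    ]


if __name__ == "__main__":
    for (a, s), old, new in compare(SAMPLES):
        print("%-8s %-8s substate=%s activestate=%s" % (a, s, old, new))
```

Only the `active (exited)` row differs between the two checks, which is the case this bug is about.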

Changed in nrpe-charm:
importance: Undecided → Medium
status: New → Triaged
Deepa (dpaclt) wrote :

Will this be fixed in the next release of nrpe, i.e. nrpe#42?

Seyeong Kim (seyeongkim) wrote :

I followed this reproducer but can't reproduce the issue (or are more conditions needed?).

Is this issue still there?

Nobuto Murata (nobuto) wrote :

Indeed, it's not reproducible with the current latest charm. Revision 266 seems to make the difference.

$ for i in 267 266 265; do
    juju add-model xenial-$i
    juju deploy --series xenial percona-cluster-$i -n 3
    juju config percona-cluster min-cluster-size=3
done

$ for i in 267 266 265; do echo percona-cluster-$i; juju run -m xenial-$i --application percona-cluster -- systemctl status mysql | grep Active:; done
percona-cluster-267
       Active: active (running) since Mon 2018-07-09 16:31:36 UTC; 17min ago
       Active: active (running) since Mon 2018-07-09 16:32:03 UTC; 16min ago
       Active: active (running) since Mon 2018-07-09 16:32:36 UTC; 16min ago
percona-cluster-266
       Active: active (running) since Mon 2018-07-09 16:32:03 UTC; 16min ago
       Active: active (running) since Mon 2018-07-09 16:32:30 UTC; 16min ago
       Active: active (running) since Mon 2018-07-09 16:33:09 UTC; 15min ago
percona-cluster-265
       Active: active (exited) since Mon 2018-07-09 16:28:44 UTC; 20min ago <<<<<<<<<<
       Active: active (running) since Mon 2018-07-09 16:31:08 UTC; 17min ago
       Active: active (running) since Mon 2018-07-09 16:31:49 UTC; 17min ago

https://api.jujucharms.com/charmstore/v5/percona-cluster-266/archive/repo-info
https://api.jujucharms.com/charmstore/v5/percona-cluster-265/archive/repo-info
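
When comparing many units, the `Active:` lines above can be classified mechanically instead of by eye. This is a rough sketch under the assumption that each line follows the `Active: <state> (<substate>) since ...` layout shown in this report; the function names are made up for illustration.

```python
import re

# Matches e.g. "Active: active (exited) since Mon 2018-07-09 16:28:44 UTC; 20min ago"
ACTIVE_RE = re.compile(r"^\s*Active:\s+(\S+)\s+\(([^)]+)\)")


def parse_active_line(line):
    """Return (active_state, sub_state), or None if the line does not match."""
    match = ACTIVE_RE.match(line)
    return match.groups() if match else None


def find_exited(lines):
    """Return the indexes of lines reporting 'active (exited)'."""
    hits = []
    for index, line in enumerate(lines):
        if parse_active_line(line) == ("active", "exited"):
            hits.append(index)
    return hits
```

Feeding it the three lines printed for percona-cluster-265 would flag only the first one, the bootstrap unit.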

Nobuto Murata (nobuto) wrote :

Commit 801c2e78294f7af0a058aca036b6d066e3f53a98 in the percona-cluster charm also fixes this issue, because it restarts mysqld after bootstrapping.

$ git bisect log
git bisect start '--term-new=fixed' '--term-old=unfixed'
# fixed: [78d1df2d50d94bae829c33767bc0e99a7e2afcf1] Set server_id when using binlogs
git bisect fixed 78d1df2d50d94bae829c33767bc0e99a7e2afcf1
# unfixed: [bf84218e31a832a04fc15385997cba8d2ec4eed6] Merge "Sync charm-helpers"
git bisect unfixed bf84218e31a832a04fc15385997cba8d2ec4eed6
# fixed: [19e1fce3db6d4e4b0cdd4946020838d3055f5522] Update tox.ini to stop using unverified package
git bisect fixed 19e1fce3db6d4e4b0cdd4946020838d3055f5522
# unfixed: [9cac8b85211c01fa7cb5637568bd63dfd3b21844] Merge "Add support for PXC 5.7 and xtrabackup 2.4"
git bisect unfixed 9cac8b85211c01fa7cb5637568bd63dfd3b21844
# fixed: [801c2e78294f7af0a058aca036b6d066e3f53a98] Redesign cluster buildup process
git bisect fixed 801c2e78294f7af0a058aca036b6d066e3f53a98
# first fixed commit: [801c2e78294f7af0a058aca036b6d066e3f53a98] Redesign cluster buildup process

Reviewed: https://review.openstack.org/555494
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=801c2e78294f7af0a058aca036b6d066e3f53a98
Submitter: Zuul
Branch: master

commit 801c2e78294f7af0a058aca036b6d066e3f53a98
Author: David Ames <email address hidden>
Date: Mon Mar 19 11:25:56 2018 -0700

    Redesign cluster buildup process

    In order to fix bug#1756928 the whole cluster buildup process needed to
    be redesigned. The assumptions about what is_bootstrapped and clustered
    meant and when to restart on configuration changed needed to be
    re-evaluated.

    The timing of restarts needed to be protected to avoid collisions.
    Only bootstrapped hosts should go in to the
    wsrep_cluster_address=gcomm:// setting. Adding or removing units should
    be handled gracefully. Starting with a single unit and expanding to a
    cluster must work.

    This change guarantees mysqld is restarted when the configuration file
    changes and meets all the above requirements. As a consequence of the redesign,
    the workload status now more accurately reflects the state of the unit.

    Charm-helpers sync to bring in distributed_wait fix.

    Closes-Bug: #1756308
    Closes-Bug: #1756928
    Change-Id: I0742e6889b32201806cec6a0b5835e11a8027567

Seyeong Kim (seyeongkim) wrote :

@Nobuto

Cool, thanks for checking it.

@James

Not sure we need to fix nrpe as well (checking ActiveState rather than SubState).

Eric Chen (eric-chen)
Changed in charm-nrpe:
status: Triaged → Won't Fix