mysql-mmm-monitor declares all nodes dead after long time running

Bug #1050408 reported by Quentin
This bug affects 2 people
Affects      Status   Importance   Assigned to   Milestone
mysql-mmm    New      Undecided    Unassigned

Bug Description

Once in a while the monitoring daemon declares all nodes dead and removes all shared IPs, so none of the virtual IPs are reachable anymore.
Restarting the mysql-mmm-monitor daemon resolves the problem and the vIPs are assigned again. Throughout the incident the MySQL servers were all reachable and MySQL was running properly. I did not restart any of the agents on the database servers.

The monitor log has the following output:

2012/08/08 07:12:30 FATAL State of host 'turnus' changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK)
2012/08/08 07:12:33 FATAL State of host 'nox' changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK)
2012/08/08 07:13:45 FATAL State of host 'nox' changed from HARD_OFFLINE to AWAITING_RECOVERY
2012/08/08 07:13:48 FATAL State of host 'turnus' changed from HARD_OFFLINE to AWAITING_RECOVERY
2012/08/08 07:14:45 FATAL State of host 'nox' changed from AWAITING_RECOVERY to ONLINE because of auto_set_online(60 seconds). It was in state AWAITING_RECOVERY for 60 seconds
2012/08/08 07:14:48 FATAL State of host 'turnus' changed from AWAITING_RECOVERY to ONLINE because of auto_set_online(60 seconds). It was in state AWAITING_RECOVERY for 60 seconds
2012/09/13 07:24:22 FATAL State of host 'turnus' changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK)
2012/09/13 07:24:25 FATAL State of host 'nox' changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK)
2012/09/13 08:02:37 FATAL State of host 'turnus' changed from HARD_OFFLINE to AWAITING_RECOVERY
2012/09/13 08:02:37 FATAL State of host 'nox' changed from HARD_OFFLINE to AWAITING_RECOVERY
2012/09/13 08:03:37 FATAL State of host 'turnus' changed from AWAITING_RECOVERY to ONLINE because of auto_set_online(60 seconds). It was in state AWAITING_RECOVERY for 60 seconds
2012/09/13 08:03:37 FATAL State of host 'nox' changed from AWAITING_RECOVERY to ONLINE because of auto_set_online(60 seconds). It was in state AWAITING_RECOVERY for 60 seconds

Log MySQL node Nox:

2012/09/13 07:24:22 INFO We have some new roles added or old rules deleted!
2012/09/13 07:24:22 INFO Added: reader(XXX.XXX.XXX.XXX)
2012/09/13 07:24:25 INFO We have some new roles added or old rules deleted!
2012/09/13 07:24:25 INFO Deleted: reader(XXX.XXX.XXX.XXX), reader(XXX.XXX.XXX.XXX), writer(XXX.XXX.XXX.XXX)
2012/09/13 08:03:37 INFO We have some new roles added or old rules deleted!
2012/09/13 08:03:37 INFO Added: reader(XXX.XXX.XXX.XXX), writer(XXX.XXX.XXX.XXX)

Log MySQL node Turnus:

2012/08/08 07:12:30 INFO We have some new roles added or old rules deleted!
2012/08/08 07:12:30 INFO Deleted: reader(XXX.XXX.XXX.XXX)
2012/08/08 07:14:48 INFO We have some new roles added or old rules deleted!
2012/08/08 07:14:48 INFO Added: reader(XXX.XXX.XXX.XXX)
2012/09/13 07:24:22 INFO We have some new roles added or old rules deleted!
2012/09/13 07:24:22 INFO Deleted: reader(XXX.XXX.XXX.XXX)
2012/09/13 08:03:37 INFO We have some new roles added or old rules deleted!
2012/09/13 08:03:37 INFO Added: reader(XXX.XXX.XXX.XXX)

The incident started at 7:22 and I resolved it by restarting the monitoring daemon at 8:02.
At that point the daemon had been running for approx. 70 days 12 hours.
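
For anyone comparing their own logs, here is a minimal sketch of how the state-change lines in the monitor log quoted above could be paired up to measure how long each host was treated as offline. It assumes the log format is exactly as shown in this report. The function name `outage_windows` and the regular expression are my own illustration and not part of mysql-mmm.

```python
import re
from datetime import datetime

# Matches monitor state-change lines in the format quoted in this report.
STATE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) FATAL State of host "
    r"'(?P<host>[^']+)' changed from (?P<old>\w+) to (?P<new>\w+)"
)

def outage_windows(lines):
    """Return (host, left_online, back_online) for each completed outage."""
    offline_since = {}
    windows = []
    for line in lines:
        m = STATE_RE.match(line)
        if not m:
            continue  # ignore anything that is not a state-change line
        ts = datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S")
        host, old, new = m["host"], m["old"], m["new"]
        if old == "ONLINE":
            # Host just left ONLINE: remember when the outage began.
            offline_since[host] = ts
        elif new == "ONLINE" and host in offline_since:
            # Host is back ONLINE: close the outage window.
            windows.append((host, offline_since.pop(host), ts))
    return windows

# The 2012/09/13 lines from the monitor log above.
SAMPLE = [
    "2012/09/13 07:24:22 FATAL State of host 'turnus' changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK)",
    "2012/09/13 07:24:25 FATAL State of host 'nox' changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK)",
    "2012/09/13 08:02:37 FATAL State of host 'turnus' changed from HARD_OFFLINE to AWAITING_RECOVERY",
    "2012/09/13 08:02:37 FATAL State of host 'nox' changed from HARD_OFFLINE to AWAITING_RECOVERY",
    "2012/09/13 08:03:37 FATAL State of host 'turnus' changed from AWAITING_RECOVERY to ONLINE because of auto_set_online(60 seconds). It was in state AWAITING_RECOVERY for 60 seconds",
    "2012/09/13 08:03:37 FATAL State of host 'nox' changed from AWAITING_RECOVERY to ONLINE because of auto_set_online(60 seconds). It was in state AWAITING_RECOVERY for 60 seconds",
]

for host, start, end in outage_windows(SAMPLE):
    print(host, int((end - start).total_seconds()), "seconds")
# turnus 2355 seconds
# nox 2352 seconds
```

On the 2012/09/13 lines this gives roughly 39 minutes per host, which is the time until the monitor was restarted plus the 60 seconds of auto_set_online.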

We are running the following versions:

mysql-mmm-2.2.1-1.el5
mysql-mmm-monitor-2.2.1-1.el5
mysql-mmm-agent-2.2.1-1.el5

Quentin (quentin-dg)
description: updated
René Schultz Madsen (p-rm-l) wrote:

I have the same problem with the following versions installed on the monitor host:

mysql-mmm-common 2.2.1-1
mysql-mmm-monitor 2.2.1-1

and the following versions installed on the database hosts:

libdbd-mysql-perl 4.016-1
libmysqlclient16 5.1.63-0+squeeze1
mysql-client-5.1 5.1.63-0+squeeze1
mysql-common 5.1.63-0+squeeze1
mysql-mmm-agent 2.2.1-1
mysql-mmm-common 2.2.1-1
mysql-server 5.1.63-0+squeeze1
mysql-server-5.1 5.1.63-0+squeeze1
mysql-server-core-5.1 5.1.63-0+squeeze1
