mmm-monitor don't pass to online state

Bug #458907 reported by ufaria
This bug affects 2 people
Affects: mysql-mmm
Status: Fix Released
Importance: Undecided
Assigned to: Pascal Hofmann

Bug Description

When a node dies (for example via "reboot -nf"), and after its restart I set it back to online status, the node sometimes changes automatically to "replication delay" status, even though the cluster is doing nothing and all checks are OK.

The only "workaround" is to restart mmm-monitor so that it checks the node again, finds it OK and moves it back to online status.

I am running Debian 5.0.3; the Perl modules are all from the distribution.

HERE I KILL THE NODE:

2009/10/22 21:03:08 FATAL Can't reach agent on host 'rkdb2'
2009/10/22 21:03:10 ERROR Check 'ping' on 'rkdb2' has failed for 11 seconds! Message: ERROR: Could not ping 192.168.71.181
2009/10/22 21:03:11 FATAL State of host 'rkdb2' changed from ONLINE to HARD_OFFLINE (ping: not OK, mysql: OK)
2009/10/22 21:03:11 ERROR Can't send offline status notification to 'rkdb2' - killing it!
2009/10/22 21:03:11 FATAL Could not kill host 'rkdb2' - there may be some duplicate ips now! (There's no binary configured for killing hosts.)
2009/10/22 21:03:15 ERROR Check 'mysql' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:03:15 ERROR Check 'rep_backlog' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:03:16 ERROR Check 'rep_threads' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:03:35 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:03:43 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:03:56 ERROR Check 'rep_backlog' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:04:03 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:04:23 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:04:44 ERROR Check 'rep_threads' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:04:52 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:05:11 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:05:17 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (111)
2009/10/22 21:05:23 FATAL Agent on host 'rkdb2' is reachable again
2009/10/22 21:05:23 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (111)

HERE THE NODE COMES BACK (AWAITING_RECOVERY):

2009/10/22 21:05:53 FATAL State of host 'rkdb2' changed from HARD_OFFLINE to AWAITING_RECOVERY

HERE I CHANGE IT TO ONLINE:

2009/10/22 21:06:01 FATAL Admin changed state of 'rkdb2' from AWAITING_RECOVERY to ONLINE

AND THEN IT PASSES TO REPLICATION_DELAY:

2009/10/22 21:06:02 FATAL State of host 'rkdb2' changed from ONLINE to REPLICATION_DELAY

NOW I RESTART THE MONITOR:

2009/10/22 21:06:31 WARN No binary found for killing hosts (/usr/bin/mysql-mmm//monitor/kill_host).
2009/10/22 21:06:31 WARN auto_increment_offset should be different on both masters (rkdb1: 1 , rkdb2: 1)
2009/10/22 21:06:31 WARN rkdb1: auto_increment_increment (1) should be >= 2
2009/10/22 21:06:31 WARN rkdb2: auto_increment_increment (1) should be >= 2
2009/10/22 21:06:41 WARN No binary found for killing hosts (/usr/bin/mysql-mmm//monitor/kill_host).
2009/10/22 21:06:41 WARN auto_increment_offset should be different on both masters (rkdb1: 1 , rkdb2: 1)
2009/10/22 21:06:41 WARN rkdb1: auto_increment_increment (1) should be >= 2
2009/10/22 21:06:41 WARN rkdb2: auto_increment_increment (1) should be >= 2

AND IT PASSES TO ONLINE BY ITSELF:

2009/10/22 21:06:45 FATAL State of host 'rkdb2' changed from REPLICATION_DELAY to ONLINE

Revision history for this message
Walter Heck (walterheck) wrote : Re: [Bug 458907] [NEW] mmm-monitor don't pass to online state

I've seen this happen. Have you waited more than just a few seconds?
It usually comes back up after a minute or so in all cases I have seen.

Walter


Revision history for this message
ufaria (ufaria-prisacom) wrote :

Yes, I've waited quite a few minutes and tried to change it to online
several times. The only workaround is to restart the monitor.


Revision history for this message
Walter Heck (walterheck) wrote :

Could you run in DEBUG mode and add the output here? Just put a 'debug 1'
line in your configs and restart them. I'd be curious to see what's up.

Walter
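
For reference, a minimal sketch of where such a 'debug 1' line might go, for example at the top of the monitor config. The file name, paths and surrounding options below are only illustrative of a typical mysql-mmm 2.x layout, not taken from this setup:

    # /etc/mysql-mmm/mmm_mon.conf - enable verbose debug output
    debug 1

    include mmm_common.conf

    <monitor>
        ip          127.0.0.1
        pid_path    /var/run/mmmd_mon.pid
        status_path /var/lib/misc/mmmd_mon.status
        ping_ips    192.168.71.180, 192.168.71.181
    </monitor>

The same 'debug 1' line can also be added to the agent configs if debug output from the agents is wanted.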


Revision history for this message
ufaria (ufaria-prisacom) wrote :

Where is the debug directive supposed to go? I put it in the common
file, but the logs are the same.


Revision history for this message
Walter Heck (walterheck) wrote :

After you do that, restart the agents and the monitor whose config you
changed. They should then start printing debug messages to the console.

Walter
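
Assuming the init scripts that ship with the 2.x packages (script names and locations may differ for a tarball install, so treat this only as a sketch):

    # on each database node
    /etc/init.d/mysql-mmm-agent restart

    # on the monitoring host
    /etc/init.d/mysql-mmm-monitor restart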


Revision history for this message
Sid Stuart (sidstuart) wrote :

For what it is worth, we have seen this problem as well, though I did not take the time to document it as well as Mr. Ufaria.

Revision history for this message
Pascal Hofmann (pascalhofmann) wrote :

I'll have a look at it.

Revision history for this message
Pascal Hofmann (pascalhofmann) wrote :

Please run in debug mode, like Walter said - I can't reproduce the problem here.

Changed in mysql-mmm:
status: New → Incomplete
Revision history for this message
ufaria (ufaria-prisacom) wrote : Re: [Bug 458907] Re: mmm-monitor don't pass to online state

 I did put "debug 1" in the mmm_common file but doesn't write debug info.
 Even more. I have this log files:

  mmmd_mon.error
  mmmd_mon.fatal
  mmmd_mon.info
  mmmd_mon.warn

 But the error messages appears in all files, the warn logs appears in all, etc.
 seems that the log code isn't ok.

 I did install the tar.gz version of 2.0.9-1 on debian 5.0. ¿can be any incompatibility?


Revision history for this message
Pascal Hofmann (pascalhofmann) wrote :

Debug messages should appear on the screen if you restart mmmd_mon. They won't get written to any log file.

The logging code is ok:
- All messages with log level fatal or higher go to mmmd_mon.fatal
- All messages with log level error or higher go to mmmd_mon.error
- All messages with log level warn or higher go to mmmd_mon.warn
- All messages with log level info or higher go to mmmd_mon.info
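
In other words, the four files are layered by appender threshold, so a single FATAL message legitimately shows up in all four of them. A rough Log::Log4perl-style sketch of that layout (appender names and file names here are illustrative, not the shipped configuration):

    # Each appender has its own Threshold, so every file receives its level and above:
    #   mmmd_mon.info -> INFO+, mmmd_mon.warn -> WARN+, mmmd_mon.error -> ERROR+, mmmd_mon.fatal -> FATAL
    log4perl.logger = INFO, FileInfo, FileWarn, FileError, FileFatal

    log4perl.appender.FileInfo           = Log::Log4perl::Appender::File
    log4perl.appender.FileInfo.Threshold = INFO
    log4perl.appender.FileInfo.filename  = mmmd_mon.info
    log4perl.appender.FileInfo.layout    = Log::Log4perl::Layout::SimpleLayout

    log4perl.appender.FileFatal           = Log::Log4perl::Appender::File
    log4perl.appender.FileFatal.Threshold = FATAL
    log4perl.appender.FileFatal.filename  = mmmd_mon.fatal
    log4perl.appender.FileFatal.layout    = Log::Log4perl::Layout::SimpleLayout

    # FileWarn and FileError would be defined the same way, with Thresholds WARN and ERROR.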

Revision history for this message
ufaria (ufaria-prisacom) wrote :

OK, I was mistaken - now I do get a debug log.
I tried to reproduce it by doing "reboot -nf" on a slave several times, with no result,
but it worked on the first try with the writer role.

I am sending you the log. I killed the writer node "rkdb1".


Revision history for this message
Pascal Hofmann (pascalhofmann) wrote :

The problem is that mmmd_mon switches the host to online while replication isn't working correctly. It should wait until everything is fine again, at least as long as there is a master up and running.

Changed in mysql-mmm:
status: Incomplete → Confirmed
status: Confirmed → In Progress
assignee: nobody → Pascal Hofmann (pascalhofmann)
Changed in mysql-mmm:
status: In Progress → Fix Committed
Revision history for this message
ufaria (ufaria-prisacom) wrote :

But I waited several minutes and then connected to the nodes to verify the slave status
on both servers, and they were OK. The node did not pass to online status until I restarted the monitor.


Revision history for this message
Pascal Hofmann (pascalhofmann) wrote :

If you take a look at the log file, the check rep_threads was not OK for this host, but it was put into state AWAITING_RECOVERY nevertheless. This behaviour is fixed in the repository.

Revision history for this message
ufaria (ufaria-prisacom) wrote :

In the debug log I see that this check changes to the OK state before
I put the node in online state, but the node is still in AWAITING_RECOVERY,
and I don't know whether the monitor takes this check into account while
the node is in that state.


Revision history for this message
Pascal Hofmann (pascalhofmann) wrote :

(I'm talking about debug_mmm_mon.txt; there rep_threads does not change to OK for host rkdb1.)
The monitor does not take the checks rep_threads and rep_backlog into account when switching from HARD_OFFLINE to AWAITING_RECOVERY in version 2.0.9 - but it will in the next release. Then the problem should be gone.
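
A rough Perl sketch of the kind of guard being described, i.e. requiring the replication checks to pass before a host may leave HARD_OFFLINE while a master is running. The subroutine and the $checks accessor are hypothetical illustrations, not the actual patch:

    # Hypothetical helper: decide whether a HARD_OFFLINE host may move to AWAITING_RECOVERY.
    sub may_leave_hard_offline {
        my ($host, $checks, $active_master) = @_;

        # ping and mysql must be OK in any case
        return 0 unless $checks->ping($host) && $checks->mysql($host);

        # with no master up there is nothing to replicate from, so don't
        # block recovery on the replication checks
        return 1 unless defined $active_master;

        # otherwise require healthy replication before AWAITING_RECOVERY
        return $checks->rep_threads($host) && $checks->rep_backlog($host);
    }

With a guard like this the host stays HARD_OFFLINE until rep_threads (and rep_backlog) are OK again, instead of being promoted and then bouncing to REPLICATION_DELAY as in the log above.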

Revision history for this message
ufaria (ufaria-prisacom) wrote :

I saw this:

2009/10/27 17:07:22 ERROR Check 'rep_threads' on 'rkdb1' has failed for 10 seconds! Message: ERROR: Replication is broken
2009/10/27 17:07:22 DEBUG Pinging checker 'mysql'...
2009/10/27 17:07:22 DEBUG Checker 'mysql' is OK (OK: Pong!)
2009/10/27 17:07:22 DEBUG Pinging checker 'mysql'...
2009/10/27 17:07:22 DEBUG Checker 'mysql' is OK (OK: Pong!)
2009/10/27 17:07:22 DEBUG Pinging checker 'ping_ip'...
2009/10/27 17:07:22 DEBUG Checker 'ping_ip' is OK (OK: Pong!)
2009/10/27 17:07:23 DEBUG Pinging checker 'ping_ip'...
2009/10/27 17:07:23 DEBUG Checker 'ping_ip' is OK (OK: Pong!)
2009/10/27 17:07:23 DEBUG Pinging checker 'ping'...
2009/10/27 17:07:23 DEBUG Checker 'ping' is OK (OK: Pong!)
2009/10/27 17:07:23 DEBUG Pinging checker 'ping'...
2009/10/27 17:07:23 DEBUG Checker 'ping' is OK (OK: Pong!)
2009/10/27 17:07:24 DEBUG Sending command 'SET_STATUS(ONLINE,
reader(192.168.71.183),reader(192.168.71.184),writer(192.168.71.182), rkdb2)' to rkdb2 (192.168.71.181:9989)
2009/10/27 17:07:24 DEBUG Received Answer: OK: Status applied successfully!|UP:338.45
2009/10/27 17:07:24 DEBUG Sending command 'SET_STATUS(AWAITING_RECOVERY, , rkdb2)' to rkdb1 (192.168.71.180:9989)
2009/10/27 17:07:24 DEBUG Received Answer: OK: Status applied successfully!|UP:123.61
2009/10/27 17:07:24 DEBUG Listener: Connect!
2009/10/27 17:07:24 DEBUG Sending command 'SET_STATUS(ONLINE,
reader(192.168.71.183),reader(192.168.71.184),writer(192.168.71.182), rkdb2)' to rkdb2 (192.168.71.181:9989)
2009/10/27 17:07:24 DEBUG Listener: Disconnect!
2009/10/27 17:07:24 DEBUG Listener: Waiting for connection...
2009/10/27 17:07:24 DEBUG Received Answer: OK: Status applied successfully!|UP:338.87
2009/10/27 17:07:24 DEBUG Sending command 'SET_STATUS(AWAITING_RECOVERY, , rkdb2)' to rkdb1 (192.168.71.180:9989)
2009/10/27 17:07:24 DEBUG Received Answer: OK: Status applied successfully!|UP:124.02
2009/10/27 17:07:24 DEBUG Pinging checker 'ping_ip'...
2009/10/27 17:07:24 DEBUG Checker 'ping_ip' is OK (OK: Pong!)
2009/10/27 17:07:25 DEBUG Pinging checker 'ping_ip'...
2009/10/27 17:07:25 DEBUG Checker 'ping_ip' is OK (OK: Pong!)
2009/10/27 17:07:26 DEBUG Pinging checker 'ping_ip'...
2009/10/27 17:07:26 DEBUG Checker 'ping_ip' is OK (OK: Pong!)
2009/10/27 17:07:27 DEBUG Sending command 'SET_STATUS(ONLINE,
reader(192.168.71.183),reader(192.168.71.184),writer(192.168.71.182), rkdb2)' to rkdb2 (192.168.71.181:9989)
2009/10/27 17:07:27 DEBUG Received Answer: OK: Status applied successfully!|UP:341.45
2009/10/27 17:07:27 DEBUG Sending command 'SET_STATUS(AWAITING_RECOVERY, , rkdb2)' to rkdb1 (192.168.71.180:9989)
2009/10/27 17:07:27 DEBUG Received Answer: OK: Status applied successfully!|UP:126.61
2009/10/27 17:07:27 DEBUG Pinging checker 'rep_backlog'...
2009/10/27 17:07:27 DEBUG Checker 'rep_backlog' is OK (OK: Pong!)
2009/10/27 17:07:27 DEBUG Pinging checker 'rep_backlog'...
2009/10/27 17:07:27 DEBUG Checker 'rep_backlog' is OK (OK: Pong!)
2009/10/27 17:07:27 DEBUG Pinging checker 'rep_threads'...
2009/10/27 17:07:27 DEBUG Checker 'rep_threads' is OK (OK: Pong!)
2009/10/27 17:07:27 DEBUG Pinging checker 'rep_threads'...
2009/10/27 17:07:27 DEBUG Checker 'rep_threads' is OK (OK: Pong!)
20...


Revision history for this message
Pascal Hofmann (pascalhofmann) wrote :

The line "2009/10/27 17:07:32 DEBUG Checker 'rep_threads' is OK (OK: Pong!)" is just a message indicating that the process performing the 'rep_threads' check is alive. It does not mean that rep_threads is OK on any particular host.

Changed in mysql-mmm:
status: Fix Committed → Fix Released