mmm-monitor doesn't pass node to online state
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
mysql-mmm | Fix Released | Undecided | Pascal Hofmann |
Bug Description
When a node dies (for example after "reboot -nf"), restarts, and is then set back to online status, it sometimes
changes automatically to "replication delay" status, even though the cluster is doing nothing and
all checks are OK.
The only workaround is to restart mmm-monitor, which makes it run the checks again, find the node OK, and
pass it to online status.
I am running Debian 5.0.3; the Perl modules are all from the distribution.
HERE I KILL THE NODE:
2009/10/22 21:03:08 FATAL Can't reach agent on host 'rkdb2'
2009/10/22 21:03:10 ERROR Check 'ping' on 'rkdb2' has failed for 11 seconds! Message: ERROR: Could not ping 192.168.71.181
2009/10/22 21:03:11 FATAL State of host 'rkdb2' changed from ONLINE to HARD_OFFLINE (ping: not OK, mysql: OK)
2009/10/22 21:03:11 ERROR Can't send offline status notification to 'rkdb2' - killing it!
2009/10/22 21:03:11 FATAL Could not kill host 'rkdb2' - there may be some duplicate ips now! (There's no binary configured for killing hosts.)
2009/10/22 21:03:15 ERROR Check 'mysql' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:03:15 ERROR Check 'rep_backlog' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:03:16 ERROR Check 'rep_threads' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:03:35 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.
2009/10/22 21:03:43 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.
2009/10/22 21:03:56 ERROR Check 'rep_backlog' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:04:03 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.
2009/10/22 21:04:23 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.
2009/10/22 21:04:44 ERROR Check 'rep_threads' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:04:52 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.
2009/10/22 21:05:11 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.
2009/10/22 21:05:17 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.
2009/10/22 21:05:23 FATAL Agent on host 'rkdb2' is reachable again
2009/10/22 21:05:23 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.
HERE THE NODE IS ONLINE:
2009/10/22 21:05:53 FATAL State of host 'rkdb2' changed from HARD_OFFLINE to AWAITING_RECOVERY
HERE I CHANGE IT TO ONLINE:
2009/10/22 21:06:01 FATAL Admin changed state of 'rkdb2' from AWAITING_RECOVERY to ONLINE
AND THEN IT PASSES TO REPLICATION DELAY:
2009/10/22 21:06:02 FATAL State of host 'rkdb2' changed from ONLINE to REPLICATION_DELAY
NOW I RESTART THE MONITOR:
2009/10/22 21:06:31 WARN No binary found for killing hosts (/usr/bin/
2009/10/22 21:06:31 WARN auto_increment_
2009/10/22 21:06:31 WARN rkdb1: auto_increment_
2009/10/22 21:06:31 WARN rkdb2: auto_increment_
2009/10/22 21:06:41 WARN No binary found for killing hosts (/usr/bin/
2009/10/22 21:06:41 WARN auto_increment_
2009/10/22 21:06:41 WARN rkdb1: auto_increment_
2009/10/22 21:06:41 WARN rkdb2: auto_increment_
AND IT PASSES BY ITSELF TO ONLINE:
2009/10/22 21:06:45 FATAL State of host 'rkdb2' changed from REPLICATION_DELAY to ONLINE
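To make the sequence above easier to follow, here is a minimal sketch that pulls only the state transitions out of a monitor log and shows how long the node sat in each state. It assumes the two message formats seen in the log above ("State of host ... changed from X to Y" and "Admin changed state of ... from X to Y"); the names `TRANSITION`, `Transition`, and `parse_transitions` are made up for this illustration and are not part of mysql-mmm.

```python
import re
from collections import namedtuple
from datetime import datetime

# Hypothetical helper, not part of mysql-mmm: matches the two transition
# message formats that appear in the log excerpt above.
TRANSITION = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \w+ "
    r"(?P<who>Admin changed state|State) of (?:host )?'(?P<host>[^']+)' "
    r"(?:changed )?from (?P<old>[A-Z_]+) to (?P<new>[A-Z_]+)"
)

Transition = namedtuple("Transition", "when host old new by_admin")


def parse_transitions(lines):
    """Return the state transitions found in an iterable of log lines."""
    found = []
    for line in lines:
        m = TRANSITION.match(line)
        if m:
            found.append(Transition(
                when=datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S"),
                host=m["host"],
                old=m["old"],
                new=m["new"],
                by_admin=(m["who"] != "State"),
            ))
    return found


# Lines copied from the log above; non-transition lines are simply skipped.
sample = [
    "2009/10/22 21:05:23 FATAL Agent on host 'rkdb2' is reachable again",
    "2009/10/22 21:05:53 FATAL State of host 'rkdb2' changed from HARD_OFFLINE to AWAITING_RECOVERY",
    "2009/10/22 21:06:01 FATAL Admin changed state of 'rkdb2' from AWAITING_RECOVERY to ONLINE",
    "2009/10/22 21:06:02 FATAL State of host 'rkdb2' changed from ONLINE to REPLICATION_DELAY",
    "2009/10/22 21:06:45 FATAL State of host 'rkdb2' changed from REPLICATION_DELAY to ONLINE",
]

transitions = parse_transitions(sample)
for prev, cur in zip(transitions, transitions[1:]):
    held = (cur.when - prev.when).total_seconds()
    print(f"{prev.host}: {prev.new} for {held:.0f}s, then {cur.new}")
```

On the excerpt above this shows the node leaving ONLINE one second after the admin set it, and staying in REPLICATION_DELAY until after the monitor restart.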
Changed in mysql-mmm:
status: New → Incomplete
Changed in mysql-mmm:
status: Incomplete → Confirmed
status: Confirmed → In Progress
assignee: nobody → Pascal Hofmann (pascalhofmann)
Changed in mysql-mmm:
status: In Progress → Fix Committed
Changed in mysql-mmm:
status: Fix Committed → Fix Released
I've seen this happen. Have you waited more than just a few seconds?
It usually comes back up after 1 minute or so in all the cases I have
seen.
Walter
On Fri, Oct 23, 2009 at 17:25, ufaria <email address hidden> wrote:
> Public bug reported:
>
> when a node dies like "reboot -nf" after restart and pass the node to online status, sometimes
> the node change automatically to "replication delay" status but the cluster is doing nothing. and
> all checks are ok.
>
> the only "workaround" is restart mmm-monitor to make that checks again the node ok and
> pass it to online status.
>
> i have debian 5.0.3, the perl modules are all of the distribution.
>
> HERE I KILL THE NODE:
>
> 2009/10/22 21:03:08 FATAL Can't reach agent on host 'rkdb2'
> 2009/10/22 21:03:10 ERROR Check 'ping' on 'rkdb2' has failed for 11 seconds! Message: ERROR: Could not ping 192.168.71.181
> 2009/10/22 21:03:11 FATAL State of host 'rkdb2' changed from ONLINE to HARD_OFFLINE (ping: not OK, mysql: OK)
> 2009/10/22 21:03:11 ERROR Can't send offline status notification to 'rkdb2' - killing it!
> 2009/10/22 21:03:11 FATAL Could not kill host 'rkdb2' - there may be some duplicate ips now! (There's no binary configured for killing hosts.)
> 2009/10/22 21:03:15 ERROR Check 'mysql' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
> 2009/10/22 21:03:15 ERROR Check 'rep_backlog' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
> 2009/10/22 21:03:16 ERROR Check 'rep_threads' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
> 2009/10/22 21:03:35 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
> 2009/10/22 21:03:43 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
> 2009/10/22 21:03:56 ERROR Check 'rep_backlog' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
> 2009/10/22 21:04:03 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
> 2009/10/22 21:04:23 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
> 2009/10/22 21:04:44 ERROR Check 'rep_threads' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
> 2009/10/22 21:04:52 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
> 2009/10/22 21:05:11 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
> 2009/10/22 21:05:17 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.1...