mmm-monitor doesn't pass node to online state

Bug #458907 reported by ufaria on 2009-10-23
This bug affects 2 people
Affects: mysql-mmm
Importance: Undecided
Assigned to: Pascal Hofmann

Bug Description

When a node dies (for example after "reboot -nf"), restarts, and is then set to online status, the node
sometimes changes automatically to "replication delay" status, even though the cluster is idle and
all checks are OK.

The only workaround is to restart mmm-monitor so that it checks the node again, finds it OK, and
passes it to online status.

I am running Debian 5.0.3; the Perl modules are all from the distribution.

HERE I KILL THE NODE:

2009/10/22 21:03:08 FATAL Can't reach agent on host 'rkdb2'
2009/10/22 21:03:10 ERROR Check 'ping' on 'rkdb2' has failed for 11 seconds! Message: ERROR: Could not ping 192.168.71.181
2009/10/22 21:03:11 FATAL State of host 'rkdb2' changed from ONLINE to HARD_OFFLINE (ping: not OK, mysql: OK)
2009/10/22 21:03:11 ERROR Can't send offline status notification to 'rkdb2' - killing it!
2009/10/22 21:03:11 FATAL Could not kill host 'rkdb2' - there may be some duplicate ips now! (There's no binary configured for killing hosts.)
2009/10/22 21:03:15 ERROR Check 'mysql' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:03:15 ERROR Check 'rep_backlog' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:03:16 ERROR Check 'rep_threads' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:03:35 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:03:43 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:03:56 ERROR Check 'rep_backlog' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:04:03 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:04:23 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:04:44 ERROR Check 'rep_threads' on 'rkdb2' has failed for 14 seconds! Message: ERROR: Timeout
2009/10/22 21:04:52 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:05:11 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (113)
2009/10/22 21:05:17 WARN Check 'rep_backlog' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (111)
2009/10/22 21:05:23 FATAL Agent on host 'rkdb2' is reachable again
2009/10/22 21:05:23 WARN Check 'rep_threads' on 'rkdb2' is in unknown state! Message: UNKNOWN: Connect error (host = 192.168.71.181:3306, user = mmm_monitor)! Can't connect to MySQL server on '192.168.71.181' (111)

HERE THE NODE IS ONLINE:

2009/10/22 21:05:53 FATAL State of host 'rkdb2' changed from HARD_OFFLINE to AWAITING_RECOVERY

HERE I CHANGE IT TO ONLINE:

2009/10/22 21:06:01 FATAL Admin changed state of 'rkdb2' from AWAITING_RECOVERY to ONLINE

AND THEN IT PASSES TO REPLICATION DELAY:

2009/10/22 21:06:02 FATAL State of host 'rkdb2' changed from ONLINE to REPLICATION_DELAY

NOW I RESTART THE MONITOR:

2009/10/22 21:06:31 WARN No binary found for killing hosts (/usr/bin/mysql-mmm//monitor/kill_host).
2009/10/22 21:06:31 WARN auto_increment_offset should be different on both masters (rkdb1: 1 , rkdb2: 1)
2009/10/22 21:06:31 WARN rkdb1: auto_increment_increment (1) should be >= 2
2009/10/22 21:06:31 WARN rkdb2: auto_increment_increment (1) should be >= 2
2009/10/22 21:06:41 WARN No binary found for killing hosts (/usr/bin/mysql-mmm//monitor/kill_host).
2009/10/22 21:06:41 WARN auto_increment_offset should be different on both masters (rkdb1: 1 , rkdb2: 1)
2009/10/22 21:06:41 WARN rkdb1: auto_increment_increment (1) should be >= 2
2009/10/22 21:06:41 WARN rkdb2: auto_increment_increment (1) should be >= 2

AND IT PASSES BY ITSELF TO ONLINE:

2009/10/22 21:06:45 FATAL State of host 'rkdb2' changed from REPLICATION_DELAY to ONLINE
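
The state changes scattered through the excerpts above can be pulled into a single timeline. The following is only a sketch in Python, not part of mysql-mmm: the function name `transitions` and the regular expression are assumptions derived from the two message shapes that appear in this report ("State of host ... changed from X to Y" and "Admin changed state of ... from X to Y").

```python
import re

# Matches only the two transition messages seen in this report.
# The pattern is an assumption based on these log lines, not on the mmm source.
TRANSITION = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) FATAL "
    r"(?P<actor>State of host|Admin changed state of) '(?P<host>[^']+)' "
    r"(?:changed )?from (?P<old>[A-Z_]+) to (?P<new>[A-Z_]+)"
)


def transitions(lines):
    """Return (timestamp, host, old_state, new_state, changed_by) tuples."""
    found = []
    for line in lines:
        match = TRANSITION.match(line)
        if match:
            changed_by = "admin" if match["actor"].startswith("Admin") else "monitor"
            found.append(
                (match["ts"], match["host"], match["old"], match["new"], changed_by)
            )
    return found
```

Fed the monitor log from this report, it would list the ONLINE to HARD_OFFLINE change, the move to AWAITING_RECOVERY, the manual change to ONLINE, and the unexpected REPLICATION_DELAY one second later.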

Walter Heck (walterheck) wrote :

I've seen this happen. Have you waited more than just a few seconds?
It usually comes back up after a minute or so in all the cases I have
seen.

Walter


ufaria (ufaria-prisacom) wrote :

Yes, I waited many minutes and tried to change the node to online
several times. The only workaround is to restart the monitor.


Walter Heck (walterheck) wrote :

Could you run in debug mode and add the output here? Just put a 'debug 1'
line in your configs and restart the daemons. I'd be curious to see what's
up.

Walter


ufaria (ufaria-prisacom) wrote :

Where is the debug directive supposed to be placed? I put it in
the common file, but the logs are the same.


Walter Heck (walterheck) wrote :

After you do that, restart the agents and the monitor whose config you
changed. They should then start outputting debug messages to the
console.

Walter


Sid Stuart (sidstuart) wrote :

For what it is worth, we have seen this problem as well, though I did not take the time to document it as well as Mr. Ufaria.

Pascal Hofmann (pascalhofmann) wrote :

I'll have a look at it.

Pascal Hofmann (pascalhofmann) wrote :

Please run in debug mode, like Walter said - can't reproduce the problem here.

Changed in mysql-mmm:
status: New → Incomplete

ufaria (ufaria-prisacom) wrote :

 I put "debug 1" in the mmm_common file, but no debug info is written.
 What's more, I have these log files:

  mmmd_mon.error
  mmmd_mon.fatal
  mmmd_mon.info
  mmmd_mon.warn

 But the error messages appear in all the files, the warn messages appear in all of them, and so on.
 It seems that the logging code isn't OK.

 I installed the tar.gz version of 2.0.9-1 on Debian 5.0. Could there be an incompatibility?

--
--------------------------
    Uxio Faria Giraldez
--------------------------
    SISTEMAS
    PRISACOM S.A.
    +34 913 537 770
==========================

Pascal Hofmann (pascalhofmann) wrote :

Debug messages should appear on the screen if you restart mmmd_mon. They won't get written to any log file.

The logging code is ok:
- All messages with log level fatal or higher go to mmmd_mon.fatal
- All messages with log level error or higher go to mmmd_mon.error
- All messages with log level warn or higher go to mmmd_mon.warn
- All messages with log level info or higher go to mmmd_mon.info
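
The routing described in the list above can be sketched as a threshold per file. This is only an illustration in Python, not the mmmd_mon logging code: the severity order and the names `LEVELS`, `FILE_THRESHOLDS` and `files_for` are assumptions made for the example.

```python
# Severity order assumed for illustration, lowest to highest.
LEVELS = ["debug", "info", "warn", "error", "fatal"]

# Each log file and the minimum level it accepts, as listed above.
FILE_THRESHOLDS = {
    "mmmd_mon.info": "info",
    "mmmd_mon.warn": "warn",
    "mmmd_mon.error": "error",
    "mmmd_mon.fatal": "fatal",
}


def files_for(level):
    """Return the sorted names of the files that receive a message of this level."""
    rank = LEVELS.index(level)
    return sorted(
        name
        for name, minimum in FILE_THRESHOLDS.items()
        if LEVELS.index(minimum) <= rank
    )
```

Under these assumptions an error message lands in mmmd_mon.error, mmmd_mon.warn and mmmd_mon.info but not in mmmd_mon.fatal, and a debug message lands in no file at all, which matches debug output appearing only on the screen.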

ufaria (ufaria-prisacom) wrote :

 OK, I was mistaken; now I have the debug log.
 I tried to reproduce the problem by running "reboot -nf" on a slave several times, without success,
 but it reproduced on the first try on the node holding the writer role.

 I am sending you the log. I killed the writer node "rkdb1".


Pascal Hofmann (pascalhofmann) wrote :

The problem is that mmmd_mon switches the host to online while replication isn't working correctly. It should wait until everything is fine again, at least if there is a master up and running.

Changed in mysql-mmm:
status: Incomplete → Confirmed
status: Confirmed → In Progress
assignee: nobody → Pascal Hofmann (pascalhofmann)
Changed in mysql-mmm:
status: In Progress → Fix Committed

ufaria (ufaria-prisacom) wrote :

But I waited several minutes and then connected to the nodes to verify the slave status
on both servers, and both were OK. The node did not pass to online status until I restarted the monitor.


Pascal Hofmann (pascalhofmann) wrote :

If you take a look at the log file, the check rep_threads was not OK for this host, but it was put into state AWAITING_RECOVERY nevertheless. This behaviour is fixed in the repository.

ufaria (ufaria-prisacom) wrote :

In the debug log I see that this check changes to the OK state before
I put the node in online state, but the node is still in AWAITING_RECOVERY
at that point, and I don't know whether the monitor takes this check into
account in that state.


Pascal Hofmann (pascalhofmann) wrote :

(I'm talking about debug_mmm_mon.txt; there, rep_threads does not change to OK for host rkdb1.)
In version 2.0.9 the monitor does not take the checks rep_threads and rep_backlog into account when switching from HARD_OFFLINE to AWAITING_RECOVERY, but it will in the next release. Then the problem should be gone.
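
The difference between the two behaviours can be sketched as a small decision function. This is a hypothetical illustration in Python, not the monitor's actual code: the function name `may_leave_hard_offline`, its parameter, and the assumption that ping and mysql are the checks considered in both versions are all made up for the example.

```python
# Checks assumed (for illustration only) to be considered in every version.
BASIC_CHECKS = ("ping", "mysql")
# Checks that, per the comment above, 2.0.9 ignores for this transition.
REPLICATION_CHECKS = ("rep_threads", "rep_backlog")


def may_leave_hard_offline(checks, require_replication):
    """Decide whether a host may move from HARD_OFFLINE to AWAITING_RECOVERY.

    checks maps a check name to True when that check is OK for the host.
    require_replication=False models the behaviour described for 2.0.9,
    require_replication=True the behaviour described for the next release.
    """
    names = BASIC_CHECKS + (REPLICATION_CHECKS if require_replication else ())
    # A check with no recorded result is treated as not OK.
    return all(checks.get(name, False) for name in names)


# State resembling the report: host reachable again, replication still broken.
rkdb1 = {"ping": True, "mysql": True, "rep_threads": False, "rep_backlog": True}
old_behaviour = may_leave_hard_offline(rkdb1, require_replication=False)
new_behaviour = may_leave_hard_offline(rkdb1, require_replication=True)
```

With these inputs the first call allows the transition and the second holds the host in HARD_OFFLINE until rep_threads is OK as well.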

ufaria (ufaria-prisacom) wrote :

I saw this:

2009/10/27 17:07:22 ERROR Check 'rep_threads' on 'rkdb1' has failed for 10 seconds! Message: ERROR: Replication is broken
2009/10/27 17:07:22 DEBUG Pinging checker 'mysql'...
2009/10/27 17:07:22 DEBUG Checker 'mysql' is OK (OK: Pong!)
2009/10/27 17:07:22 DEBUG Pinging checker 'mysql'...
2009/10/27 17:07:22 DEBUG Checker 'mysql' is OK (OK: Pong!)
2009/10/27 17:07:22 DEBUG Pinging checker 'ping_ip'...
2009/10/27 17:07:22 DEBUG Checker 'ping_ip' is OK (OK: Pong!)
2009/10/27 17:07:23 DEBUG Pinging checker 'ping_ip'...
2009/10/27 17:07:23 DEBUG Checker 'ping_ip' is OK (OK: Pong!)
2009/10/27 17:07:23 DEBUG Pinging checker 'ping'...
2009/10/27 17:07:23 DEBUG Checker 'ping' is OK (OK: Pong!)
2009/10/27 17:07:23 DEBUG Pinging checker 'ping'...
2009/10/27 17:07:23 DEBUG Checker 'ping' is OK (OK: Pong!)
2009/10/27 17:07:24 DEBUG Sending command 'SET_STATUS(ONLINE, reader(192.168.71.183),reader(192.168.71.184),writer(192.168.71.182), rkdb2)' to rkdb2 (192.168.71.181:9989)
2009/10/27 17:07:24 DEBUG Received Answer: OK: Status applied successfully!|UP:338.45
2009/10/27 17:07:24 DEBUG Sending command 'SET_STATUS(AWAITING_RECOVERY, , rkdb2)' to rkdb1 (192.168.71.180:9989)
2009/10/27 17:07:24 DEBUG Received Answer: OK: Status applied successfully!|UP:123.61
2009/10/27 17:07:24 DEBUG Listener: Connect!
2009/10/27 17:07:24 DEBUG Sending command 'SET_STATUS(ONLINE, reader(192.168.71.183),reader(192.168.71.184),writer(192.168.71.182), rkdb2)' to rkdb2 (192.168.71.181:9989)
2009/10/27 17:07:24 DEBUG Listener: Disconnect!
2009/10/27 17:07:24 DEBUG Listener: Waiting for connection...
2009/10/27 17:07:24 DEBUG Received Answer: OK: Status applied successfully!|UP:338.87
2009/10/27 17:07:24 DEBUG Sending command 'SET_STATUS(AWAITING_RECOVERY, , rkdb2)' to rkdb1 (192.168.71.180:9989)
2009/10/27 17:07:24 DEBUG Received Answer: OK: Status applied successfully!|UP:124.02
2009/10/27 17:07:24 DEBUG Pinging checker 'ping_ip'...
2009/10/27 17:07:24 DEBUG Checker 'ping_ip' is OK (OK: Pong!)
2009/10/27 17:07:25 DEBUG Pinging checker 'ping_ip'...
2009/10/27 17:07:25 DEBUG Checker 'ping_ip' is OK (OK: Pong!)
2009/10/27 17:07:26 DEBUG Pinging checker 'ping_ip'...
2009/10/27 17:07:26 DEBUG Checker 'ping_ip' is OK (OK: Pong!)
2009/10/27 17:07:27 DEBUG Sending command 'SET_STATUS(ONLINE, reader(192.168.71.183),reader(192.168.71.184),writer(192.168.71.182), rkdb2)' to rkdb2 (192.168.71.181:9989)
2009/10/27 17:07:27 DEBUG Received Answer: OK: Status applied successfully!|UP:341.45
2009/10/27 17:07:27 DEBUG Sending command 'SET_STATUS(AWAITING_RECOVERY, , rkdb2)' to rkdb1 (192.168.71.180:9989)
2009/10/27 17:07:27 DEBUG Received Answer: OK: Status applied successfully!|UP:126.61
2009/10/27 17:07:27 DEBUG Pinging checker 'rep_backlog'...
2009/10/27 17:07:27 DEBUG Checker 'rep_backlog' is OK (OK: Pong!)
2009/10/27 17:07:27 DEBUG Pinging checker 'rep_backlog'...
2009/10/27 17:07:27 DEBUG Checker 'rep_backlog' is OK (OK: Pong!)
2009/10/27 17:07:27 DEBUG Pinging checker 'rep_threads'...
2009/10/27 17:07:27 DEBUG Checker 'rep_threads' is OK (OK: Pong!)
2009/10/27 17:07:27 DEBUG Pinging checker 'rep_threads'...
2009/10/27 17:07:27 DEBUG Checker 'rep_threads' is OK (OK: Pong!)
20...

Pascal Hofmann (pascalhofmann) wrote :

2009/10/27 17:07:32 DEBUG Checker 'rep_threads' is OK (OK: Pong!)
This is just a message indicating that the process that runs the 'rep_threads' check is OK. It does not indicate that rep_threads is OK on any or all hosts.
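
When reading a debug log, the two kinds of lines can be told apart mechanically. The sketch below is in Python and is not part of mysql-mmm: the function name `classify` and both patterns are assumptions based only on the message shapes quoted in this report.

```python
import re

# "Checker 'x' is OK": the checker process answered the ping.
CHECKER_ALIVE = re.compile(r"DEBUG Checker '(?P<check>\w+)' is OK")
# "Check 'x' on 'host' ...": an actual result of the check for one host.
HOST_RESULT = re.compile(
    r"Check '(?P<check>\w+)' on '(?P<host>\w+)' (?:has failed|is in unknown state)"
)


def classify(line):
    """Return (kind, check, host) for the line, or None if it is neither kind."""
    match = HOST_RESULT.search(line)
    if match:
        return ("host_result", match["check"], match["host"])
    match = CHECKER_ALIVE.search(line)
    if match:
        return ("checker_alive", match["check"], None)
    return None
```

Only lines classified as `host_result` say anything about replication on a particular host; `checker_alive` lines carry no host at all.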

Changed in mysql-mmm:
status: Fix Committed → Fix Released