Watchdog intermittently returns "Host Unreachable" errors on should-be-OK pings

Bug #67868 reported by A. Karl Kornel
Affects: watchdog (Debian) | Status: Fix Released | Importance: Unknown
Affects: watchdog (Ubuntu) | Status: Fix Released | Importance: Undecided | Assigned to: Unassigned

Bug Description

I have watchdog configured to ping 10.1.16.1 (the gateway address), and to check for traffic on eth0. I have the watchdog init script set to run watchdog with the '-q' and '-v' switches, so I can see what's going on without my machine being forcibly restarted.

Anywhere from 30% to 50% of the time, watchdog's pings come back as "Host unreachable". However, this information seems to be inaccurate: if I run the `ping` command in a separate window, I can see 0% packet loss on pings to 10.1.16.1, even while watchdog is reporting problems.

Johnie Ingram (johnie) wrote : wuff wuff

forwarded 32547 <email address hidden>

David Härdeman (david-2gen) wrote : Status update?

Hi,

What's the status of this bug?

I can still see this behaviour (5 years later), with a recent (2.6.8.1)
kernel and hardware watchdog (i8xx-tco).

As soon as I enable ping in the config file, watchdog sends a message
within one minute (approx) with the text:

Message from watchdog:
The system will be rebooted because of error 101!

And then proceeds to reboot (I set it to ping the IP of the machine
itself).

Regards,
David Härdeman
david@2gen.com

Michael Meskes (meskes) wrote : Re: Bug#32547: Status update?

On Fri, Aug 20, 2004 at 10:28:12PM +0200, David Härdeman wrote:
> What's the status of this bug?

Still the same I'm afraid.

> I can still see this behaviour (5 years later), with a recent (2.6.8.1)
> kernel and hardware watchdog (i8xx-tco).

Actually I haven't been able to reproduce the bug for all of those 5
years. I once hoped it would disappear as soon as I finished my work for
version 6.0, but real-world workload hasn't let me spend a minute on
this for at least the last year.

I'd take any help I can get.

Michael
--
Michael Meskes
Email: Michael at Fam-Meskes dot De
ICQ: 179140304, AIM/Yahoo: michaelmeskes, Jabber: <email address hidden>
Go SF 49ers! Go Rhein Fire! Use Debian GNU/Linux! Use PostgreSQL!

David Härdeman (david-2gen) wrote :

On Wed, Oct 20, 2004 at 08:02:04AM +0200, Michael Meskes wrote:
>Actually I haven't been able to reproduce the bug for all of those 5
>years. I once hoped it would disappear as soon as I finished my work for
>version 6.0, but real-world workload hasn't let me spend a minute on
>this for at least the last year.
>
>I'd take any help I can get.

I'll try to look into this next week when I have some time to spare.
I guess that it will be somewhat easier if the bug is reproducible
(which it is for me).

I'll get back later when I (hopefully) have some more info. Would be
cool to be able to assist in fixing a bug of this vintage.

//David

David Härdeman (david-2gen) wrote : Worksforme

Hello again,

It turns out that I was mistaken: I thought that I could reproduce the
bug, but it was actually a misconfiguration on my side. Sorry for
that...

I wonder if the original reporter is still able to reproduce the bug?

//David

Michael Meskes (meskes) wrote : Re: Bug#32547: Status update?

On Fri, Aug 20, 2004 at 10:28:12PM +0200, David Härdeman wrote:
> I can still see this behaviour (5 years later), with a recent (2.6.8.1)
> kernel and hardware watchdog (i8xx-tco).

Do you still see the bug? Do you have any time to debug it a little bit?
I still cannot reproduce it.

Michael

David Härdeman (david-2gen) wrote :

On Wed, Feb 09, 2005 at 07:22:31PM +0100, Michael Meskes wrote:
>On Fri, Aug 20, 2004 at 10:28:12PM +0200, David Härdeman wrote:
>> I can still see this behaviour (5 years later), with a recent (2.6.8.1)
>> kernel and hardware watchdog (i8xx-tco).
>
>Do you still see the bug? Do you have any time to debug it a little bit?
>I still cannot reproduce it.
>
>Michael

I'm sorry, as I said later in that bug, I cannot reproduce it. It turned
out that my problems were misconfigurations (such as the machine being
set up to ping itself while not replying to pings, overly restrictive
firewalling, etc.).

This is also indicated by the error message which I got from watchdog
(101 = ENETUNREACH = Network is unreachable). Now, the bug reporter got
106, which is probably EISCONN = Transport endpoint is already
connected, which I've only seen either when calling connect() twice on a
socket, or when using sendto() with a destination address on an
already connected socket.
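
For readers following along, the two codes can be checked against errno.h rather than memorized. The sketch below is only an illustration: the helper names watchdog_errno_name and describe_errno are made up here and are not part of watchdog, and the numeric values 101 and 106 are the Linux ones on common architectures; other platforms may number these errors differently.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the symbolic name for the two error codes
 * discussed in this thread, or "other" for anything else. Comparing
 * against the errno.h macros avoids hard-coding platform-specific
 * numeric values. */
const char *watchdog_errno_name(int err)
{
    if (err == ENETUNREACH)
        return "ENETUNREACH";
    if (err == EISCONN)
        return "EISCONN";
    return "other";
}

/* Hypothetical helper: print the code, its symbolic name, and the
 * C library's description of it. */
void describe_errno(int err)
{
    printf("%d = %s = %s\n", err, watchdog_errno_name(err), strerror(err));
}
```

Calling describe_errno(ENETUNREACH) and describe_errno(EISCONN) on the affected machine shows which numeric code maps to which error there.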

I've been through the latest sources trying to find something fishy that
would cause the problems reported. The only thing I found which looks a
bit weird is this:

watchdog.c, line 562, setsockopt uses hold as an argument, but hold has
not been initialized, which might mean that broadcast packets are in
fact not enabled, right?

That shouldn't account for the 106 error though. Maybe the original bug
has been fixed since it was reported?
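
A minimal sketch of what initializing the option value would look like, assuming the call in question is meant to enable SO_BROADCAST. The function name enable_broadcast is hypothetical and this is not the actual code at watchdog.c line 562; it only illustrates that the integer passed to setsockopt should be set before the call.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: enable SO_BROADCAST on sock.
 * The option value is explicitly set to 1 before the call. Passing an
 * uninitialized int would hand the kernel whatever value happened to be
 * in that variable, so the option might or might not end up enabled.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int enable_broadcast(int sock)
{
    int hold = 1; /* nonzero means "enable" */

    return setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &hold, sizeof(hold));
}
```

Reading the option back with getsockopt after the call is a cheap way to confirm that the flag really is set.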

Re,
David

Omen Wild (dbug3-flibble) wrote : Reproducible: ping causes reboot: 'network is unreachable (target: 192.168.1.1)'

Package: watchdog
Version: 5.2.4-4
Followup-For: Bug #32547

I can reproduce this bug 100% of the time, although it can take a while
before it happens. In my last 5 runs I had 318, 322, 320, 46, and 38
successful pings before it finally croaked. I put abort() calls in the
check_net function to get a core dump when it finally fails. The core shows
that it is the return at net.c:141 that is causing the problem. I also
have a full strace, but the only thing that I can determine is that the
sendto/recvfrom near net.c:71 is happening twice for every 'got answer
from target' message. Eventually all three loops fail and check_net
returns ENETUNREACH.

This is on a fairly lightly loaded server. The IP address in question
is for my internal network and is up for the entire duration of the
test.

There is something funky going on in check_net. If you run it under a
debugger and step through the code, you will find that the for loop
runs twice. The first time through the loop, the value of the received
packet's type is still 8 (i.e. ICMP_ECHO) at net.c:127. The second
time through, the value is correct (ICMP_ECHOREPLY) and the loop
terminates by returning ENOERR. The comments in the ping program
indicate this is normal behavior:

--- begin ping comments ---
 * parse_reply --
 * Print out the packet, if it came from us. This logic is necessary
 * because ALL readers of the ICMP socket get a copy of ALL ICMP packets
 * which arrive ('tis only fair). This permits multiple copies of this
 * program to be run without having intermingled output (or statistics!).
--- end ping comments ---

Maybe it fails all three times through the loop when another ping
program is running. Yes! I set up a 'ping -f 192.168.1.1' and the
watchdog bailed after only 2 successful pings! So check_net needs to
be reworked to keep calling recvfrom until there are no more packets,
checking each one to see if it belongs to the watchdog.
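
The filtering idea described above could be sketched roughly as follows. This is an illustration, not the patch that went into any watchdog release: the function name is_our_echo_reply and the local ICMP_TYPE_* macros are hypothetical, and it assumes the buffer is a raw IPv4 datagram as returned by recvfrom on a raw ICMP socket, with the echo identifier stored in network byte order.

```c
#include <stddef.h>
#include <stdint.h>

#define ICMP_TYPE_ECHOREPLY 0 /* same value as ICMP_ECHOREPLY */
#define ICMP_TYPE_ECHO      8 /* same value as ICMP_ECHO */

/* Hypothetical helper: return 1 if pkt is an ICMP echo reply carrying
 * our identifier, 0 otherwise (too short, wrong type, or someone
 * else's identifier). */
int is_our_echo_reply(const unsigned char *pkt, size_t len, uint16_t our_id)
{
    size_t ip_hlen;
    const unsigned char *icmp;
    uint16_t id;

    if (len < 1)
        return 0;

    /* The low nibble of the first byte is the IP header length in
     * 32-bit words. */
    ip_hlen = (size_t)(pkt[0] & 0x0f) * 4;
    if (ip_hlen < 20 || len < ip_hlen + 8) /* need a full ICMP echo header */
        return 0;

    icmp = pkt + ip_hlen;
    if (icmp[0] != ICMP_TYPE_ECHOREPLY)    /* e.g. a stray ICMP_ECHO */
        return 0;

    /* Bytes 4 and 5 of the ICMP echo header hold the identifier. */
    id = (uint16_t)((icmp[4] << 8) | icmp[5]);
    return id == our_id;
}
```

A receive loop would then keep calling recvfrom until its timeout expires, ignoring every packet for which this function returns 0 instead of counting it as a failed ping.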

I'll see if I can whip up a patch in the next couple of weeks. Or I'll
check back and see if someone beats me to it. :)

Omen Wild

-- System Information:
Debian Release: 3.1
Architecture: i386 (i686)
Shell: /bin/sh linked to /bin/dash
Kernel: Linux 2.6.12-1
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)

Versions of packages watchdog depends on:
ii libc6 2.3.2.ds1-22 GNU C Library: Shared libraries an
ii makedev 2.3.1-78 creates device files in /dev

-- no debconf information

--
I found out why my car was humming. It had forgotten the words.

Michael Meskes (meskes) wrote :

forwarded 361839 <email address hidden>
merge 361839 32547

James Harper (james-harper) wrote : watchdog: ping doesn't work

If anyone is interested, I have a patch that fixes this.

James

Michael Meskes (meskes) wrote : Bug#32547: fixed in watchdog 5.2.6-1

Source: watchdog
Source-Version: 5.2.6-1

We believe that the bug you reported is fixed in the latest version of
watchdog, which is due to be installed in the Debian FTP archive:

watchdog_5.2.6-1.diff.gz
  to pool/main/w/watchdog/watchdog_5.2.6-1.diff.gz
watchdog_5.2.6-1.dsc
  to pool/main/w/watchdog/watchdog_5.2.6-1.dsc
watchdog_5.2.6-1_i386.deb
  to pool/main/w/watchdog/watchdog_5.2.6-1_i386.deb
watchdog_5.2.6.orig.tar.gz
  to pool/main/w/watchdog/watchdog_5.2.6.orig.tar.gz

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to <email address hidden>,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Michael Meskes <email address hidden> (supplier of updated watchdog package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing <email address hidden>)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Format: 1.7
Date: Thu, 22 Jun 2006 20:50:01 +0200
Source: watchdog
Binary: watchdog
Architecture: source i386
Version: 5.2.6-1
Distribution: unstable
Urgency: low
Maintainer: Michael Meskes <email address hidden>
Changed-By: Michael Meskes <email address hidden>
Description:
 watchdog - software watchdog
Closes: 32547 351398 361835 361839 367126 368774 372819
Changes:
 watchdog (5.2.6-1) unstable; urgency=low
 .
   * New upstream version, closes: #32547, #351398, #361839, #361835
   * Added French translation, closes: #368774
   * Added Portuguese translation, closes: #372819
   * Added missing db_stop to postinst, closes: #367126
   * Added udev to Depends: line as alternative to makedev
   * Added debconf-updatepo call to clean target in debian/rules
   * Fixed /etc/dafault/watchdog handling in postinst, bug reported by James
     Harper <email address hidden>
Files:
 7f1813d7beb626622c6306baade82eb8 589 admin extra watchdog_5.2.6-1.dsc
 43c33708ac07d458bdbd416812481bab 138446 admin extra watchdog_5.2.6.orig.tar.gz
 725b822a3d460b3100ad9e7aaff4bda7 752 admin extra watchdog_5.2.6-1.diff.gz
 f1eeb0ac792a8461bb8f4591c2c2ee8d 60442 admin extra watchdog_5.2.6-1_i386.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)

iD8DBQFEm6RmVkEm8inxm9ERAgsUAJwL/lYlhC/jKBGSmi6TOU46x1j3CgCeO/tN
SMtvHjBibZo5/QYbFwKPbow=
=gtH6
-----END PGP SIGNATURE-----

Changed in watchdog:
status: Unknown → Fix Released

William Grant (wgrant) wrote :

Fixed in Debian, with the fix synced into Feisty.

Changed in watchdog:
status: Unconfirmed → Fix Released