Excessive logging by apcsmart program

Bug #535583 reported by Lupe Christoph
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
nut (Ubuntu)
Expired
Medium
Unassigned

Bug Description

Binary package hint: nut

/var/log/daemon.log is rapidly filled with lines like this:
Mar 10 00:27:44 alanya apcsmart[4944]: Communications with UPS lost: timeout
Mar 10 00:27:44 alanya upsd[4946]: Data for UPS [ups] is stale - check driver
Mar 10 00:27:44 alanya apcsmart[4944]: smartmode: ser_send_char failed: Input/output error
Mar 10 00:27:44 alanya apcsmart[4944]: Communications with UPS lost: Communications with UPS lost - check cabling
Mar 10 00:27:44 alanya apcsmart[4944]: smartmode: ser_send_char failed: Input/output error
Mar 10 00:27:44 alanya apcsmart[4944]: Communications with UPS lost: Communications with UPS lost - check cabling
...
Mar 10 08:27:27 alanya apcsmart[4944]: Communications with UPS lost: Communications with UPS lost - check cabling
Mar 10 08:27:27 alanya apcsmart[4944]: smartmode: ser_send_char failed: Input/output error
Mar 10 08:27:27 alanya apcsmart[4944]: last message repeated 9 times
...
Until the filesystem that contains /var/log fills up. In the one second Mar 10 00:27:45 it logged 8148 lines, with each with "smartmode: ser_send_char failed: Input/output error" "repeated 9 times". So apcsmart sent about 27000 messages per second to syslog.

I would like to ask apcsmart to take it easy and back off after an error. ;-)

Ubuntu 9.10
nut 2.4.1-3ubuntu2

I will not attach the complete daemon.log, It is almost one GB.

ProblemType: Bug
Architecture: amd64
Date: Wed Mar 10 09:50:54 2010
DistroRelease: Ubuntu 9.10
NonfreeKernelModules: fglrx
Package: nut 2.4.1-3ubuntu2
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.31-14.48-generic
SourcePackage: nut
Uname: Linux 2.6.31-14-generic x86_64

Tags: apport-bug
Revision history for this message
Lupe Christoph (lupe) wrote :
Thierry Carrez (ttx)
Changed in nut (Ubuntu):
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Chuck Short (zulcss) wrote :

This should be fixed for lucid. Thanks for the bug report.

Regards
chuck

Changed in nut (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Lupe Christoph (lupe) wrote :

Since the supposed fix, I have had several ne incidents of this bug.

I'm now running Maverick Meerkat, and the log shows this:

Feb 10 14:10:14 alanya apcsmart[3238]: Communications with UPS lost: Communications with UPS lost - check cabling
Feb 10 14:10:14 alanya apcsmart[3238]: smartmode: ser_send_char failed: Input/output error
Feb 10 14:10:14 alanya apcsmart[3238]: last message repeated 9 times

... and nauseam, or until the /var/log partition fills up. The nut package is up to data (2.4.3-1ubuntu5). Please reopen this bug.

Revision history for this message
Arnaud Quette (aquette) wrote : Re: [Bug 535583] Re: Excessive logging by apcsmart program

Hi Christoph,

2011/2/10 Lupe Christoph

> Since the supposed fix, I have had several ne incidents of this bug.
>
> I'm now running Maverick Meerkat, and the log shows this:
>
> Feb 10 14:10:14 alanya apcsmart[3238]: Communications with UPS lost:
> Communications with UPS lost - check cabling
> Feb 10 14:10:14 alanya apcsmart[3238]: smartmode: ser_send_char failed:
> Input/output error
> Feb 10 14:10:14 alanya apcsmart[3238]: last message repeated 9 times
>
> ... and nauseam, or until the /var/log partition fills up. The nut
> package is up to data (2.4.3-1ubuntu5). Please reopen this bug.
>

could you please send us a driver debug output for investigation purpose.
Ie, after having stopped the running NUT instance, launch:
$ sudo /lib/nut/apcsmart -DDDDD -a <device-id-from-ups.conf>

let it run for a few seconds / minutes (at least, have a couple of issue
reproduction), then break using Ctrl+C. Don't forget to relaunch NUT then...

cheers,
Arnaud
--
Linux / Unix Expert R&D - Eaton - http://powerquality.eaton.com
Network UPS Tools (NUT) Project Leader - http://www.networkupstools.org/
Debian Developer - http://www.debian.org
Free Software Developer - http://arnaud.quette.free.fr/

Revision history for this message
Lupe Christoph (lupe) wrote :

On Thursday, 2011-02-10 at 14:54:34 -0000, Arnaud Quette wrote:
> 2011/2/10 Lupe Christoph

> could you please send us a driver debug output for investigation purpose.
> Ie, after having stopped the running NUT instance, launch:
> $ sudo /lib/nut/apcsmart -DDDDD -a <device-id-from-ups.conf>

> let it run for a few seconds / minutes (at least, have a couple of issue
> reproduction), then break using Ctrl+C. Don't forget to relaunch NUT then...

Will do, but this can take a few days or even weeks.

Lupe Christoph
--
| It is a well-known fact in any organisation that, if you want a job |
| done, you should give it to someone who is already very busy. |
| Terry Pratchett, "Unseen Academicals" |

Changed in nut (Ubuntu):
status: Fix Released → Incomplete
Revision history for this message
Lupe Christoph (lupe) wrote :
Download full text (5.2 KiB)

Of course after a reboot, I forgot to stop nut and start apcsmart
with debugging. :-( So when the problem reoccurred, I didn't get the
debug output.

But I ran strace of the apcsmart process, you might be able to
use that instead.

write(4, "Y", 1) = -1 EIO (Input/output error)
sendto(3, "<27>Feb 13 00:18:03 apcsmart[321"..., 87, MSG_NOSIGNAL, NULL, 0) = 87
select(7, [4 5 6], NULL, NULL, {1, 999837}) = 1 (in [4], left {1, 999835})
write(4, "Y", 1) = -1 EIO (Input/output error)
sendto(3, "<27>Feb 13 00:18:03 apcsmart[321"..., 87, MSG_NOSIGNAL, NULL, 0) = 87
select(7, [4 5 6], NULL, NULL, {1, 999839}) = 1 (in [4], left {1, 999837})
write(4, "Y", 1) = -1 EIO (Input/output error)
sendto(3, "<27>Feb 13 00:18:03 apcsmart[321"..., 87, MSG_NOSIGNAL, NULL, 0) = 87
select(7, [4 5 6], NULL, NULL, {1, 999818}) = 1 (in [4], left {1, 999816})
write(4, "Y", 1) = -1 EIO (Input/output error)
sendto(3, "<27>Feb 13 00:18:03 apcsmart[321"..., 87, MSG_NOSIGNAL, NULL, 0) = 87
select(7, [4 5 6], NULL, NULL, {1, 999757}) = 1 (in [4], left {1, 999755})
write(4, "Y", 1) = -1 EIO (Input/output error)
sendto(3, "<27>Feb 13 00:18:03 apcsmart[321"..., 87, MSG_NOSIGNAL, NULL, 0) = 87
select(7, [4 5 6], NULL, NULL, {1, 999838}) = 1 (in [4], left {1, 999836})
write(4, "Y", 1) = -1 EIO (Input/output error)
sendto(3, "<27>Feb 13 00:18:03 apcsmart[321"..., 87, MSG_NOSIGNAL, NULL, 0) = 87
select(7, [4 5 6], NULL, NULL, {1, 999845}) = 1 (in [4], left {1, 999843})
write(4, "Y", 1) = -1 EIO (Input/output error)
sendto(3, "<27>Feb 13 00:18:03 apcsmart[321"..., 87, MSG_NOSIGNAL, NULL, 0) = 87
select(7, [4 5 6], NULL, NULL, {1, 999845}) = 1 (in [4], left {1, 999843})

I have no idea why there is an IO error on the serial port. When I
restart apcsmart, it works OK. But it does not seem to handle the
EIO, it just goes on and on looking for input on /dev/ttyS2 and
/var/run/nut/apcsmart-ups (on two FDs), but then ignores the input
available on /dev/ttyS2.

The 'Y' command tells the UPS to enter smart mode. I can imagine this
is the normal mode of operation, and apcsmart has been thrown off track
and tries to re-engage, but the code that is supposed to do this is
faulty. Or maybe the UPS has dropped the connection. When I unplug the
cable, the trace is similar but different:

write(4, "C", 1) = 1
nanosleep({0, 0}, NULL) = 0
select(5, [4], NULL, NULL, {3, 0}) = 1 (in [4], left {2, 891541})
read(4, "045.9\r\n", 64) = 7
write(4, "P", 1) = 1
nanosleep({0, 0}, NULL) = 0
select(5, [4], NULL, NULL, {3, 0}) = 1 (in [4], left {2, 892849})
read(4, "017.6\r\n", 64) = 7
write(4, "X", 1) = 1
nanosleep({0, 0}, NULL) = 0
select(5, [4], NULL, NULL, {3, 0}) = 1 (in [4], left {2, 957850})
read(4, "N", 64) = 1

Cable is now unplugged.

select(5, [4], NULL, NULL, {3, 0}) = 0 (Timeout)
sendto(3, "<28>Feb 13 00:42:05 apcsmart[249"..., 73, MSG_NOSIGNAL, NULL, 0) = 73
...

Read more...

Revision history for this message
Arnaud Quette (aquette) wrote :

I definitely need more info!
please reply to ALL:

- what is the exact model and date of manufacturing?
- are you sure this unit is ok?
- have you really checked the cabling or made the whole (cable + UPS) work
somehow (using APC's software or apcupsd)?
- what is the meantime between occurrences of these issues?
- is the device reachable (using upsc for example) between issues?

A driver debug output is really needed!
Note that I'm not the developer of this driver, nor have any acquaintance
with APC.

cheers
Arnaud

Revision history for this message
Lupe Christoph (lupe) wrote :

On Monday, 2011-02-14 at 21:54:20 -0000, Arnaud Quette wrote:
> I definitely need more info!
> please reply to ALL:

> - what is the exact model and date of manufacturing?

SmartUPS 300I NET. I have the serial number (GS9809283199) but no date.

> - are you sure this unit is ok?

You can't prove the absence of faults.

> - have you really checked the cabling or made the whole (cable + UPS) work
> somehow (using APC's software or apcupsd)?

Well, as I said this is working OK for days or weeks. Then something
happens that triggers a bug in apcsmart.

> - what is the meantime between occurrences of these issues?

I don;t have enough data. It's in the range of weeks or months.

> - is the device reachable (using upsc for example) between issues?

Sure, everything works fine.

> A driver debug output is really needed!

I'm running it again, but no promises. Reboots are much more frequent
than this misbehaviour.

> Note that I'm not the developer of this driver, nor have any acquaintance
> with APC.

Same here. Though I will probably try to locate this bug if we don;t
make progress with the debugging output, either because it does not tell
us enough or because I don't manage to capture it.

I would have thought finding the place in the code where it is trying to
reset the UPS connection wouldn't be this hard.

Lupe Christoph
--
| It is a well-known fact in any organisation that, if you want a job |
| done, you should give it to someone who is already very busy. |
| Terry Pratchett, "Unseen Academicals" |

Revision history for this message
Arnaud Quette (aquette) wrote : Re: [Bug 535583] Excessive logging by apcsmart program

2011/2/15 Lupe Christoph

> On Monday, 2011-02-14 at 21:54:20 -0000, Arnaud Quette wrote:
> > I definitely need more info!
> > please reply to ALL:
>
> > - what is the exact model and date of manufacturing?
>
> SmartUPS 300I NET. I have the serial number (GS9809283199) but no date.
>

 it seems to be a recent model.

> - are you sure this unit is ok?
>
> You can't prove the absence of faults.
>

this was related to the following question...

> > - have you really checked the cabling or made the whole (cable + UPS)
> work
> > somehow (using APC's software or apcupsd)?
>
> Well, as I said this is working OK for days or weeks. Then something
> happens that triggers a bug in apcsmart.
>

 quickly reading back the thread, I can't find these info...

> - what is the meantime between occurrences of these issues?
>
> I don;t have enough data. It's in the range of weeks or months.
>

as per your previous posts, this seemed more to be a matter of minutes /
hours.

> - is the device reachable (using upsc for example) between issues?
>
> Sure, everything works fine.
>
> > A driver debug output is really needed!
>
> I'm running it again, but no promises. Reboots are much more frequent
> than this misbehaviour.
>
> > Note that I'm not the developer of this driver, nor have any acquaintance
> > with APC.
>
> Same here. Though I will probably try to locate this bug if we don;t
> make progress with the debugging output, either because it does not tell
> us enough or because I don't manage to capture it.
>
> I would have thought finding the place in the code where it is trying to
> reset the UPS connection wouldn't be this hard.
>

this is not the problem. This code is in the smartmode() function of
apcsmart.c:
http://svn.debian.org/wsvn/nut/trunk/drivers/apcsmart.c

we see the 5 attempts to go to smart mode ('Y' command), but my aim is to
understand why it is failing, and how to cleanly solve this without
impacting support for other units.

Some more questions:
- how are you handling the device's permissions?
Refer to § II, section 3:
http://git.debian.org/?p=collab-maint/nut.git;a=blob_plain;f=debian/nut.README.Debian;hb=HEAD

cheers
Arnaud
--
Linux / Unix Expert R&D - Eaton - http://powerquality.eaton.com
Network UPS Tools (NUT) Project Leader - http://www.networkupstools.org/
Debian Developer - http://www.debian.org
Free Software Developer - http://arnaud.quette.free.fr/
--
Conseiller Municipal - Saint Bernard du Touvet

Revision history for this message
Lupe Christoph (lupe) wrote :

On Tuesday, 2011-02-15 at 14:16:58 +0100, Arnaud Quette wrote:

> > I would have thought finding the place in the code where it is trying to
> > reset the UPS connection wouldn't be this hard.

> this is not the problem. This code is in the smartmode() function of
> apcsmart.c:
> http://svn.debian.org/wsvn/nut/trunk/drivers/apcsmart.c

I'll have a look at that code.

> we see the 5 attempts to go to smart mode ('Y' command), but my aim is to
> understand why it is failing, and how to cleanly solve this without
> impacting support for other units.

Of course. The problem is that the program is sending the command
infinitely, probably because of the EIO.

> Some more questions:
> - how are you handling the device's permissions?
> Refer to § II, section 3:
> http://git.debian.org/?p=collab-maint/nut.git;a=blob_plain;f=debian/nut.README.Debian;hb=HEAD

/etc/udev/rules.d/zzzlpc.rules:
KERNEL=="ttyS2", OWNER="nut", GROUP="nut", MODE="0660"

The serial line is on a PCI board. It may be a problem of that board,
not the UPS. Which is cleared by closing the device.

Lupe Christoph
--
| It is a well-known fact in any organisation that, if you want a job |
| done, you should give it to someone who is already very busy. |
| Terry Pratchett, "Unseen Academicals" |

Revision history for this message
Lupe Christoph (lupe) wrote :

On Tuesday, 2011-02-15 at 13:16:58 -0000, Arnaud Quette wrote:

> this is not the problem. This code is in the smartmode() function of
> apcsmart.c:
> http://svn.debian.org/wsvn/nut/trunk/drivers/apcsmart.c

> we see the 5 attempts to go to smart mode ('Y' command), but my aim is to
> understand why it is failing, and how to cleanly solve this without
> impacting support for other units.

I found no code that does five attempts. But this code in main.c,
starting on Line 618:

        while (!exit_flag) {

                struct timeval timeout;

                gettimeofday(&timeout, NULL);
                timeout.tv_sec += poll_interval;

                upsdrv_updateinfo();

                while (!dstate_poll_fds(timeout, extrafd) && !exit_flag) {
                        /* repeat until time is up or extrafd has data */

upsdrv_updateinfo() calls smartmode().

dstate_poll_fds() checks if there is any file descriptor that is
"available". In our case:

select(7, [4 5 6], NULL, NULL, {1, 999837}) = 1 (in [4], left {1, 999835})

FD 4 is the serial line, which is passed to dstate_poll_fds() as
extrafd.

When there is data that can be read from the UPS no code in
dstate_poll_fds() reads from extrafd, there is only code that reads
from the other input FDs. The outer loop above also ignores extrafd.
exit_flag is never set, so it continues. And because there is an active
file descriptor, the select returns immediately (actually it takes two
microseconds).

The solution is to add code that reads all data from extrafd and discards
it because nobody asked for it. I would also close and reopen the serial
line in smartmode(). I would prepare a patch if I knew more about the
I/O abstractions used in the nut driver code. Sorry.

HTH,
Lupe Christoph
--
| It is a well-known fact in any organisation that, if you want a job |
| done, you should give it to someone who is already very busy. |
| Terry Pratchett, "Unseen Academicals" |

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for nut (Ubuntu) because there has been no activity for 60 days.]

Changed in nut (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Arnaud Quette (aquette) wrote :

Hi Lupe,

since we now have an apcsmart maintainer, I'm forwarding this issue to him.

@Michal: could you please have a look at this issue [1], and give us your
feeling?

cheers,
Arnaud
--
[1] https://bugs.launchpad.net/bugs/535583

2011/2/15 Lupe Christoph <email address hidden>

> On Tuesday, 2011-02-15 at 13:16:58 -0000, Arnaud Quette wrote:
>
> > this is not the problem. This code is in the smartmode() function of
> > apcsmart.c:
> > http://svn.debian.org/wsvn/nut/trunk/drivers/apcsmart.c
>
> > we see the 5 attempts to go to smart mode ('Y' command), but my aim is to
> > understand why it is failing, and how to cleanly solve this without
> > impacting support for other units.
>
> I found no code that does five attempts. But this code in main.c,
> starting on Line 618:
>
> while (!exit_flag) {
>
> struct timeval timeout;
>
> gettimeofday(&timeout, NULL);
> timeout.tv_sec += poll_interval;
>
> upsdrv_updateinfo();
>
> while (!dstate_poll_fds(timeout, extrafd) && !exit_flag) {
> /* repeat until time is up or extrafd has data */
>
>
>
> upsdrv_updateinfo() calls smartmode().
>
> dstate_poll_fds() checks if there is any file descriptor that is
> "available". In our case:
>
> select(7, [4 5 6], NULL, NULL, {1, 999837}) = 1 (in [4], left {1, 999835})
>
> FD 4 is the serial line, which is passed to dstate_poll_fds() as
> extrafd.
>
> When there is data that can be read from the UPS no code in
> dstate_poll_fds() reads from extrafd, there is only code that reads
> from the other input FDs. The outer loop above also ignores extrafd.
> exit_flag is never set, so it continues. And because there is an active
> file descriptor, the select returns immediately (actually it takes two
> microseconds).
>
> The solution is to add code that reads all data from extrafd and discards
> it because nobody asked for it. I would also close and reopen the serial
> line in smartmode(). I would prepare a patch if I knew more about the
> I/O abstractions used in the nut driver code. Sorry.
>
> HTH,
> Lupe Christoph
> --
> | It is a well-known fact in any organisation that, if you want a job |
> | done, you should give it to someone who is already very busy. |
> | Terry Pratchett, "Unseen Academicals" |
>

Revision history for this message
Arnaud Quette (aquette) wrote :

2011/4/21 Michal Soltys <email address hidden>

> On 11-04-21 10:34, Arnaud Quette wrote:
>
>> Hi Lupe,
>>
>> since we now have an apcsmart maintainer, I'm forwarding this issue to
>> him.
>>
>> @Michal: could you please have a look at this issue [1], and give us
>> your feeling?
>>
>> cheers,
>> Arnaud
>>
>
> --
>> [1] https://bugs.launchpad.net/bugs/535583
>>
>> 2011/2/15 Lupe Christoph <<email address hidden>
>> <mailto:<email address hidden>>>
>>
>>
> The suggestions are pretty fine.
>
> - flushing stale input (though at driver level)
>
> Certainly. I even added some flushes earlier, but haven't touched the
> updateinfo and/or the functions it calls yet. I'll add it along with
> forthcoming patches (icanon mode and the rest). Looking at the strace,
> flushing post-failure might be good idea in certain cases as well.
>
> - reopening serial port
>
> If the upper layers of nut don't disallow this kind of behaviour for some
> reason - it's good idea as well. Should be helpful in weird cases, and at
> the very least wouldn't hurt at all. If it would help in this particular
> case, hard to say.
>

none special. we already have to somehow do so with USB devices.

> - smartmode()
>
> TBH, I'm not sure why it diligently tries to enter SM 5 times. Pre-emptive
> flush + 'Y' + reasonable delay (icanon or not) should be all that is
> necessary. If we don't succeed, next attempt shouldn't miraculously (in
> theory) make much of a difference 1 second later ...
>
>
> Thanks for pointing out those issues.
>

thanks for taking care of it ;-)

cheers,
Arnaud
--
Linux / Unix Expert R&D - Eaton - http://powerquality.eaton.com
Network UPS Tools (NUT) Project Leader - http://www.networkupstools.org/
Debian Developer - http://www.debian.org
Free Software Developer - http://arnaud.quette.free.fr/

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers