Long term timer thread failure under Linux kernel 2.4.21-20.ELsmp #1 SMP

Bug #541201 reported by Jeff Hill
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
High
mrk

Bug Description

If I start catime with a channel that will not be found and leave it running for a few days I see a failure in one of the timer threads related to use of a bad event semaphore identifier. Attempts to debug the situation have not been succesful as the debugger appears to also be confused. Is the entire Linux process scrambled at this point?

Also, why is the thread id -1218549152 below? Is that normal? I also wonder why the debugger segmentation faults when it terminates?

Additional information:
 (gdb) run fishy 1
The program being debugged has been started already.
Start it from the beginning? (y or n) y

Starting program: /home/hill/epicsR3.14/epics/base/bin/linux-x86/catime fishy 1
[Thread debugging using libthread_db enabled]
[New Thread -1218549152 (LWP 23850)]
[New Thread -1218552912 (LWP 23851)]
[New Thread -1229042768 (LWP 23852)]
Testing with 1 channels named fishy
channel connect test
[New Thread -1239536720 (LWP 23853)]
Detaching after fork from child process 23854.

Program received signal SIGINT, Interrupt.
[Switching to Thread -1218549152 (LWP 23850)]
0x00e5440b in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) cont
Continuing.
[New Thread -1250026576 (LWP 23855)]
CA client library is unable to contact CA repeater after 50 tries.
Silence this message by starting a CA repeater daemon
or by calling ca_pend_event() and or ca_poll() more often.
pthread_cond_timedwait failed: error Invalid argument
[Switching to Thread -1229042768 (LWP 23852)]

Catchpoint 1 (exception thrown)
0x004d11e6 in __cxa_throw () from /usr/lib/libstdc++.so.5

(gdb) info threads
Cannot fetch general-purpose registers for thread -1229042768: generic error

(gdp) bt
#0 0x00138cdf in raise () from /lib/tls/libc.so.6

(gdb) quit
The program is running. Exit anyway? (y or n) y
Segmentation fault

~/epicsR3.14/epics/base$ uname -a
Linux santana 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386 GNU/Linux

Original Mantis Bug: mantis-139
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=139

Tags: libcom 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :
Download full text (10.6 KiB)

~/epicsR3.14/epics/base$ gdb catime
GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) run fishy 1
Starting program: /home/hill/epicsR3.14/epics/base/bin/linux-x86/catime fishy 1
[Thread debugging using libthread_db enabled]
[New Thread -1218524576 (LWP 3995)]
[New Thread -1218528336 (LWP 3998)]
[New Thread -1229018192 (LWP 3999)]
Testing with 1 channels named fishy
channel connect test
[New Thread -1239512144 (LWP 4000)]
Detaching after fork from child process 4001.

Program received signal SIGINT, Interrupt.
[Switching to Thread -1218524576 (LWP 3995)]
0x0011840b in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) const
Undefined command: "const". Try "help".
(gdb) cont
Continuing.
[New Thread -1250002000 (LWP 4002)]
CA client library is unable to contact CA repeater after 50 tries.
Silence this message by starting a CA repeater daemon
or by calling ca_pend_event() and or ca_poll() more often.
pthread_cond_timedwait failed: error Invalid argument
[Switching to Thread -1229018192 (LWP 3999)]

Catchpoint 1 (exception thrown)
0x007521e6 in __cxa_throw () from /usr/lib/libstdc++.so.5
(gdb) bt
#0 0x007521e6 in __cxa_throw () from /usr/lib/libstdc++.so.5
#1 0x00d7e54a in epicsEvent::wait (this=0x89534dc,
    timeOut=0.026991000000000001) at ../../../src/libCom/osi/epicsEvent.cpp:78
#2 0x00d8acf0 in timerQueueActive::run (this=0x8953488)
    at ../../../src/libCom/timer/timerQueueActive.cpp:69
#3 0x00d7ca12 in epicsThreadCallEntryPoint (pPvt=0x89534e4)
    at ../../../src/libCom/osi/epicsThread.cpp:41
#4 0x00d83174 in start_routine (arg=0x8953788)
    at ../../../src/libCom/osi/os/posix/osdThread.c:294
#5 0x00115dec in start_thread () from /lib/tls/libpthread.so.0
#6 0x0028719a in clone () from /lib/tls/libc.so.6

(gdb) info threads
  5 Thread -1250002000 (LWP 4002) 0x0011821d in pthread_cond_wait@@GLIBC_2.3.2
    () from /lib/tls/libpthread.so.0
  4 Thread -1239512144 (LWP 4000) 0x00287dce in recvfrom ()
   from /lib/tls/libc.so.6
* 3 Thread -1229018192 (LWP 3999) 0x007521e6 in __cxa_throw ()
   from /usr/lib/libstdc++.so.5
  2 Thread -1218528336 (LWP 3998) 0x0011821d in pthread_cond_wait@@GLIBC_2.3.2
    () from /lib/tls/libpthread.so.0
  1 Thread -1218524576 (LWP 3995) 0x0011840b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0

(gdb) thread 1
[Switching to thread 1 (Thread -1218524576 (LWP 3995))]#0 0x0011840b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0x0011840b in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
#1 0x0029425d in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libc.so.6
#2 0x00d849b6 in epicsEventWaitWithTimeout (pev...

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Hm..., very curious, it was invalid wnhen it crashed, but now its ok?

(gdb) print this->rescheduleEvent
$3 = {id = 0x8953610}
(gdb) call this->rescheduleEvent.wait (1.0)
$4 = false
(gdb) down
#1 0x00d7e54a in epicsEvent::wait (this=0x89534dc,
    timeOut=0.026991000000000001) at ../../../src/libCom/osi/epicsEvent.cpp:78
78 throw invalidSemaphore ();
(gdb) print *this
$5 = {id = 0x8953610}
(gdb) call epicsEventWaitWithTimeout (this->id,1.0)
$6 = epicsEventWaitTimeout

Revision history for this message
Jeff Hill (johill-lanl) wrote :

It looks like the POSIX version of epicsEventWaitWithTimeout occasionally returns a status that isnt epicsEventWaitOK or epicsEventWaitTimeout when the semaphore is valid.

I am starting to suspect that this is mantis-96 (posix thread system call returns EINTR) again (which Andrew may have prematurely closed). If that happened, the code below would return status that isnt epicsEventWaitOK or epicsEventWaitTimeout when the semaphore is valid.

epicsEventWaitStatus epicsEventWaitWithTimeout(epicsEventId pevent, double timeout)
{
    struct timespec wakeTime;
    int status = 0;
    int unlockStatus;

    status = pthread_mutex_lock(&pevent->mutex);
    checkStatusQuit(status,"pthread_mutex_lock","epicsEventWaitWithTimeout");
    if(!pevent->isFull) {
        convertDoubleToWakeTime(timeout,&wakeTime);
        status = pthread_cond_timedwait(
            &pevent->cond,&pevent->mutex,&wakeTime);
    }
    if(status==0) pevent->isFull = 0;
    unlockStatus = pthread_mutex_unlock(&pevent->mutex);
    checkStatusQuit(unlockStatus,"pthread_mutex_unlock","epicsEventWaitWithTimeout");
    if(status==0) return(epicsEventWaitOK);
    if(status==ETIMEDOUT) return(epicsEventWaitTimeout);
    checkStatus(status,"pthread_cond_timedwait");
    return(epicsEventWaitError);
}

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Marty Kraimer:

Note that the SUSV3 states for pthrae_mutec_lock'

These functions shall not return an error code of [EINTR].

Doesn't this mean the problem must be somthing besides a signal?

Revision history for this message
Jeff Hill (johill-lanl) wrote :

We can conclude that (A) some other errno is being returned or (B) this (very recent) version of RH Linux isn’t compliant with SUSV3.

At a minimum, the code should probably be changed to print a message including errno, or strerror(errno), when it is returning status indicating a bad semaphore?

After that change is in place I will need to wait 2 days until this occurs again, but that’s ok, I can wait.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I had a look at the SUSV3 also and apparently pthread_cond_timedwait() is allowed to return only ETIMEDOUT or EINVAL. So this *is* looking like a compliance issue. Nevertheless, the client library, IOC, GW etc will be failing roughly every two days so this probably needs to be dealt with.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Eric Norum,

Jeff Hill wrote:
> Eric,
>
> Be certain to scroll down and read the bug note entries at the bottom as
> they are the ones I created most recently - after I deduced the underlying
> cause of the problem.
>
> This may not be directly relevant to Mantis 139, but I should add this
> comment anyways. I seem to recall that you are blocking signals, but what is
> to stop users from just unblocking them again.

"Doctor, it hurts when I do this".
"So, don't do that".

There are library routines and packages not under our control which may not
take well to signals. My feeling is that users should not be mucking about
with signals unless they know what they're doing -- and that means only
unblocking signals when the user code is ready to handle them and when the
thread which has unblocked them is not in an EPICS, or any other, library routine.

Blocking signals from all but the main thread seems to me to be a simple and
prudent thing to do. If we hear an uproar from someone we can revist the
problem later.

Is that a robust solution?

I'd say that it's a reasonable solution and easy to describe.

My
> conclusion acquired after dealing with these issues in the CA client library
> is that all blocking system calls must be protected from interrupting
> circumstances by checking the status for EINTR and properly responding.

Yes, a good idea, but there is also code not under our control which must be
protected. Blocking signals is an easy way to ensure that this code is not
stressed unnecessarily.

I am
> aware of this issue because the CA client library has been living in the
> same process with many different user codes. That will also be true with the
> OSI stuff in libCom - it will not always be running in an IOC, and the
> natural tendency of users will be to do as they please, and be annoyed with
> any system that restricts their freedom.

I think that we should leave this until we know that it's really causing
problems for someone.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

There is certainly nothing wrong with blocking signals by default. What might be wrong would be using that default state as a safety net that justifies not dealing with EINTR in code that uses blocking POSIX calls.

Admittedly, the standard says that only ETIMEDOUT and EINVAL are returned, but in practice something else must be occurring, because in the debugger a subsequent call returns status indicating that the semaphore is ok.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Eric Norum,

After a little more thinking, my position is hardening. I don't think that it
is reasonable to clutter up the code everywhere with constructs like:
 do {
  status = someSysCall(..........);
 } while ((status == ERROR_STATUS_CODE) && (errno == EINTR));

Requiring the above is likely to cause lots of problems with drivers which
miss checking this every place it is needed.

Instead, signals should be blocked unless the user has good reason to allow
them -- and then only unblocked when the user code is about to call a small
subset of the EPICS libCom routines (epicsThreadSleep, epicsEventWait,
epicsEventTimedWait, and maybe a few others). Thus only this documented
subset of routines need to worry about EINTR returns.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

The CA client library is certainly cluttered just as you describe, but this only occurs with socket calls that block such as send() recv(), or select(). These calls already need to have substantial code dealing with all of the many possible error conditions.

This has alway been a contraversial issue with UNIX. Fortunately, today POSIX is clear about what calls need EINTR protection.

Our problem here initially appears to be incompliance, but the problem is hopefully restricted to pthread_cond_timedwait and or pthread_mutex_lock.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Below is the man entry for pthread_cond_timedwait on the very recent Linux system where the problem occurs. The same content on EINTR is in the man pages of a less recent Linux system where I have been so-far unable to reproduce the problem.

---------snip-----snip----------------

The pthread_cond_timedwait function returns the following error codes
on error:

ETIMEDOUT
the condition variable was not signaled until the timeout
specified by abstime

EINTR pthread_cond_timedwait was interrupted by a signal

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I am running a long term test to verify that the problem is fixed

Revision history for this message
mrk (mrk-aps) wrote :

probably resolved 3.14.7

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I left catime running on the same Linux system for about one day, and unfortunately the following occurred.

catime fishfood
Testing with 10000 channels named fishfood
channel connect test
pthread_cond_timedwait failed: error Invalid argument
Aborted

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Is the "error Invalid argument" diagnostic a clue? Apparently, my guess that errno==EINTR was dead wrong. Is it possible that the delay passed in to pthread_cond_timedwait has been allowed to go out of range? There should only be a limited set of circumstances that could lead to errorno="Invalid argument". I think that I have already ruled out an invalid semaphore id (see my notes in Mantis). So perhaps we can deduce which one it is based on a process of elimination.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From: Marty Kraimer:

Note that this is the same error I got leaving a CA client running overnight

Marty Kraimer wrote:
I left dbcaPerform running overnight. When I came in this morning I found:

mercury% ../../bin/solaris-sparc/changeLinks 100 5 0xf
pthread_cond_timedwait failed: error Invalid argument
epicsThread: Unexpected C++ exception "epicsEvent::invalidSemaphore()"
with type "epicsEvent::invalidSemaphore" in thread "timerQueue" at Wed
Oct 20 2004 22:02:27.974537800
epicsThread: Unexpected C++ exception "epicsEvent::invalidSemaphore()"
with type "epicsEvent::invalidSemaphore" in thread "timerQueue" at Wed
Oct 20 2004 22:02:27.974537800
Abort (core dumped)

I am adding some print statements to osdEvent to see if I can find out
more info and will try again.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I started my test this morning and it failed in less than an hour (while placing the test in the background to read a man page) as follows.

:pthread_cond_timedwait failed: error Invalid argument
epicsThread: Unexpected C++ exception "epicsEvent::invalidSemaphore()" with type "N10epicsEvent16invalidSemaphoreE" in thread "timerQueue" at Fri Oct 22 2004 09:17:47.977425000

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I see the bug.

The code is assuming that EINTR is in the status returned from the function. Thats not the case. The status will be -1 and EINTR will be in errno.

    int status;
    while(1) {
        status = pthread_cond_timedwait(condId,mutexId,time);
        if(status!=EINTR) return status;
        errlogPrintf("pthread_cond_timedwait returned EINTR. Violates SUSv3\\n");
    }

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From: Marty Kraimer

The standard states:

        RETURN VALUE

    Except in the case of [ETIMEDOUT], all these error checks shall act
    as if they were performed immediately at the beginning of processing
    for the function and shall cause an error return, in effect, prior
    to modifying the state of the mutex specified by mutex or the
    condition variable specified by cond.

    Upon successful completion, a value of zero shall be returned;
    otherwise, an error number shall be returned to indicate the error.

        ERRORS

    The pthread_cond_timedwait() function shall fail if:

    [ETIMEDOUT]
        The time specified by abstime to pthread_cond_timedwait() has
    passed.[EINVAL]
    The value specified by abstime is invalid.

    These functions may fail if:

    [EINVAL]
    The value specified by cond or mutex is invalid.[EPERM]
        The mutex was not owned by the current thread at the time of the
    call.

    These functions shall not return an error code of [EINTR].

------------------------------------------------------------------------

So what does it mean?

I looked at "Programming with POSIX Threads" and it shows that the
pthread_xxx routines return the error status instead of putting it in errno.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I was looking at the VERY old book "Guide to DEC Threads" and apparenetly this has changed as that book indicated that the error code was in errno as is posix convention.

The Linux man page states that "all condition variable functions return 0 on success and a non-zero error code on error". I wasnt certain if that meant "returns -1 on error and sets errno as is POSIX convention". Thats why I looked at the "Guide to DEC Threads" book. I should have looked in the Single UNIX Specification.

So we still have a mystery.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

New theory....

The function convertDoubleToWakeTime has the following code. It looks like it should be "if(wakeTime->tv_nsec>=1000000000L". A ">=" is required because it is invalid for there to be more than 999999999 nanoseconds in the fractional part.

    if(wakeTime->tv_nsec>1000000000L) {
        wakeTime->tv_nsec -= 1000000000L;
        ++wakeTime->tv_sec;
    }

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Marty

> > Good news!!!
> > With the change Jeff suggested to convertDoubleToWakeTime, i.e. dont let
> > tv_nsec==1000000000L
> > the changeLinks test ran all night without failing.

Revision history for this message
mrk (mrk-aps) wrote :

fixed in release 3.14.7

The problem was in convertDoubleToWakeTime

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.7 Released

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.