Bug #541201 (mantis-139) “Long term timer thread failure under L...” : Bugs : EPICS Base

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2004-10-07:

#1

~/epicsR3.14/epics/base$ gdb catime
GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) run fishy 1
Starting program: /home/hill/epicsR3.14/epics/base/bin/linux-x86/catime fishy 1
[Thread debugging using libthread_db enabled]
[New Thread -1218524576 (LWP 3995)]
[New Thread -1218528336 (LWP 3998)]
[New Thread -1229018192 (LWP 3999)]
Testing with 1 channels named fishy
channel connect test
[New Thread -1239512144 (LWP 4000)]
Detaching after fork from child process 4001.

Program received signal SIGINT, Interrupt.
[Switching to Thread -1218524576 (LWP 3995)]
0x0011840b in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) const
Undefined command: "const".  Try "help".
(gdb) cont
Continuing.
[New Thread -1250002000 (LWP 4002)]
CA client library is unable to contact CA repeater after 50 tries.
Silence this message by starting a CA repeater daemon
or by calling ca_pend_event() and or ca_poll() more often.
pthread_cond_timedwait failed: error Invalid argument
[Switching to Thread -1229018192 (LWP 3999)]

Catchpoint 1 (exception thrown)
0x007521e6 in __cxa_throw () from /usr/lib/libstdc++.so.5
(gdb) bt
#0  0x007521e6 in __cxa_throw () from /usr/lib/libstdc++.so.5
#1  0x00d7e54a in epicsEvent::wait (this=0x89534dc,
    timeOut=0.026991000000000001) at ../../../src/libCom/osi/epicsEvent.cpp:78
#2  0x00d8acf0 in timerQueueActive::run (this=0x8953488)
    at ../../../src/libCom/timer/timerQueueActive.cpp:69
#3  0x00d7ca12 in epicsThreadCallEntryPoint (pPvt=0x89534e4)
    at ../../../src/libCom/osi/epicsThread.cpp:41
#4  0x00d83174 in start_routine (arg=0x8953788)
    at ../../../src/libCom/osi/os/posix/osdThread.c:294
#5  0x00115dec in start_thread () from /lib/tls/libpthread.so.0
#6  0x0028719a in clone () from /lib/tls/libc.so.6

(gdb) info threads
  5 Thread -1250002000 (LWP 4002)  0x0011821d in pthread_cond_wait@@GLIBC_2.3.2
    () from /lib/tls/libpthread.so.0
  4 Thread -1239512144 (LWP 4000)  0x00287dce in recvfrom ()
   from /lib/tls/libc.so.6
* 3 Thread -1229018192 (LWP 3999)  0x007521e6 in __cxa_throw ()
   from /usr/lib/libstdc++.so.5
  2 Thread -1218528336 (LWP 3998)  0x0011821d in pthread_cond_wait@@GLIBC_2.3.2
    () from /lib/tls/libpthread.so.0
  1 Thread -1218524576 (LWP 3995)  0x0011840b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0

(gdb) thread 1
[Switching to thread 1 (Thread -1218524576 (LWP 3995))]#0  0x0011840b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
(gdb) bt
#0  0x0011840b in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
#1  0x0029425d in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/tls/libc.so.6
#2  0x00d849b6 in epicsEventWaitWithTimeout (pevent=0x8952948,
    timeout=1.7976931348623157e+308)
    at ../../../src/libCom/osi/os/posix/osdEvent.c:124
#3  0x00d7e4e1 in epicsEvent::wait (this=0x8952898,
    timeOut=1.7976931348623157e+308)
    at ../../../src/libCom/osi/epicsEvent.cpp:72
#4  0x0037f691 in ca_client_context::blockForEventAndEnableCallbacks (
    this=0x8952820, event=@0x8952898, timeout=@0xbfffe4c8)
    at ../ca_client_context.cpp:620
#5  0x0037f171 in ca_client_context::pendIO (this=0x8952820,
    timeout=@0xbfffe518) at ../ca_client_context.cpp:519
#6  0x00362383 in ca_pend_io (timeout=0) at ../access.cpp:855
#7  0x08048e75 in test_search (pItems=0x8952128, iterations=1,
    pInlineIter=0xbfffe58c) at ../catime.c:115
#8  0x08049ad9 in timeIt (pfunc=0x8048deb <test_search>, pItems=0x8952128,
    iterations=1, nBytes=66) at ../catime.c:453
#9  0x08049edb in catime (channelName=0xbffff911 "fishy", channelCount=1,
    appNF=dontAppendNumber) at ../catime.c:542
#10 0x08048c0d in main (argc=3, argv=0xbfffe694) at ../catimeMain.c:43

(gdb) thread 2
[Switching to thread 2 (Thread -1218528336 (LWP 3998))]#0  0x0011821d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
(gdb) bt
#0  0x0011821d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
#1  0x002941d6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#2  0x00d8485b in epicsEventWait (pevent=0x8953178)
    at ../../../src/libCom/osi/os/posix/osdEvent.c:105
#3  0x00d7e433 in epicsEvent::wait (this=0x895310c)
    at ../../../src/libCom/osi/epicsEvent.cpp:63
#4  0x00d7b4a7 in ipAddrToAsciiEnginePrivate::run (this=0x8952ce8)
    at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:241
#5  0x00d7ca12 in epicsThreadCallEntryPoint (pPvt=0x8953114)
    at ../../../src/libCom/osi/epicsThread.cpp:41
#6  0x00d83174 in start_routine (arg=0x89532f0)
    at ../../../src/libCom/osi/os/posix/osdThread.c:294
#7  0x00115dec in start_thread () from /lib/tls/libpthread.so.0
#8  0x0028719a in clone () from /lib/tls/libc.so.6

(gdb) thread 4
[Switching to thread 4 (Thread -1239512144 (LWP 4000))]#0  0x00287dce in recvfrom () from /lib/tls/libc.so.6
(gdb) bt
#0  0x00287dce in recvfrom () from /lib/tls/libc.so.6
#1  0x003703d4 in udpRecvThread::run (this=0x896c9d8) at ../udpiiu.cpp:352
#2  0x00d7ca12 in epicsThreadCallEntryPoint (pPvt=0x896c9e8)
    at ../../../src/libCom/osi/epicsThread.cpp:41
#3  0x00d83174 in start_routine (arg=0x896cb68)
    at ../../../src/libCom/osi/os/posix/osdThread.c:294
#4  0x00115dec in start_thread () from /lib/tls/libpthread.so.0
#5  0x0028719a in clone () from /lib/tls/libc.so.6

(gdb) thread 5
[Switching to thread 5 (Thread -1250002000 (LWP 4002))]#0  0x0011821d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
(gdb) bt
#0  0x0011821d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
#1  0x002941d6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#2  0x00d8485b in epicsEventWait (pevent=0x896d608)
    at ../../../src/libCom/osi/os/posix/osdEvent.c:105
#3  0x00d6e6e6 in errlogThread () at ../../../src/libCom/error/errlog.c:458
#4  0x00d83174 in start_routine (arg=0x896dcf8)
    at ../../../src/libCom/osi/os/posix/osdThread.c:294
#5  0x00115dec in start_thread () from /lib/tls/libpthread.so.0
#6  0x0028719a in clone () from /lib/tls/libc.so.6

(gdb) thread 3
[Switching to thread 3 (Thread -1229018192 (LWP 3999))]#0  0x007521e6 in __cxa_throw () from /usr/lib/libstdc++.so.5
(gdb) bt
#0  0x007521e6 in __cxa_throw () from /usr/lib/libstdc++.so.5
#1  0x00d7e54a in epicsEvent::wait (this=0x89534dc,
    timeOut=0.026991000000000001) at ../../../src/libCom/osi/epicsEvent.cpp:78
#2  0x00d8acf0 in timerQueueActive::run (this=0x8953488)
    at ../../../src/libCom/timer/timerQueueActive.cpp:69
#3  0x00d7ca12 in epicsThreadCallEntryPoint (pPvt=0x89534e4)
    at ../../../src/libCom/osi/epicsThread.cpp:41
#4  0x00d83174 in start_routine (arg=0x8953788)
    at ../../../src/libCom/osi/os/posix/osdThread.c:294
#5  0x00115dec in start_thread () from /lib/tls/libpthread.so.0
#6  0x0028719a in clone () from /lib/tls/libc.so.6
(gdb) up
#1  0x00d7e54a in epicsEvent::wait (this=0x89534dc,
    timeOut=0.026991000000000001) at ../../../src/libCom/osi/epicsEvent.cpp:78
78              throw invalidSemaphore ();
Current language:  auto; currently c++
(gdb) print *this
$1 = {id = 0x8953610}
(gdb) up
#2  0x00d8acf0 in timerQueueActive::run (this=0x8953488)
    at ../../../src/libCom/timer/timerQueueActive.cpp:69
69              this->rescheduleEvent.wait ( delay );
(gdb) print *this
$2 = {<epicsTimerQueueActive> = {<epicsTimerQueue> = {
      _vptr.epicsTimerQueue = 0xd9a568}, <No data fields>}, <epicsThreadRunable> = {_vptr.epicsThreadRunable = 0xd9a590}, <epicsTimerQueueNotify> = {
    _vptr.epicsTimerQueueNotify = 0xd9a5ac}, <timerQueueActiveMgrPrivate> = {
    _vptr.timerQueueActiveMgrPrivate = 0xd9a5c4, referenceCount = 1},
  queue = {<epicsTimerQueue> = {_vptr.epicsTimerQueue = 0xd9a818},
    timerFreeList = {mutex = {id = 0x8953538}, pFreeList = 0x896ce60,
      pChunkList = 0x896cc60}, timerForCFreeList = {mutex = {id = 0x8953570},
      pFreeList = 0x0, pChunkList = 0x0}, mutex = {id = 0x89535a8},
    cancelBlockingEvent = {id = 0x89535c0}, timerList = {pFirst = 0x896cca0,
      pLast = 0x896ce40, itemCount = 16}, notify = @0x8953490,
    pExpireTmr = 0x0, processThread = 0x0, cancelPending = false},
  rescheduleEvent = {id = 0x8953610}, exitEvent = {id = 0x8953660}, thread = {
    runable = @0x895348c, id = 0x8953788, mutex = {id = 0x89536d0}, event = {
      id = 0x89536e8}, exitEvent = {id = 0x8953738},
    pWaitReleaseFlag = 0xb6bea9db, begin = true, cancel = false,
    terminated = false}, sleepQuantum = 0.01, okToShare = false,
  exitFlag = false, terminateFlag = false}

(gdb) call epicsMutexShowAll ( 0,100 )
ellCount(&mutexList) 23 ellCount(&freeList) 0
epicsMutexId 0x89520a0 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x89520d8 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8952110 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x89527f0 source ../../../src/libCom/misc/epicsExit.c line 43
epicsMutexId 0x89528f8 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8952930 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8952c10 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8952c48 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8952c80 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8953160 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8953238 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8953420 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8953470 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8953538 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8953570 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x89535a8 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x89536d0 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x8953d90 source ../../../src/libCom/freeList/freeListLib.c line 48
epicsMutexId 0x8954588 source ../../../src/libCom/freeList/freeListLib.c line 48
epicsMutexId 0x896cab0 source ../../../src/libCom/osi/epicsMutex.cpp line 197
epicsMutexId 0x896d678 source ../../../src/libCom/error/errlog.c line 392
epicsMutexId 0x896d6b0 source ../../../src/libCom/error/errlog.c line 393
epicsMutexId 0x896d788 source ../../../src/libCom/error/errlog.c line 396

edited on: 2004-10-07 08:40

Jeff Hill (johill-lanl) wrote on 2004-10-08:

#2

Hm..., very curious: it was invalid when it crashed, but now it's OK?

(gdb) print this->rescheduleEvent
$3 = {id = 0x8953610}
(gdb) call this->rescheduleEvent.wait (1.0)
$4 = false
(gdb) down
#1 0x00d7e54a in epicsEvent::wait (this=0x89534dc,
timeOut=0.026991000000000001) at ../../../src/libCom/osi/epicsEvent.cpp:78
78 throw invalidSemaphore ();
(gdb) print *this
$5 = {id = 0x8953610}
(gdb) call epicsEventWaitWithTimeout (this->id,1.0)
$6 = epicsEventWaitTimeout

Jeff Hill (johill-lanl) wrote on 2004-10-08:

#3

It looks like the POSIX version of epicsEventWaitWithTimeout occasionally returns a status that isn't epicsEventWaitOK or epicsEventWaitTimeout when the semaphore is valid.

I am starting to suspect that this is mantis-96 (posix thread system call returns EINTR) again (which Andrew may have prematurely closed). If that happened, the code below would return a status that isn't epicsEventWaitOK or epicsEventWaitTimeout even though the semaphore is valid.

epicsEventWaitStatus epicsEventWaitWithTimeout(epicsEventId pevent, double timeout)
{
    struct timespec wakeTime;
    int status = 0;
    int unlockStatus;

    status = pthread_mutex_lock(&pevent->mutex);
    checkStatusQuit(status,"pthread_mutex_lock","epicsEventWaitWithTimeout");
    if(!pevent->isFull) {
        convertDoubleToWakeTime(timeout,&wakeTime);
        status = pthread_cond_timedwait(
            &pevent->cond,&pevent->mutex,&wakeTime);
    }
    if(status==0) pevent->isFull = 0;
    unlockStatus = pthread_mutex_unlock(&pevent->mutex);
    checkStatusQuit(unlockStatus,"pthread_mutex_unlock","epicsEventWaitWithTimeout");
    if(status==0) return(epicsEventWaitOK);
    if(status==ETIMEDOUT) return(epicsEventWaitTimeout);
    checkStatus(status,"pthread_cond_timedwait");
    return(epicsEventWaitError);
}

Jeff Hill (johill-lanl) wrote on 2004-10-12:

#4

From Marty Kraimer:

Note that the SUSV3 states for pthread_mutex_lock:

These functions shall not return an error code of [EINTR].

Doesn't this mean the problem must be something besides a signal?

Jeff Hill (johill-lanl) wrote on 2004-10-12:

#5

We can conclude that (A) some other errno is being returned or (B) this (very recent) version of RH Linux isn’t compliant with SUSV3.

At a minimum, the code should probably be changed to print a message including errno, or strerror(errno), when it is returning status indicating a bad semaphore?

After that change is in place I will need to wait 2 days until this occurs again, but that’s ok, I can wait.
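
A minimal sketch of such a diagnostic, assuming the status passed in is the value returned by the pthread call itself (pthread functions report errors through their return value rather than through errno). describeStatus is a hypothetical helper, not an existing libCom routine.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of EPICS libCom): format a diagnostic for a
 * failed pthread call. The caller passes the error number that the pthread
 * function returned, so both the number and its text end up in the log. */
static int describeStatus(char *buf, size_t len, const char *call, int status)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s failed: status %d (%s)",
                    call, status, strerror(status));
}
```

Logging the numeric status next to the text makes it possible to tell EINVAL from EINTR or EPERM without attaching a debugger.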

Jeff Hill (johill-lanl) wrote on 2004-10-12:

#6

I had a look at the SUSV3 also and apparently pthread_cond_timedwait() is allowed to return only ETIMEDOUT or EINVAL. So this *is* looking like a compliance issue. Nevertheless, the client library, IOC, GW etc will be failing roughly every two days so this probably needs to be dealt with.

Jeff Hill (johill-lanl) wrote on 2004-10-12:

#7

From Eric Norum,

Jeff Hill wrote:
> Eric,
>
> Be certain to scroll down and read the bug note entries at the bottom as
> they are the ones I created most recently - after I deduced the underlying
> cause of the problem.
>
> This may not be directly relevant to Mantis 139, but I should add this
> comment anyways. I seem to recall that you are blocking signals, but what is
> to stop users from just unblocking them again.

"Doctor, it hurts when I do this".
"So, don't do that".

There are library routines and packages not under our control which may not
take well to signals. My feeling is that users should not be mucking about
with signals unless they know what they're doing -- and that means only
unblocking signals when the user code is ready to handle them and when the
thread which has unblocked them is not in an EPICS, or any other, library routine.

Blocking signals from all but the main thread seems to me to be a simple and
prudent thing to do. If we hear an uproar from someone we can revisit the
problem later.

> Is that a robust solution?

I'd say that it's a reasonable solution and easy to describe.

> My
> conclusion acquired after dealing with these issues in the CA client library
> is that all blocking system calls must be protected from interrupting
> circumstances by checking the status for EINTR and properly responding.

Yes, a good idea, but there is also code not under our control which must be
protected. Blocking signals is an easy way to ensure that this code is not
stressed unnecessarily.

> I am
> aware of this issue because the CA client library has been living in the
> same process with many different user codes. That will also be true with the
> OSI stuff in libCom - it will not always be running in an IOC, and the
> natural tendency of users will be to do as they please, and be annoyed with
> any system that restricts their freedom.

I think that we should leave this until we know that it's really causing
problems for someone.

Jeff Hill (johill-lanl) wrote on 2004-10-12:

#8

There is certainly nothing wrong with blocking signals by default. What might be wrong would be using that default state as a safety net that justifies not dealing with EINTR in code that uses blocking POSIX calls.

Admittedly, the standard says that only ETIMEDOUT and EINVAL are returned, but in practice something else must be occurring, because in the debugger a subsequent call returns status indicating that the semaphore is ok.

Jeff Hill (johill-lanl) wrote on 2004-10-12:

#9

From Eric Norum,

After a little more thinking, my position is hardening. I don't think that it
is reasonable to clutter up the code everywhere with constructs like:
    do {
        status = someSysCall(..........);
    } while ((status == ERROR_STATUS_CODE) && (errno == EINTR));

Requiring the above is likely to cause lots of problems with drivers which
miss checking this every place it is needed.

Instead, signals should be blocked unless the user has good reason to allow
them -- and then only unblocked when the user code is about to call a small
subset of the EPICS libCom routines (epicsThreadSleep, epicsEventWait,
epicsEventTimedWait, and maybe a few others). Thus only this documented
subset of routines need to worry about EINTR returns.

Jeff Hill (johill-lanl) wrote on 2004-10-12:

#10

The CA client library is certainly cluttered just as you describe, but this only occurs with socket calls that block such as send() recv(), or select(). These calls already need to have substantial code dealing with all of the many possible error conditions.

This has always been a controversial issue with UNIX. Fortunately, today POSIX is clear about what calls need EINTR protection.

Our problem here initially appears to be noncompliance, but the problem is hopefully restricted to pthread_cond_timedwait and/or pthread_mutex_lock.

Jeff Hill (johill-lanl) wrote on 2004-10-12:

#11

Below is the man entry for pthread_cond_timedwait on the very recent Linux system where the problem occurs. The same content on EINTR is in the man pages of a less recent Linux system where I have been so-far unable to reproduce the problem.

---------snip-----snip----------------

The pthread_cond_timedwait function returns the following error codes
on error:

ETIMEDOUT
the condition variable was not signaled until the timeout
specified by abstime

EINTR pthread_cond_timedwait was interrupted by a signal

Jeff Hill (johill-lanl) wrote on 2004-10-20:

#12

I am running a long term test to verify that the problem is fixed

mrk (mrk-aps) wrote on 2004-10-20:

#13

probably resolved 3.14.7

Jeff Hill (johill-lanl) wrote on 2004-10-21:

#14

I left catime running on the same Linux system for about one day, and unfortunately the following occurred.

catime fishfood
Testing with 10000 channels named fishfood
channel connect test
pthread_cond_timedwait failed: error Invalid argument
Aborted

Jeff Hill (johill-lanl) wrote on 2004-10-21:

#15

Is the "error Invalid argument" diagnostic a clue? Apparently, my guess that errno==EINTR was dead wrong. Is it possible that the delay passed in to pthread_cond_timedwait has been allowed to go out of range? There should only be a limited set of circumstances that could lead to errno="Invalid argument". I think that I have already ruled out an invalid semaphore id (see my notes in Mantis). So perhaps we can deduce which one it is based on a process of elimination.

Jeff Hill (johill-lanl) wrote on 2004-10-21:

#16

From: Marty Kraimer:

Note that this is the same error I got leaving a CA client running overnight

Marty Kraimer wrote:
I left dbcaPerform running overnight. When I came in this morning I found:

mercury% ../../bin/solaris-sparc/changeLinks 100 5 0xf
pthread_cond_timedwait failed: error Invalid argument
epicsThread: Unexpected C++ exception "epicsEvent::invalidSemaphore()"
with type "epicsEvent::invalidSemaphore" in thread "timerQueue" at Wed
Oct 20 2004 22:02:27.974537800
epicsThread: Unexpected C++ exception "epicsEvent::invalidSemaphore()"
with type "epicsEvent::invalidSemaphore" in thread "timerQueue" at Wed
Oct 20 2004 22:02:27.974537800
Abort (core dumped)

I am adding some print statements to osdEvent to see if I can find out
more info and will try again.

Jeff Hill (johill-lanl) wrote on 2004-10-22:

#17

I started my test this morning and it failed in less than an hour (while placing the test in the background to read a man page) as follows.

:pthread_cond_timedwait failed: error Invalid argument
epicsThread: Unexpected C++ exception "epicsEvent::invalidSemaphore()" with type "N10epicsEvent16invalidSemaphoreE" in thread "timerQueue" at Fri Oct 22 2004 09:17:47.977425000

Jeff Hill (johill-lanl) wrote on 2004-10-22:

#18

I see the bug.

The code is assuming that EINTR is in the status returned from the function. That's not the case. The status will be -1 and EINTR will be in errno.

    int status;
    while(1) {
        status = pthread_cond_timedwait(condId,mutexId,time);
        if(status!=EINTR) return status;
        errlogPrintf("pthread_cond_timedwait returned EINTR. Violates SUSv3\n");
    }

Jeff Hill (johill-lanl) wrote on 2004-10-22:

#19

From: Marty Kraimer

The standard states:

RETURN VALUE

    Except in the case of [ETIMEDOUT], all these error checks shall act
    as if they were performed immediately at the beginning of processing
    for the function and shall cause an error return, in effect, prior
    to modifying the state of the mutex specified by mutex or the
    condition variable specified by cond.

Upon successful completion, a value of zero shall be returned;
otherwise, an error number shall be returned to indicate the error.

ERRORS

The pthread_cond_timedwait() function shall fail if:

    [ETIMEDOUT]
        The time specified by abstime to pthread_cond_timedwait() has
        passed.
    [EINVAL]
        The value specified by abstime is invalid.

These functions may fail if:

    [EINVAL]
        The value specified by cond or mutex is invalid.
    [EPERM]
        The mutex was not owned by the current thread at the time of the
        call.

These functions shall not return an error code of [EINTR].

------------------------------------------------------------------------

So what does it mean?

I looked at "Programming with POSIX Threads" and it shows that the
pthread_xxx routines return the error status instead of putting it in errno.
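
The return-value convention can be checked directly. A minimal sketch, assuming a POSIX system: it waits with a deadline that has already expired and hands back whatever pthread_cond_timedwait() returns. The helper name timedWaitPastDeadline is made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper: wait on a condition variable whose absolute deadline
 * (the epoch) expired long ago and return the status reported by
 * pthread_cond_timedwait(). Nothing ever signals the condition, so the only
 * way out of the wait is the timeout. */
static int timedWaitPastDeadline(void)
{
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    struct timespec past = { .tv_sec = 0, .tv_nsec = 0 };
    int status;
    int lockStatus;

    lockStatus = pthread_mutex_lock(&mutex);
    assert(lockStatus == 0);
    (void) lockStatus;
    status = pthread_cond_timedwait(&cond, &mutex, &past);
    pthread_mutex_unlock(&mutex);

    pthread_cond_destroy(&cond);
    pthread_mutex_destroy(&mutex);
    return status;
}
```

The status that comes back is the error number ETIMEDOUT itself, not -1 with the cause left in errno.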

Jeff Hill (johill-lanl) wrote on 2004-10-22:

#20

I was looking at the VERY old book "Guide to DEC Threads" and apparently this has changed, as that book indicated that the error code was in errno, as is POSIX convention.

The Linux man page states that "all condition variable functions return 0 on success and a non-zero error code on error". I wasn't certain if that meant "returns -1 on error and sets errno as is POSIX convention". That's why I looked at the "Guide to DEC Threads" book. I should have looked in the Single UNIX Specification.

So we still have a mystery.

Jeff Hill (johill-lanl) wrote on 2004-10-22:

#21

New theory....

The function convertDoubleToWakeTime has the following code. It looks like the test should be "if(wakeTime->tv_nsec>=1000000000L)". A ">=" is required because the fractional part may not hold more than 999999999 nanoseconds.

    if(wakeTime->tv_nsec>1000000000L) {
        wakeTime->tv_nsec -= 1000000000L;
        ++wakeTime->tv_sec;
    }
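
A sketch of the proposed boundary fix in isolation. normalizeWakeTime and NSEC_PER_SEC are hypothetical names; the real convertDoubleToWakeTime also computes the wake time, which is not reproduced here. The sketch assumes tv_nsec holds the sum of two values that were each below one second.

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L   /* hypothetical name for the constant */

/* Hypothetical helper: carry overflowing nanoseconds into the seconds field
 * so that 0 <= tv_nsec <= 999999999 on return. The comparison uses ">=" so
 * that a value of exactly one second is carried as well. */
static void normalizeWakeTime(struct timespec *wakeTime)
{
    assert(wakeTime->tv_nsec >= 0 && wakeTime->tv_nsec < 2 * NSEC_PER_SEC);
    if (wakeTime->tv_nsec >= NSEC_PER_SEC) {
        wakeTime->tv_nsec -= NSEC_PER_SEC;
        ++wakeTime->tv_sec;
    }
}
```

With the original ">" test, a tv_nsec of exactly 1000000000 would be left in place and handed to pthread_cond_timedwait unchanged.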

Jeff Hill (johill-lanl) wrote on 2004-10-26:

#22

From Marty

> > Good news!!!
> > With the change Jeff suggested to convertDoubleToWakeTime, i.e. don't let
> > tv_nsec==1000000000L
> > the changeLinks test ran all night without failing.
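
The failure mode can be reproduced on purpose. A sketch, assuming a POSIX system that validates abstime as the standard requires: timedWaitWithNsec (a hypothetical helper) waits with an already-expired deadline whose tv_nsec is chosen by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper: call pthread_cond_timedwait() with tv_sec = 0 (long
 * expired, so a valid deadline times out at once) and the given tv_nsec,
 * then return the status. Nothing signals the condition variable. */
static int timedWaitWithNsec(long nsec)
{
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    struct timespec wake = { .tv_sec = 0, .tv_nsec = nsec };
    int status;
    int lockStatus;

    lockStatus = pthread_mutex_lock(&mutex);
    assert(lockStatus == 0);
    (void) lockStatus;
    status = pthread_cond_timedwait(&cond, &mutex, &wake);
    pthread_mutex_unlock(&mutex);

    pthread_cond_destroy(&cond);
    pthread_mutex_destroy(&mutex);
    return status;
}
```

Passing 999999999L should give ETIMEDOUT, while 1000000000L should give EINVAL, the same "Invalid argument" seen in the failing runs.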

mrk (mrk-aps) wrote on 2004-11-11:

#23

fixed in release 3.14.7

The problem was in convertDoubleToWakeTime

Andrew Johnson (anj) wrote on 2004-12-06:

#24

R3.14.7 Released

EPICS Base

Long term timer thread failure under Linux kernel 2.4.21-20.ELsmp #1 SMP
