crash while exiting ca client

Bug #541362 reported by Jeff Hill
18
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
High
Jeff Hill

Bug Description

From Bob Soliday:

I have R3-14-2_branch from 2009-02-05 on linux-x86_64 built with HOST_OPT=NO and I compiled the included caputTest.c also with HOST_OPT=NO. I ran this inside gdb multiple times and got multiple problems. All the variations of gdb output are included in the CAissues file. Basically if I run it enough times I can get it to crash while exiting.

Additional information:
No problems with this but it looks like the last thread is not exited.

Starting program: /home/helios1/SOLIDAY/epics/extensions/src/oagca/O.linux-x86_64/caputTest
[Thread debugging using libthread_db enabled]
[New Thread 140204410697472 (LWP 8040)]
[New Thread 1084721488 (LWP 8050)]
[New Thread 1093708112 (LWP 8051)]
[New Thread 1081780560 (LWP 8052)]
[New Thread 1079662928 (LWP 8053)]
[New Thread 1078569296 (LWP 8054)]
[New Thread 1105955152 (LWP 8055)]
[New Thread 1088198992 (LWP 8056)]
[Thread 1081780560 (LWP 8052) exited]
[Thread 1079662928 (LWP 8053) exited]
[Thread 1105955152 (LWP 8055) exited]
[Thread 1078569296 (LWP 8054) exited]
[Thread 1093708112 (LWP 8051) exited]
[Thread 1084721488 (LWP 8050) exited]
[Thread 1088198992 (LWP 8056) exited]
[New Thread 1093708112 (LWP 8057)]

_____________________________________________________
No problems with this

Starting program: /home/helios1/SOLIDAY/epics/extensions/src/oagca/O.linux-x86_64/caputTest
[Thread debugging using libthread_db enabled]
[New Thread 139738863658752 (LWP 8058)]
[New Thread 1098991952 (LWP 8068)]
[New Thread 1089534288 (LWP 8069)]
[New Thread 1106905424 (LWP 8070)]
[New Thread 1081514320 (LWP 8071)]
[New Thread 1077770576 (LWP 8072)]
[New Thread 1082042704 (LWP 8073)]
[New Thread 1091991888 (LWP 8074)]
[Thread 1106905424 (LWP 8070) exited]
[Thread 1081514320 (LWP 8071) exited]
[Thread 1082042704 (LWP 8073) exited]
[Thread 1077770576 (LWP 8072) exited]
[Thread 1091991888 (LWP 8074) exited]
[Thread 1089534288 (LWP 8069) exited]
[Thread 1098991952 (LWP 8068) exited]
[New Thread 1089534288 (LWP 8075)]
[Thread 1089534288 (LWP 8075) exited]

_____________________________________________________
Segmentation fault with this one

Starting program: /home/helios1/SOLIDAY/epics/extensions/src/oagca/O.linux-x86_64/caputTest
[Thread debugging using libthread_db enabled]
[New Thread 139657383642880 (LWP 7791)]
[New Thread 1103264080 (LWP 7801)]
[New Thread 1100908880 (LWP 7802)]
[New Thread 1087953232 (LWP 7803)]
[New Thread 1100618064 (LWP 7804)]
[New Thread 1096583504 (LWP 7805)]
[New Thread 1077164368 (LWP 7806)]
[New Thread 1093613904 (LWP 7807)]
[Thread 1087953232 (LWP 7803) exited]
[Thread 1103264080 (LWP 7801) exited]
[Thread 1100908880 (LWP 7802) exited]
[Thread 1096583504 (LWP 7805) exited]
[Thread 1077164368 (LWP 7806) exited]
[Thread 1100618064 (LWP 7804) exited]
[New Thread 1091684688 (LWP 7808)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1093613904 (LWP 7807)]
0x0000000000436741 in tcpRecvWatchdog::shutdown (this=0x2222910) at ../tcpRecvWatchdog.cpp:229
229 this->timer.cancel ();
Current language: auto; currently c++
(gdb) bt
#0 0x0000000000436741 in tcpRecvWatchdog::shutdown (this=0x2222910) at ../tcpRecvWatchdog.cpp:229
#1 0x0000000000430d46 in ~tcpiiu (this=0x2222788) at ../tcpiiu.cpp:1015
#2 0x00000000004171f4 in cac::destroyIIU (this=0x21fdc50, iiu=@0x2222788) at ../cac.cpp:1152
#3 0x0000000000433756 in tcpSendThread::run (this=0x22228c8) at ../tcpiiu.cpp:228
#4 0x00000000004463c5 in epicsThreadCallEntryPoint (pPvt=0x22228d0) at ../../../src/libCom/osi/epicsThread.cpp:42
#5 0x000000000044cdb2 in start_routine (arg=0x22382e0) at ../../../src/libCom/osi/os/posix/osdThread.c:283
#6 0x00000031b1806407 in start_thread () from /lib64/libpthread.so.0
#7 0x00000031b0cd4b0d in clone () from /lib64/libc.so.6

_____________________________________________________
I don't know what happened here

Starting program: /home/helios1/SOLIDAY/epics/extensions/src/oagca/O.linux-x86_64/caputTest
[Thread debugging using libthread_db enabled]
[New Thread 140484916070144 (LWP 7985)]
[New Thread 1097197904 (LWP 7995)]
[New Thread 1076332880 (LWP 7996)]
[New Thread 1092413776 (LWP 7997)]
[New Thread 1080375632 (LWP 7998)]
[New Thread 1078511952 (LWP 7999)]
[New Thread 1093278032 (LWP 8000)]
[New Thread 1081186640 (LWP 8001)]
[Thread 1092413776 (LWP 7997) exited]
[Thread 1078511952 (LWP 7999) exited]
[Thread 1081186640 (LWP 8001) exited]
[Thread 1093278032 (LWP 8000) exited]
[Thread 1080375632 (LWP 7998) exited]
[Thread 1076332880 (LWP 7996) exited]
[Thread 1097197904 (LWP 7995) exited]
[New Thread 1092413776 (LWP 8002)]
[Switching to Thread 1093278032 (LWP 8000)]
Cannot remove breakpoints because program is no longer writable.
It might be running in another process.
Further execution is probably impossible.
0x00000031b18065ac in start_thread () from /lib64/libpthread.so.0
ptrace: No such process.

______________________________________________________
Not sure about this one either

Starting program: /home/helios1/SOLIDAY/epics/extensions/src/oagca/O.linux-x86_64/caputTest
[Thread debugging using libthread_db enabled]
[New Thread 140095465068288 (LWP 8233)]
[New Thread 1095080272 (LWP 8243)]
[New Thread 1081846096 (LWP 8244)]
[New Thread 1090562384 (LWP 8245)]
[New Thread 1103030608 (LWP 8246)]
[New Thread 1080772944 (LWP 8247)]
[New Thread 1091090768 (LWP 8248)]
[New Thread 1085905232 (LWP 8249)]
[Thread 1090562384 (LWP 8245) exited]
[Thread 1080772944 (LWP 8247) exited]
[Thread 1103030608 (LWP 8246) exited]
[Thread 1095080272 (LWP 8243) exited]
[Thread 1081846096 (LWP 8244) exited]
[Thread 1085905232 (LWP 8249) exited]
[Thread 1091090768 (LWP 8248) exited]
[New Thread 1080772944 (LWP 8250)]
Couldn't get registers: No such process.

____________________________________________________
It completely hung up with this and gdb had to be killed

Starting program: /home/helios1/SOLIDAY/epics/extensions/src/oagca/O.linux-x86_64/caputTest
[Thread debugging using libthread_db enabled]
[New Thread 139927175259904 (LWP 8539)]
[New Thread 1086777680 (LWP 8551)]
[New Thread 1081743696 (LWP 8552)]
[New Thread 1097648464 (LWP 8553)]
[New Thread 1089816912 (LWP 8554)]
[New Thread 1106667856 (LWP 8555)]
[New Thread 1089272144 (LWP 8556)]
[New Thread 1103780176 (LWP 8557)]
[Thread 1097648464 (LWP 8553) exited]
[Thread 1106667856 (LWP 8555) exited]
[Thread 1089816912 (LWP 8554) exited]
[Thread 1103780176 (LWP 8557) exited]
[Thread 1089272144 (LWP 8556) exited]
[Thread 1086777680 (LWP 8551) exited]
[Thread 1081743696 (LWP 8552) exited]
[New Thread 1106667856 (LWP 8558)]

_______________________________________________________
This one I added an abort() to osdMutex.c right after the
"epicsMutex pthread_mutex_lock failed: error Invalid argument"
error message so I could catch the backtrace.

Starting program: /home/helios1/SOLIDAY/epics/extensions/src/oagca/O.linux-x86_64/caputTest
[Thread debugging using libthread_db enabled]
[New Thread 140616365545216 (LWP 9163)]
[New Thread 1088264528 (LWP 9173)]
[New Thread 1104468304 (LWP 9174)]
[New Thread 1088530768 (LWP 9175)]
[New Thread 1096276304 (LWP 9176)]
[New Thread 1086306640 (LWP 9177)]
[New Thread 1089059152 (LWP 9178)]
[New Thread 1092725072 (LWP 9179)]
[Thread 1088530768 (LWP 9175) exited]
[Thread 1089059152 (LWP 9178) exited]
[Thread 1096276304 (LWP 9176) exited]
[Thread 1092725072 (LWP 9179) exited]
[Thread 1088264528 (LWP 9173) exited]
[Thread 1104468304 (LWP 9174) exited]
[New Thread 1092725072 (LWP 9180)]
epicsMutex pthread_mutex_lock1 failed DEBUG: error Invalid argument

Program received signal SIGABRT, Aborted.
[Switching to Thread 1086306640 (LWP 9177)]
0x00000031b0c30ec5 in raise () from /lib64/libc.so.6

(gdb) bt full
#0 0x00000031b0c30ec5 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00000031b0c32970 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x000000000044e128 in epicsMutexOsdLock (pmutex=0x22b78a0) at ../../../src/libCom/osi/os/posix/osdMutex.c:115
        status = 22
#3 0x0000000000446793 in epicsMutexLock (pmutexNode=0x22b6cc0) at ../../../src/libCom/osi/epicsMutex.cpp:145
        status = epicsMutexLockOK
#4 0x0000000000442095 in freeListFree (pvt=0x22b7870, pmem=0x22e3710) at ../../../src/libCom/freeList/freeListLib.c:122
        status = epicsMutexLockOK
        pfl = (FREELISTPVT *) 0x22b7870
        ppnext = (void **) 0x40bfac70
#5 0x00000000004345e2 in cac::releaseSmallBufferTCP (this=0x22b5c50, pBuf=0x22e3710 "0\005/\002") at ../cac.h:388
No locals.
#6 0x0000000000430da5 in ~tcpiiu (this=0x22da3f0) at ../tcpiiu.cpp:1024
No locals.
#7 0x00000000004171f4 in cac::destroyIIU (this=0x22b5c50, iiu=@0x22da3f0) at ../cac.cpp:1152
        mgr = {<notifyGuard> = {notify = @0x22b5860}, cbGuard = {_pTargetMutex = 0x22b5930}}
        guard = {_pTargetMutex = 0x22b5928}
        addr = {ia = {sin_family = 2, sin_port = 51219, sin_addr = {s_addr = 2466395812}, sin_zero = {30 '\036', 195 'Ã', 66 'B', 0 '\0', 0 '\0',
      0 '\0', 0 '\0', 0 '\0'}}, sa = {sa_family = 2, sa_data = "\023Ȥ6\002\223\036ÃB\000\000\000\000"}}
#8 0x0000000000433756 in tcpSendThread::run (this=0x22da530) at ../tcpiiu.cpp:228
        guard = {_pTargetMutex = 0x22b5928}
#9 0x00000000004463c5 in epicsThreadCallEntryPoint (pPvt=0x22da538) at ../../../src/libCom/osi/epicsThread.cpp:42
        pThread = (epicsThread *) 0x22da538
        waitRelease = true
#10 0x000000000044cdb2 in start_routine (arg=0x22d7e60) at ../../../src/libCom/osi/os/posix/osdThread.c:283
        pthreadInfo = (epicsThreadOSD *) 0x22d7e60
        status = 0
        oldtype = 0
        blockAllSig = {__val = {18446744067267100671, 18446744073709551615 <repeats 15 times>}}
#11 0x00000031b1806407 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#12 0x00000031b0cd4b0d in clone () from /lib64/libc.so.6
No symbol table info available.

Original Mantis Bug: mantis-334
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=334

Tags: ca 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

Here is the source code for the test:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cadef.h>
#include <epicsVersion.h>

int main(int argc, char **argv)
{

  char name1[40], name2[40];
  chid chid1, chid2;
  long result;

  sprintf(name1, "S30B:P1:msAve:AveEnbBO");
  sprintf(name2, "S31B:P1:msAve:AveEnbBO");

  result = ca_context_create(ca_disable_preemptive_callback);
  if (result != ECA_NORMAL) {
    fprintf(stderr, "CA error %s occurred while trying to start channel access.\n", ca_message(result));
    return(1);
  }

  result = ca_search(name1, &chid1);
  if (result != ECA_NORMAL) {
    fprintf(stderr, "error: problem doing search for %s\n", name1);
    return(1);
  }

  result = ca_search(name2, &chid2);
  if (result != ECA_NORMAL) {
    fprintf(stderr, "error: problem doing search for %s\n", name2);
    return(1);
  }

  result = ca_pend_io(10);
  if (result != ECA_NORMAL) {
    fprintf(stderr, "warning: problem doing search for PVs\n");
  }

  result = ca_state(chid1);
  if (result != cs_conn) {
    result = ca_put(DBR_STRING, chid1, "0");
    if (result != ECA_NORMAL) {
      fprintf(stderr, "error: problem doing put for %s\n", name1);
      return(1);
    }
  }

  result = ca_state(chid2);
  if (result != cs_conn) {
    result = ca_put(DBR_STRING, chid2, "0");
    if (result != ECA_NORMAL) {
      fprintf(stderr, "error: problem doing put for %s\n", name2);
      return(1);
    }
  }

  result = ca_pend_io(10);
  if (result != ECA_NORMAL) {
    fprintf(stderr, "error: problem doing put for PVs\n");
  }

  ca_context_destroy();

  return(0);

}

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I ran into this build issue on Linux when attempting to reproduce this

~/epicsR3.14/epics/appl/clientApp$ make
perl ../../base/bin/linux-x86/makeMakefile.pl O.linux-x86 ../..
perl ../../base/bin/linux-x86/makeMakefile.pl O.vxWorks-ppc603 ../..
mkdir O.Common
make -C O.linux-x86 -f ../Makefile TOP=../.. T_A=linux-x86 install
make[1]: Entering directory `/home/hill/epicsR3.14/epics/appl/clientApp/O.linux-x86'
/usr/bin/gcc -c -D_POSIX_C_SOURCE=199506L -D_POSIX_THREADS -D_XOPEN_SOURCE=500 -D_X86_ -DUNIX -D_BSD_SOURCE -Dlinux -D_REENTRANT -O3 -Wall -m32 -g -I. -I../O.Common -I. -I.. -I../../include/os/Linux -I../../include -I../../../base/include/os/Linux -I../../../base/include ../caExample.c
/usr/bin/g++ -o caExample -L/home/hill/epicsR3.14/epics/base/lib/linux-x86 -Wl,-rpath,/home/hill/epicsR3.14/epics/base/lib/linux-x86 -m32 caExample.o -lcas -lgdd -lasHost -ldbStaticHost -lregistryIoc -lca -lCom
/home/hill/epicsR3.14/epics/base/lib/linux-x86/libCom.so: undefined reference to `dlerror'
/home/hill/epicsR3.14/epics/base/lib/linux-x86/libCom.so: undefined reference to `dlopen'
/home/hill/epicsR3.14/epics/base/lib/linux-x86/libCom.so: undefined reference to `dlsym'
collect2: ld returned 1 exit status
make[1]: *** [caExample] Error 1
make[1]: Leaving directory `/home/hill/epicsR3.14/epics/appl/clientApp/O.linux-x86'
make: *** [install.linux-x86] Error 2

Revision history for this message
Jeff Hill (johill-lanl) wrote :

While working on mantis 334 I put a loop around Bob’s test so that it was continually doing essentially this.

Forever {
  ca_context_create();
  Do some ca stuff
  ca_context_destroy ();
End forever

I haven’t reproduced the crash that Bob reported, but I did see that memory was leaked, and the process eventually ran out of threads.

Closer inspect revealed that after this program runs for a short time period there were many threads (208 of them) like below in the Linux process. This is old Linux BTW.

My suspicion is that this is new behavior. Is this related to Mantis 326 ('FATAL: exception not rethrown' errors from NPTL)?

I’m guessing that something is wrong at a low level since this code is linked and compiled debug and I don’t see anything on the stack from EPICS.

(gdb) bt
#0 0x00242b1c in __nanosleep_nocancel () from /lib/tls/libc.so.6
#1 0x00c6afc0 in epicsThreadSleep (seconds=1)
    at ../../../src/libCom/osi/os/posix/osdThread.c:573
#2 0x080489b8 in main (argc=1, argv=0xbfff9a44) at ../caTest.c:72
(gdb) thread 2
[Switching to thread 2 (Thread -1297912912 (LWP 22606))]#0 0x0011a939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0x0011a939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 0x00116b21 in _L_mutex_lock_949 () from /lib/tls/libpthread.so.0
#2 0x00000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread -1296856144 (LWP 22604))]#0 0x0011a939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0x0011a939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 0x00116b21 in _L_mutex_lock_949 () from /lib/tls/libpthread.so.0
#2 0x00000000 in ?? ()
(gdb) thread 8
[Switching to thread 8 (Thread -1295000656 (LWP 22578))]#0 0x0011a939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0x0011a939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 0x00116b21 in _L_mutex_lock_949 () from /lib/tls/libpthread.so.0
#2 0x00000000 in ?? ()

There is one thread in the process like this, but no doubt its benign.
(gdb) bt
#0 0x001183ad in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
#1 0x00283006 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#2 0x00c6bc46 in condWait (condId=0x8288328, mutexId=0x8288310)
    at ../../../src/libCom/osi/os/posix/osdEvent.c:77
#3 0x00c6bff9 in epicsEventWait (pevent=0x8288310)
    at ../../../src/libCom/osi/os/posix/osdEvent.c:139
#4 0x00c4fd03 in errlogThread () at ../../../src/libCom/error/errlog.c:489
#5 0x00c6a36b in start_routine (arg=0x8288360)
    at ../../../src/libCom/osi/os/posix/osdThread.c:283
#6 0x00115dd8 in start_thread () from /lib/tls/libpthread.so.0
#7 0x00275fca in clone () from /lib/tls/libc.so.6

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Some source code for my test:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cadef.h>
#include <epicsVersion.h>
#include <epicsThread.h>

int main(int argc, char **argv)
{

    char name1[40], name2[40];
    chid chid1, chid2;
    long result;
    unsigned i;

    sprintf(name1, "joh:fred");
    sprintf(name2, "joh:jane");

    for ( i=0u; i < 1000; i++ ) {
        result = ca_context_create(ca_disable_preemptive_callback);
        if (result != ECA_NORMAL) {
            fprintf(stderr, "CA error %s occurred while trying to start channel access.n", ca_message(result));
            return(1);
        }

        result = ca_search(name1, &chid1);
        if (result != ECA_NORMAL) {
            fprintf(stderr, "error: problem doing search for %sn", name1);
            return(1);
        }

        result = ca_search(name2, &chid2);
        if (result != ECA_NORMAL) {
            fprintf(stderr, "error: problem doing search for %sn", name2);
            return(1);
        }

        result = ca_pend_io(10);
        if (result != ECA_NORMAL) {
            fprintf(stderr, "warning: problem doing search for PVsn");
        }

        result = ca_state(chid1);
        if (result != cs_conn) {
            result = ca_put(DBR_STRING, chid1, "0");
            if (result != ECA_NORMAL) {
                fprintf(stderr, "error: problem doing put for %sn", name1);
                return(1);
            }
        }

        result = ca_state(chid2);
        if (result != cs_conn) {
            result = ca_put(DBR_STRING, chid2, "0");
            if (result != ECA_NORMAL) {
                fprintf(stderr, "error: problem doing put for %sn", name2);
                return(1);
            }
        }

        result = ca_pend_io(10);

        result = ca_pend_io(10);
        if (result != ECA_NORMAL) {
            fprintf(stderr, "error: problem doing put for PVsn");
        }

        ca_context_destroy();

        epicsThreadSleep ( 1.0 );
    }
    return(0);
}

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Note that I do have an epicsThreadSleep call in the loop so I am not spinning using all of the CPU - which might prevent threads from exiting.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Is it possible to reproduce the issue on other versions of Linux? For example, on 32 bit Linux?

Revision history for this message
soliday (soliday) wrote :

I just ran the original code a Fedora Core 8 32bit system and saw:

A call to "assert (status == epicsMutexLockOK)" failed in ../../../src/libCom/freeList/freeListLib.c line 122.
EPICS Release EPICS R3.14.11-CVS $$Name: R3-14-2_branch $$ $$Date: 2008/10/27 20:19:12 $$.
Current time Fri Mar 13 2009 13:32:57.160053501.
Please E-mail this message to the author or to <email address hidden>
Calling epicsThreadSuspendSelf()

I also saw:

Segmentation fault

Sometimes it takes a lot of iterations to see an error message and sometimes it doesn't. I built it against the latest R3-14-2_branch available today in CVS. The original reported errors were seen on a Fedora Core 8 64bit system. Currently we only have Fedora Core 8 systems on this subnet but may soon have a Fedora Core 10 system that I can test on.

Revision history for this message
Andrew Johnson (anj) wrote :

There are bugs in this test code, even in the version that Jeff posted; many tests for ca_state(chid) != cs_conn should be ca_state(chid) == cs_conn.

Please re-run with them fixed. If it still fails, I would also try running it without the call to ca_context_destroy() at the end to see whether the problem still occurs.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Bob,

I took some time today to look into Mantis 334 again and I think I have a way that you can reproduce the problem.

First the test code has the if statement backward when it checks to see if the PV is connected before putting the new value. But as it turns out the whole section related to putting the value is not needed to show the
  problem. I took your test code and updated it:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cadef.h>
#include <epicsVersion.h>
#include <epicsThread.h>

int main(int argc, char **argv)
{
   char name1[40], name2[40];
   chid chid1, chid2;
   long result;
   unsigned i;

   sprintf(name1, "bob");
   sprintf(name2, "bill");

   for ( i=0u; i < 1000; i++ ) {
     fprintf(stdout, "Step %u\n", i);
     result = ca_context_create(ca_disable_preemptive_callback);
     if (result != ECA_NORMAL) {
       fprintf(stderr, "CA error %s occurred while trying to start channel access.n", ca_message(result));
       return(1);
     }

     result = ca_search(name1, &chid1);
     if (result != ECA_NORMAL) {
       fprintf(stderr, "error: problem doing search for %sn", name1);
       return(1);
     }

     result = ca_search(name2, &chid2);
     if (result != ECA_NORMAL) {
       fprintf(stderr, "error: problem doing search for %sn", name2);
       return(1);
     }

     result = ca_pend_io(10);
     if (result != ECA_NORMAL) {
       fprintf(stderr, "warning: problem doing search for PVsn");
     }
     ca_context_destroy();

     epicsThreadSleep ( 0.25 );
   }
   return(0);
}

What is really needed to get this to crash is if the two PVs come from two different IOCs. To make it simple I even verified that excas can be used. Just run one excas on one computer and a slightly modified one on another. For the modified one I just changed the PV bill to bob. That's it, run the program listed above and I think you should see problem. If not then I guess it might be related to our version of Linux.

--Bob

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Andrew:

Jeff: Significant information, I only get crashes when running the test client on SMP systems. On machines with a single CPU it runs all 1000 iterations with no crashes. I have found this on both linux and solaris, although interestingly a solaris-x86 box with 2 CPUs crashes much less often than a solaris-sparc box does.

Here's the back-trace from the failing solaris-x86 thread:

@1946 (l@1946) terminated by signal SEGV (no mapping at the fault address)
0xffffffffffffffff: <bad address 0xffffffffffffffff>
(dbx) threads
      t@1 a l@1 ?() LWP suspended in ___nanosleep()
      t@9 a l@9 start_routine() sleep on 0x8083058 in __lwp_park()
o> t@1946 a l@1946 start_routine() signal SIGSEGV in ?()
(dbx) where
current thread: t@1946
  [1] 0x814c8e8(0x80bae38, 0xfee3d820, 0xfe70cbb8, 0xfee1ff2a), at 0x814c8e8
  [2] hostNameCache::~hostNameCache(0x80b01cc, 0x80be230, 0xfe70cbe8, 0xfee0d18a), at 0xfee1ff4b
  [3] tcpiiu::~tcpiiu(0x80b01a0, 0x0, 0x244e26ff, 0xfee104dc), at 0xfee0d2ed
  [4] __SLIP.DELETER__E(0x80b01a0, 0x0, 0xfe70cc50, 0x808bcb0), at 0xfee104ee
  [5] cac::destroyIIU(0x81482a8, 0x80b01a0, 0x403e0000, 0xfee0a5c1), at
0xfedf4127
  [6] tcpSendThread::run(0x80b0288, 0x0, 0x0, 0xfed8e861), at 0xfee0acf5
  [7] epicsThreadCallEntryPoint(0x80b028c, 0x81442b8, 0x0, 0xfed957ea), at 0xfed8e8ad
  [8] start_routine(0x81442b8), at 0xfed958d4
  [9] _thr_setup(0xfe9d3a00), at 0xfef259a9
  [10] _lwp_start(0x80bae38, 0xfee3d820, 0xfe70cbb8, 0xfee1ff2a, 0x80b03b0, 0xfee3d820), at 0xfef25c90

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Bob:

I just tested this idea on a node that I have been seeing the problems on. When I force it to run on one CPU core with "taskset -c 0 ./cadebug"
I no longer see the crashes.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Andrew:

On Monday 20 April 2009 14:36:29 Bob Soliday wrote:
> I just tested this idea on a node that I have been seeing the problems
> on. When I force it to run on one CPU core with "taskset -c 0 ./cadebug"
> I no longer see the crashes.

I still see both the freeListLib thread suspension and the crash with "terminate called ... ::invalidMutex() ... Abort" on FC-6 when running on 1 CPU, but no crashes on FC-8.

a
I did get another different error message from a 2-cpu crash on FC-8:

Step 13
pure virtual method called
terminate called without an active exception
Abort (core dumped)

Core was generated by `./crash'.
Program terminated with signal 6, Aborted.
#0 0x0000003a15230ec5 in raise () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 4 (process 12257):
#0 0x0000003a15e0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1 0x0000000006721881 in epicsEventWait (pevent=0x1717620)
    at ../../../src/libCom/osi/os/posix/osdEvent.c:77
#2 0x000000000670a85b in errlogThread ()
    at ../../../src/libCom/error/errlog.c:489
#3 0x00000000067202f3 in start_routine (arg=<value optimized out>)
    at ../../../src/libCom/osi/os/posix/osdThread.c:283
#4 0x0000003a15e06407 in start_thread () from /lib64/libpthread.so.0
#5 0x0000003a152d4b0d in clone () from /lib64/libc.so.6

Thread 3 (process 12368):
#0 0x0000003a15e06482 in start_thread () from /lib64/libpthread.so.0
#1 0x0000003a152d4b0d in clone () from /lib64/libc.so.6

Thread 2 (process 12247):
#0 0x0000003a1529ac61 in nanosleep () from /lib64/libc.so.6
#1 0x000000000671fbda in epicsThreadSleep (seconds=250000000)
    at ../../../src/libCom/osi/os/posix/osdThread.c:573
#2 0x00000000004009d3 in main (argc=<value optimized out>,
    argv=<value optimized out>) at ../crash.c:44

Thread 1 (process 12374):
#0 0x0000003a15230ec5 in raise () from /lib64/libc.so.6
#1 0x0000003a15232970 in abort () from /lib64/libc.so.6
#2 0x0000003a1b6becb4 in __gnu_cxx::__verbose_terminate_handler ()
   from /usr/lib64/libstdc++.so.6
#3 0x0000003a1b6bce76 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x0000003a1b6bcea3 in std::terminate () from /usr/lib64/libstdc++.so.6
#5 0x0000003a1b6bd35f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6 0x0000000000e78230 in cac::destroyIIU (this=0x1716ab0, iiu=@0x1749628)
    at ../cac.h:423
#7 0x0000000000e8fe7c in tcpSendThread::run (this=0x1749768)
    at ../tcpiiu.cpp:228
#8 0x000000000671a13d in epicsThreadCallEntryPoint (
    pPvt=<value optimized out>) at ../../../src/libCom/osi/epicsThread.cpp:42
#9 0x00000000067202f3 in start_routine (arg=<value optimized out>)
    at ../../../src/libCom/osi/os/posix/osdThread.c:283
#10 0x0000003a15e06407 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003a152d4b0d in clone () from /lib64/libc.so.6

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I ran this test code on our 8 CPU red hat system and I see symptoms of Linux hanging up due to it running out of threads at about step 198. In the debugger all of the threads look like this.

(gdb) #0 0x006b1939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) #1 0x006adb21 in _L_mutex_lock_949 () from /lib/tls/libpthread.so.0
(gdb) #2 0x00000000 in ?? ()

edited on: 2009-07-01 18:08

Revision history for this message
Jeff Hill (johill-lanl) wrote :

BTW, instead of moding excas, one can use the -pjoh1: option with excas and then search for PV's named joh1:bill

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I just now got it to reproduce on windows vista two cpu system.

> Com.dll!epicsMutexLock(epicsMutexParm * pmutexNode=0xdddddddd) Line 145 + 0x3 bytes C++
  Com.dll!freeListFree(void * pvt=0x00dcc978, void * pmem=0x00d05750) Line 122 + 0xc bytes C
  ca.dll!cac::releaseSmallBufferTCP(char * pBuf=0x00d05750) Line 388 + 0x16 bytes C++
  ca.dll!tcpiiu::~tcpiiu() Line 1026 C++
  ca.dll!tcpiiu::`scalar deleting destructor'() + 0x16 bytes C++
  ca.dll!cac::destroyIIU(tcpiiu & iiu={...}) Line 1152 + 0x10 bytes C++
  ca.dll!tcpSendThread::run() Line 229 C++
  Com.dll!epicsThreadCallEntryPoint(void * pPvt=0x00d0ae24) Line 67 + 0x13 bytes C++
  Com.dll!epicsWin32ThreadEntry(void * lpParameter=0x00d00640) Line 498 + 0x11 bytes C
  msvcr80d.dll!_callthreadstartex() Line 348 + 0xf bytes C
  msvcr80d.dll!_threadstartex(void * ptd=0x00ce96f8) Line 331 C
  kernel32.dll!75e8e4a5()
  [Frames below may be incorrect and/or missing, no symbols loaded for kernel32.dll]
  ntdll.dll!7769cfed()
  ntdll.dll!7769d1ff()

Revision history for this message
Jeff Hill (johill-lanl) wrote :

ntdll.dll!7764039d()
  [Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]
  ntdll.dll!7765db0b()
  ntdll.dll!7764039d()
  kernel32.dll!75e11270()
> Com.dll!epicsMutexOsdLock(epicsMutexOSD * pSem=0x00cc0da0) Line 126 + 0xc bytes C
  Com.dll!epicsMutexLock(epicsMutexParm * pmutexNode=0x00cd2a68) Line 145 + 0xc bytes C++
  Com.dll!epicsMutex::lock() Line 238 + 0xb bytes C++
  ca.dll!epicsGuard<epicsMutex>::epicsGuard<epicsMutex>(epicsMutex & mutexIn={...}) Line 71 + 0xd bytes C++
  ca.dll!cac::~cac() Line 242 + 0x12 bytes C++
  ca.dll!cac::`scalar deleting destructor'() + 0x16 bytes C++
  ca.dll!epics_auto_ptr<cacContext,0>::destroyTarget() Line 52 + 0x24 bytes C++
  ca.dll!epics_auto_ptr<cacContext,0>::reset(cacContext * pIn=0x00000000) Line 112 C++
  ca.dll!ca_client_context::~ca_client_context() Line 189 C++
  ca.dll!ca_client_context::`scalar deleting destructor'() + 0x16 bytes C++
  ca.dll!ca_context_destroy() Line 251 + 0x22 bytes C++
  caTest.exe!main(int argc=1, char * * argv=0x00cc0d10) Line 44 + 0x8 bytes C
  caTest.exe!__tmainCRTStartup() Line 597 + 0x19 bytes C
  caTest.exe!mainCRTStartup() Line 414 C
  kernel32.dll!75e8e4a5()
  ntdll.dll!7769cfed()
  ntdll.dll!7769d1ff()

edited on: 2009-07-01 18:28

Revision history for this message
Jeff Hill (johill-lanl) wrote :

inline void cac::releaseSmallBufferTCP ( char *pBuf )
{
    // this locks internally
    freeListFree ( this->tcpSmallRecvBufFreeList, pBuf );
}

is there stack corruption?

in the prev frame we have this

  this->tcpSmallRecvBufFreeList 0x00d02a80 void *

and in the next frame we have this?

> Com.dll!freeListFree(void * pvt=0x00dcc978, void * pmem=0x00d05750) Line 122 + 0xc bytes C

Revision history for this message
Jeff Hill (johill-lanl) wrote :

an unusual feature of this test program is that the channels are not destroyed when the user is done using them.

Exactly 2000 channels are created by this test code.

Revision history for this message
Jeff Hill (johill-lanl) wrote :
Download full text (4.2 KiB)

I have a better understanding of what is occurring now, and protections already in the code should prevent the observed behavior, but somehow they don’t.

So apparently I am at a critical juncture where I need to understand better cache coherency enforcement on SMP systems. If you can explain why the patch below works that would be great because the observed behavior changes are currently a bit disturbing.

The summary of the patch is as flows (diffs are below).

Before:
I had before a list of circuits and a flag that said a circuit rundown was in progress. I would not allow the client context to finish its destructor until the circuit count (circuit list count) was zero and also a circuit rundown in progress bool flag was false. Of course both the count and the flag were only accessed when the lock (guard) is applied.

After:
The patch eliminates the circuit rundown in progress flag, and replaces it with an unsigned iiuExistenceCount. I will not allow the client context to finish its destructor until iiuExistenceCount is zero. The iiuExistenceCount is incremented when a circuit is installed into the client context, and decremented only after the circuit is fully uninstalled. Of course iiuExistenceCount is only accessed when the lock (guard) is applied.

Note that epicsGuard < epicsMutex > locks the mutex in its constructor and it unlocks the mutex in its destructor, and epicsGuardRelease < epicsMutex > unlocks the mutex in its constructor and locks the mutex in its destructor.

The patch has run through approximately 6k iterations w/o crashing. Before it was crashing on average at around 500 iterations.

Jeff

cvs diff -wb (in directory C:\hill\R3.14.dll_hell_fix\epics\base\src\ca\)
? Edit1.TXT
? RCa01512
? tst.cpp
Index: cac.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/cac.cpp,v
retrieving revision 1.180.2.39
diff -u -b -w -b -r1.180.2.39 cac.cpp
--- cac.cpp 11 Feb 2009 00:50:27 -0000 1.180.2.39
+++ cac.cpp 7 Jul 2009 16:31:53 -0000
@@ -143,7 +143,7 @@
     initializingThreadsPriority ( epicsThreadGetPrioritySelf() ),
     maxRecvBytesTCP ( MAX_TCP ),
     beaconAnomalyCount ( 0u ),
- iiuUninstallInProgress ( false )
+ iiuExistenceCount ( 0u )
 {
  if ( ! osiSockAttach () ) {
         throwWithLocation ( caErrorCode (ECA_INTERNAL) );
@@ -263,7 +263,7 @@
     //
     {
         epicsGuard < epicsMutex > guard ( this->mutex );
- while ( this->circuitList.count() > 0 || this->iiuUninstallInProgress ) {
+ while ( this->iiuExistenceCount > 0 ) {
             epicsGuardRelease < epicsMutex > unguard ( guard );
             this->iiuUninstall.wait ();
         }
@@ -541,6 +541,7 @@
             }
             this->serverTable.add ( *pnewiiu );
             this->circuitList.add ( *pnewiiu );
+ this->iiuExistenceCount++;
             pBHE->registerIIU ( guard, *pnewiiu );
             piiu = pnewiiu.release ();
             newIIU = true;
@@ -1121,7 +1122,6 @@
     {
         callbackManager mgr ( this->notify, this->cbMutex );
         epicsGuard < epicsMutex > guard ( this->mutex );
- this->iiuUninstallInProgress = true;
 ...

Read more...

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Duuuu!

Of course the bug occurs when two clients shutdown at once and both of them manipulate iiuUninstallInProgress at the same time causing it to be set false too early.

Thanks for listening – sometimes explaining your problem causes the solution to be apparent.

It’s also nice to rest assured that SMP cache coherency and mutex locking interact correctly!

Revision history for this message
Jeff Hill (johill-lanl) wrote :

fixed in R3.14.11

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.11 released.

summary: - crash whil exiting ca client
+ crash while exiting ca client
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.