SEGV from simple CA client during context destroy

Bug #541272 reported by Jeff Hill
This bug report is a duplicate of: Bug #541362: crash while exiting ca client.
This bug affects 1 person
Affects      Status    Importance   Assigned to   Milestone
EPICS Base   Invalid   Medium       Jeff Hill

Bug Description

From Andrew:

I'm pretty sure this is not related to the problem I've been
investigating that Bob Soliday found, but the test program that
demonstrates that particular problem has several times now crashed with
a segfault on Solaris (occurring maybe 0.25% of the time or less).

This was built against R3-14-8-2 with debug turned on and optimization
off. I have saved the executable and core file which I used to get the
information below, so I should be able to answer more questions if you
have any. I haven't seen this crash from the 3.14.7 version, but I
haven't been running that as much so I don't know whether it exhibits
this problem or not.

Here's the backtrace for the faulting thread:

t@12 (l@12) terminated by signal SEGV (no mapping at the fault address)
Current function is epicsMutexLock
   117 epicsMutexLockStatus status =
(dbx) where
current thread: t@12
=>[1] epicsMutexLock(pmutexNode = (nil)), line 117 in "epicsMutex.cpp"
   [2] epicsMutex::lock(this = 0x12ec60), line 211 in "epicsMutex.cpp"
   [3] epicsGuard<epicsMutex>::epicsGuard(this = 0xfea6f938, mutexIn =
CLASS), line 68 in "epicsGuard.h"
   [4] tsFreeList<timer,32U,epicsMutex>::release(this = 0x12ec60,
pCadaver = 0x147b98), line 190 in "tsFreeList.h"
   [5] timer::destroy(this = 0x147b98), line 54 in "timer.cpp"
   [6] tcpSendWatchdog::~tcpSendWatchdog(this = 0x15dae8), line 40 in
"tcpSendWatchdog.cpp"
   [7] tcpiiu::~tcpiiu(this = 0x15d9b0), line 1027 in "tcpiiu.cpp"
   [8] __SLIP.DELETER__C(0x15d9b0, 0x0, 0xfea6fbb8, 0x1372d8, 0x0, 0x0),
at 0x844a8
   [9] cac::destroyIIU(this = 0x12e450, iiu = CLASS), line 1166 in "cac.cpp"
   [10] tcpSendThread::run(this = 0x15da98), line 228 in "tcpiiu.cpp"
   [11] epicsThreadCallEntryPoint(pPvt = 0x15da9c), line 59 in
"epicsThread.cpp"
   [12] start_routine(arg = 0x1f6038), line 320 in "osdThread.c"

There are a total of 6 threads running, none explicitly started by the
test program so they're probably all CA threads:

(dbx) threads
       t@1 a l@1 ?() LWP suspended in _munmap()
       t@2 a l@2 start_routine() sleep on 0xff078a80 in __lwp_park()
       t@3 a l@3 start_routine() sleep on 0xff078a80 in __lwp_park()
       t@8 a l@8 start_routine() sleep on 0xff078a80 in __lwp_park()
o> t@12 a l@12 start_routine() signal SIGSEGV in epicsMutexLock()
      t@16 a l@16 start_routine() sleep on 0xff078a80 in __lwp_park()

The main thread t@1 is inside a ca_task_exit call where the program ends
up if it finishes normally (i.e. not demonstrating Bob's search failure
issue):

(dbx) where t@1
current thread: t@1
=>[1] _munmap(0x0, 0x80000, 0xff078a80, 0xff078000, 0x0, 0x0), at 0xff31dd0c
   [2] trim_stack_cache(0x9, 0xa, 0x2000, 0xff33c000, 0x0, 0xff040229),
at 0xff058100
   [3] find_stack(0xff079480, 0xff078a80, 0x0, 0x0, 0x0, 0x0), at 0xff0582f8
   [4] _thrp_create(0x0, 0xff200000, 0xa6eb0, 0x12ebb0, 0xc0,
0xffbfbdc4), at 0xff058780
   [5] _ti_pthread_create(0x12ebb8, 0x12ebbc, 0xa6eb0, 0x12ebb0, 0x0,
0x12ebb1), at 0xff05c074
   [6] epicsThreadCreate(name = 0xc9fe0 "errlog", priority = 10U,
stackSize = 131072U, funptr = 0x98e20 =
&`debugEvans`errlog.c`errlogThread(), parm = (nil)), line 402 in
"osdThread.c"
   [7] errlogInitPvt(arg = 0xffbfbfdc), line 413 in "errlog.c"
   [8] epicsThreadOnceOsd(id = 0xe06e4, func = 0x98800 =
&`debugEvans`errlog.c`errlogInitPvt(void *arg), arg = 0xffbfbfdc), line
375 in "osdThread.c"
   [9] errlogInit(bufsize = 0), line 435 in "errlog.c"
   [10] errlogFlush(), line 445 in "errlog.c"
   [11] cac::~cac(this = 0x12e450), line 289 in "cac.cpp"
   [12] __SLIP.DELETER__F(0x12e450, 0x1, 0xffbfc0e0, 0x148ed8,
0xffbfeb9b, 0xffbfc184), at 0x69d30
   [13]
epics_auto_ptr<cacContext,(epics_auto_ptr_type)0>::destroyTarget(this =
0x12e3ac), line 52 in "epicsMemory.h"
   [14] epics_auto_ptr<cacContext,(epics_auto_ptr_type)0>::reset(this =
0x12e3ac, pIn = (nil)), line 111 in "epicsMemory.h"
   [15] ca_client_context::~ca_client_context(this = 0x12e328), line 185
in "ca_client_context.cpp"
   [16] __SLIP.DELETER__F(0x12e328, 0x1, 0xffbfc2f8, 0x0, 0xffbfecb0,
0xffb35f3a), at 0x58010
   [17] ca_context_destroy(), line 251 in "access.cpp"
   [18] ca_task_exit(), line 265 in "access.cpp"
   [19] main(argc = 2, argv = 0xffbff364), line 230 in "debugTest.c"
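
For reference, the normal exit path that leaves t@1 inside ca_task_exit
looks like the minimal client sketch below. This is hypothetical
illustration code, not Andrew's debugTest.c; the PV name and timeout are
made up, but the calls are the standard CA client API.

/* Hypothetical minimal CA client -- not the actual test program.
 * Links against the EPICS ca and Com libraries. */
#include <stdio.h>
#include "cadef.h"

int main ( void )
{
    chid chan;
    double value;
    int status;

    /* single threaded (non-preemptive callback) client context */
    status = ca_context_create ( ca_disable_preemptive_callback );
    SEVCHK ( status, "ca_context_create failed" );

    status = ca_create_channel ( "somePV", 0, 0, 0, & chan );
    SEVCHK ( status, "ca_create_channel failed" );
    status = ca_pend_io ( 5.0 );
    SEVCHK ( status, "channel did not connect" );

    status = ca_get ( DBR_DOUBLE, chan, & value );
    SEVCHK ( status, "ca_get failed" );
    status = ca_pend_io ( 5.0 );
    SEVCHK ( status, "get did not complete" );

    printf ( "somePV = %g\n", value );

    /* Normal shutdown: ca_task_exit() destroys the client context,
     * which is where thread t@1 is sitting in the trace above. */
    ca_task_exit ();
    return 0;
}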

The other 4 threads all look like they've closed down properly:

(dbx) where t@2
current thread: t@2
=>[1] __lwp_park(0x4, 0x0, 0x0, 0x0, 0x1, 0x0), at 0xff065998
   [2] mutex_lock_queue(0xff078b44, 0x0, 0xff078a80, 0x0, 0x0, 0x0), at
0xff06166c
   [3] lmutex_lock(0xff078a80, 0xff07a074, 0x0, 0x0, 0x0, 0x0), at
0xff062824
   [4] _thrp_exit(0xff200200, 0x0, 0xff255b3c, 0x0, 0x0, 0x0), at 0xff058c14
   [5] _t_cancel(0xfeeaffa0, 0xff07a074, 0x0, 0xff059070, 0x0, 0x0), at
0xff057d18
   [6] _thr_exit_common(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xff059070

(dbx) where t@3
current thread: t@3
=>[1] __lwp_park(0x4, 0x0, 0x0, 0x0, 0x1, 0x0), at 0xff065998
   [2] mutex_lock_queue(0xff078b44, 0x0, 0xff078a80, 0x0, 0x0, 0x0), at
0xff06166c
   [3] lmutex_lock(0xff078a80, 0xff07a074, 0x0, 0x0, 0x0, 0x0), at
0xff062824
   [4] _thrp_exit(0xff200400, 0x0, 0xff255b3c, 0x0, 0x0, 0x0), at 0xff058c14
   [5] _t_cancel(0xfee7ffa0, 0xff07a074, 0x0, 0xff059070, 0x0, 0x0), at
0xff057d18
   [6] _thr_exit_common(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xff059070

(dbx) where t@8
current thread: t@8
=>[1] __lwp_park(0x4, 0x0, 0x0, 0x0, 0x1, 0x0), at 0xff065998
   [2] mutex_lock_queue(0xff078b44, 0x0, 0xff078a80, 0x0, 0x0, 0x0), at
0xff06166c
   [3] lmutex_lock(0xff078a80, 0xff07a074, 0x0, 0x0, 0x0, 0x0), at
0xff062824
   [4] _thrp_exit(0xff200e00, 0x0, 0xff255b3c, 0x0, 0x0, 0x0), at 0xff058c14
   [5] _t_cancel(0xfec6ffa0, 0xff07a074, 0x0, 0xff059070, 0x0, 0x0), at
0xff057d18
   [6] _thr_exit_common(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xff059070

(dbx) where t@16
current thread: t@16
=>[1] __lwp_park(0x4, 0x0, 0x0, 0x0, 0x1, 0x0), at 0xff065998
   [2] mutex_lock_queue(0xff078b44, 0x0, 0xff078a80, 0x0, 0x0, 0x0), at
0xff06166c
   [3] lmutex_lock(0xff078a80, 0xff07a074, 0x0, 0x0, 0x0, 0x0), at
0xff062824
   [4] _thrp_exit(0xff201e00, 0x0, 0xff255b3c, 0x0, 0x0, 0x0), at 0xff058c14
   [5] _t_cancel(0xfe86ffa0, 0xff07a074, 0x0, 0xff059070, 0x0, 0x0), at
0xff057d18
   [6] _thr_exit_common(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xff059070
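
Reading t@12 and t@1 together: the send thread is still returning a
timer to a mutex-guarded free list while the main thread is already
inside context teardown. Whether or not that is the actual cause here,
the pthreads sketch below (invented names, not the EPICS implementation)
shows the general class of teardown-ordering race that can produce this
kind of SEGV; any given run may or may not crash, depending on timing.

/* Hypothetical pthreads sketch of a teardown-ordering race -- invented
 * names, not the actual EPICS code. */
#include <pthread.h>
#include <stdlib.h>

/* stand-in for a mutex-guarded free list (cf. tsFreeList + epicsMutex) */
typedef struct freeList {
    pthread_mutex_t guard;
    void * pHead;
} freeList;

static freeList * pTimerFreeList;  /* shared, owned by the "context" */

/* stand-in for the send thread's cleanup: it returns a node to the
 * free list while shutting itself down */
static void * sendThreadShutdown ( void * pNode )
{
    /* if the main thread has already destroyed pTimerFreeList, this
     * lock touches freed (or null) memory -- a SEGV like t@12's */
    pthread_mutex_lock ( & pTimerFreeList->guard );
    pTimerFreeList->pHead = pNode;          /* return node to the list */
    pthread_mutex_unlock ( & pTimerFreeList->guard );
    return NULL;
}

int main ( void )
{
    pthread_t tid;

    pTimerFreeList = calloc ( 1, sizeof ( *pTimerFreeList ) );
    pthread_mutex_init ( & pTimerFreeList->guard, NULL );

    pthread_create ( & tid, NULL, sendThreadShutdown, malloc ( 32 ) );

    /* stand-in for the context destructor: it tears down the free list
     * without first waiting for the worker thread to finish */
    pthread_mutex_destroy ( & pTimerFreeList->guard );
    free ( pTimerFreeList );
    pTimerFreeList = NULL;

    pthread_join ( tid, NULL );             /* the join happens too late */
    return 0;
}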

This issue is not a major 'drop everything' one as this is the only
place I've seen this particular problem so far, but it's definitely
something that should get fixed eventually.

Additional information:
The CA regression tests already include the attached function
(which tests a context destroy when a channel is connected).

So the only differences I can see are:
O Andrew's test has gets in it
O Andrew's test has more channels
O Andrew's test connects to more IOCs
O Andrew's test has a higher repetition count
O The failure Andrew has reported occurs on Solaris

Andrew: Do you have preemptive POSIX scheduling enabled?

I have also attached an upgraded version of the original test, which
does not reproduce the problem. This test addresses the following
issues:
O higher repetition count
O more channels
O addition of gets

I am so far unable to reproduce with the upgraded test.

Jeff

Original:

void verifyTearDownWhenChannelConnected ( const char * pName,
        enum ca_preemptive_callback_select select,
        unsigned interestLevel )
{
    unsigned i;

    showProgressBegin ( "verifyTearDownWhenChannelConnected", interestLevel );

    for ( i = 0u; i < 10; i++ ) {
        chid chan;
        int status;
        ca_context_create ( select );
        status = ca_create_channel ( pName, 0, 0, 0, & chan );
        SEVCHK ( status, "immediate tear down channel create failed" );
        status = ca_pend_io ( timeoutToPendIO );
        SEVCHK ( status, "immediate tear down channel connect failed" );
        assert ( status == ECA_NORMAL );
        ca_context_destroy ();
    }

    ca_context_create ( select );

    showProgressEnd ( interestLevel );
}

Upgraded:

void verifyTearDownWhenChannelConnected ( const char * pName,
        enum ca_preemptive_callback_select select,
        unsigned interestLevel )
{
    static const unsigned chanCount = 100;
    static const unsigned loopCount = 10000;
    chid *pChans;
    double *pValues;
    unsigned i, j;

    pChans = (chid *) calloc ( chanCount, sizeof ( *pChans ) );
    pValues = (double *) calloc ( chanCount, sizeof ( *pValues ) );
    assert ( pChans && pValues );

    showProgressBegin ( "verifyTearDownWhenChannelConnected", interestLevel );

    for ( i = 0u; i < loopCount; i++ ) {
        int status;
        ca_context_create ( select );

        for ( j = 0; j < chanCount; j++ ) {
            status = ca_create_channel ( pName, 0, 0, 0, & pChans[j] );
            SEVCHK ( status, "immediate tear down channel create failed" );
        }
        status = ca_pend_io ( timeoutToPendIO );
        SEVCHK ( status, "immediate tear down channel connect failed" );
        assert ( status == ECA_NORMAL );

        for ( j = 0; j < chanCount; j++ ) {
            status = ca_get ( DBR_DOUBLE, pChans[j], &pValues[j] );
            SEVCHK ( status, "immediate tear down channel get failed" );
        }

        status = ca_pend_io ( timeoutToPendIO );
        SEVCHK ( status, "immediate tear down get pend io failed" );

        ca_context_destroy ();
    }

    ca_context_create ( select );

    free ( pChans );
    free ( pValues );

    showProgressEnd ( interestLevel );
}

Original Mantis Bug: mantis-237
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=237

Tags: ca 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

Jeff Hill wrote:
>
> Andrew: Do you have preemptive POSIX scheduling enabled?

% cd configure
% grep PRIORITY * os/*
CONFIG_SITE:USE_POSIX_THREAD_PRIORITY_SCHEDULING = NO

Thus we have preemptive scheduling, but it's not priority-based.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I had a very careful look at the code and can identify two
issues with the locking in this part of the code that I don’t
like. One of them would probably never cause trouble, and the
other might very rarely cause a failure, but not the stack trace
that Andrew reported.

Nevertheless, I prepared a fix, tested it with my regression
tests, and committed it to CVS.

Perhaps I will be wrong and your situation will improve.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Andrew reports an occurrence rate of 1 in 400 runs on Solaris. I have not been
able to reproduce this on Windows despite making a test program that connects
to 20 example servers (all running on the same Linux host).

Changing topics:
Bob's no-connect issue was occurring at a rate of 1 in 200 runs connecting
to R3.13 VME IOCs, but I am wondering if his test code fails due to the
client host accumulating too many sockets in the TIME_WAIT state. In any
case that's what I see here when I try to reproduce the SEGV issue on Windows.
I also see the Windows host running out of network buffers.

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.11 released.
