SEGV from simple CA client during context destroy
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
EPICS Base | Invalid | Medium | Jeff Hill |
Bug Description
From Andrew:
I'm pretty sure this is not related to the problem I've been
investigating that Bob Soliday found, but the test program that
demonstrates that particular problem has several times now crashed with
a segfault on Solaris (occurring maybe 0.25% of the time or less).
This was built against R3-14-8-2 with debug turned on and optimization
off. I have saved the executable and core file which I used to get the
information below, so I should be able to answer more questions if you
have any. I haven't seen this crash from the 3.14.7 version, but I
haven't been running that as much so I don't know whether it exhibits
this problem or not.
Here's the backtrace for the faulting thread:
t@12 (l@12) terminated by signal SEGV (no mapping at the fault address)
Current function is epicsMutexLock
  117       epicsMutexLockS
(dbx) where
current thread: t@12
=>[1] epicsMutexLock(
[2] epicsMutex:
[3] epicsGuard<
CLASS), line 68 in "epicsGuard.h"
[4] tsFreeList<
pCadaver = 0x147b98), line 190 in "tsFreeList.h"
[5] timer::destroy(this = 0x147b98), line 54 in "timer.cpp"
[6] tcpSendWatchdog
"tcpSendWatchdo
[7] tcpiiu:
[8] __SLIP.
at 0x844a8
[9] cac::destroyIIU
[10] tcpSendThread:
[11] epicsThreadCall
"epicsThread.cpp"
[12] start_routine(arg = 0x1f6038), line 320 in "osdThread.c"
There are a total of 6 threads running, none explicitly started by the
test program so they're probably all CA threads:
(dbx) threads
t@1 a l@1 ?() LWP suspended in _munmap()
t@2 a l@2 start_routine() sleep on 0xff078a80 in __lwp_park()
t@3 a l@3 start_routine() sleep on 0xff078a80 in __lwp_park()
t@8 a l@8 start_routine() sleep on 0xff078a80 in __lwp_park()
o> t@12 a l@12 start_routine() signal SIGSEGV in epicsMutexLock()
t@16 a l@16 start_routine() sleep on 0xff078a80 in __lwp_park()
The main thread t@1 is inside a ca_task_exit call where the program ends
up if it finishes normally (i.e. not demonstrating Bob's search failure
issue):
(dbx) where t@1
current thread: t@1
=>[1] _munmap(0x0, 0x80000, 0xff078a80, 0xff078000, 0x0, 0x0), at 0xff31dd0c
[2] trim_stack_
at 0xff058100
[3] find_stack(
[4] _thrp_create(0x0, 0xff200000, 0xa6eb0, 0x12ebb0, 0xc0,
0xffbfbdc4), at 0xff058780
[5] _ti_pthread_
0x12ebb1), at 0xff05c074
[6] epicsThreadCrea
stackSize = 131072U, funptr = 0x98e20 =
&`debugEvans`
"osdThread.c"
[7] errlogInitPvt(arg = 0xffbfbfdc), line 413 in "errlog.c"
[8] epicsThreadOnce
&`debugEvans`
375 in "osdThread.c"
[9] errlogInit(bufsize = 0), line 435 in "errlog.c"
[10] errlogFlush(), line 445 in "errlog.c"
[11] cac::~cac(this = 0x12e450), line 289 in "cac.cpp"
[12] __SLIP.
0xffbfeb9b, 0xffbfc184), at 0x69d30
[13]
epics_auto_
0x12e3ac), line 52 in "epicsMemory.h"
[14] epics_auto_
0x12e3ac, pIn = (nil)), line 111 in "epicsMemory.h"
[15] ca_client_
in "ca_client_
[16] __SLIP.
0xffb35f3a), at 0x58010
[17] ca_context_
[18] ca_task_exit(), line 265 in "access.cpp"
[19] main(argc = 2, argv = 0xffbff364), line 230 in "debugTest.c"
The other 4 threads all look like they've closed down properly:
(dbx) where t@2
current thread: t@2
=>[1] __lwp_park(0x4, 0x0, 0x0, 0x0, 0x1, 0x0), at 0xff065998
[2] mutex_lock_
0xff06166c
[3] lmutex_
0xff062824
[4] _thrp_exit(
[5] _t_cancel(
0xff057d18
[6] _thr_exit_
(dbx) where t@3
current thread: t@3
=>[1] __lwp_park(0x4, 0x0, 0x0, 0x0, 0x1, 0x0), at 0xff065998
[2] mutex_lock_
0xff06166c
[3] lmutex_
0xff062824
[4] _thrp_exit(
[5] _t_cancel(
0xff057d18
[6] _thr_exit_
(dbx) where t@8
current thread: t@8
=>[1] __lwp_park(0x4, 0x0, 0x0, 0x0, 0x1, 0x0), at 0xff065998
[2] mutex_lock_
0xff06166c
[3] lmutex_
0xff062824
[4] _thrp_exit(
[5] _t_cancel(
0xff057d18
[6] _thr_exit_
(dbx) where t@16
current thread: t@16
=>[1] __lwp_park(0x4, 0x0, 0x0, 0x0, 0x1, 0x0), at 0xff065998
[2] mutex_lock_
0xff06166c
[3] lmutex_
0xff062824
[4] _thrp_exit(
[5] _t_cancel(
0xff057d18
[6] _thr_exit_
This issue is not a major 'drop everything' one, since this is the only
place I've seen this particular problem so far, but it's definitely
something that should get fixed eventually.
Additional information:
The CA regression tests already include the attached function
(which tests a context destroy when a channel is connected).
So the only differences I can see are:
O Andrew's test has gets in it
O Andrew's test has more channels
O Andrew's test connects to more IOCs
O Andrew's test has a higher repetition count
O The failure Andrew has reported occurs on Solaris
Andrew: Do you have preemptive POSIX scheduling enabled?
I have also attached an upgraded version of the original test, which
does not reproduce the problem. This test addresses the following
issues:
O higher repetition count
O more channels
O addition of gets
I am so far unable to reproduce with the upgraded test.
Jeff
Original:
void verifyTearDownW
    enum ca_preemptive_
    unsigned interestLevel )
{
    unsigned i;
    showProgres
    for ( i = 0u; i < 10; i++ ) {
        chid chan;
        int status;
        status = ca_create_channel ( pName, 0, 0, 0, & chan );
        SEVCHK ( status, "immediate tear down channel create failed" );
        status = ca_pend_io ( timeoutToPendIO );
        SEVCHK ( status, "immediate tear down channel connect failed" );
        assert ( status == ECA_NORMAL );
    }
    ca_
    showProgressEnd ( interestLevel );
}
Upgraded:
void verifyTearDownW
    enum ca_preemptive_
    unsigned interestLevel )
{
    static const unsigned chanCount = 100;
    static const unsigned loopCount = 10000;
    chid *pChans;
    double *pValues;
    unsigned i, j;
    pChans = (chid *) calloc ( chanCount, sizeof ( *pChans ) );
    pValues = (double *) calloc ( chanCount, sizeof ( *pValues ) );
    assert ( pChans && pValues );
    showProgres
    for ( i = 0u; i < loopCount; i++ ) {
        int status;
        for ( j = 0; j < chanCount; j++ ) {
            status = ca_create_channel ( pName, 0, 0, 0, & pChans[j] );
            SEVCHK ( status, "immediate tear down channel create failed" );
        }
        status = ca_pend_io ( timeoutToPendIO );
        SEVCHK ( status, "immediate tear down channel connect failed" );
        assert ( status == ECA_NORMAL );
        for ( j = 0; j < chanCount; j++ ) {
            status = ca_get ( DBR_DOUBLE, pChans[j], &pValues[j] );
            SEVCHK ( status, "immediate tear down channel get failed" );
        }
        status = ca_pend_io ( timeoutToPendIO );
        SEVCHK ( status, "immediate tear down get pend io failed" );
    }
    ca_
    free ( pChans );
    free ( pValues );
    showProgressEnd ( interestLevel );
}
Original Mantis Bug: mantis-237
http://
Jeff Hill wrote:
>
> Andrew: Do you have preemptive POSIX scheduling enabled?
% cd configure
% grep PRIORITY * os/*
CONFIG_SITE:USE_POSIX_THREAD_PRIORITY_SCHEDULING = NO
Thus we have preemptive scheduling, but it's not priority-based.