CA client lib failure when IOC 'quit' (exit?) and immediate recreation of context

Bug #541336 reported by Jeff Hill
Affects: EPICS Base
Status: Fix Released
Importance: Wishlist
Assigned to: Jeff Hill

Bug Description

From: Victor F E Pucknell
Sent: 16 January 2008 15:01
To: Duggan, AJ (Andrew)
Cc: Owens, PH (Peter); Letts, SC (Simon)
Subject: EPICS problem

Here is a simple test using EPICS ca_get.
Normally it runs in a loop, calling ca_get every 2 seconds.
If you "quit" the EPICS server, I get a serious failure and the test program is left stuck somewhere in the CA library.

Here is the output from the test program

#############

ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "nndhcp069.dl.ac.uk:5064"
    Source File: ../cac.cpp line 1126
    Current Time: Wed Jan 16 2008 14:46:01.574167550
..................................................................
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "op=0, channel=MEIS-B-A-201:STA, type=DBR_SHORT, count=1, ctx="nndhcp069.dl.ac.uk:5064""
    Source File: ../getCopy.cpp line 86
    Current Time: Wed Jan 16 2008 14:46:01.574311751
..................................................................
ca_pend_io failed User specified timeout on IO operation expired
ca_get failed Virtual circuit disconnect
EPICS server failure - resetting

A call to "assert (_pTargetMutex == & mutexToVerify)" failed in ../../../include/epicsGuard.h line 84.
EPICS Release EPICS R3.14.9-3.14.9 $R3-14-9$ $2007/02/05 16:31:45$.
Current time Wed Jan 16 2008 14:46:08.582331945.
Please E-mail this message to Jeff Hill <email address hidden> or to <email address hidden>
Calling epicsThreadSuspendSelf()

#####################

However, if I kill the EPICS server using Control+C, the application does seem to recover and resumes once I restart the EPICS server.

##################

ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "nndhcp069.dl.ac.uk:5064"
    Source File: ../cac.cpp line 1126
    Current Time: Wed Jan 16 2008 14:50:41.049109570
..................................................................
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "op=0, channel=MEIS-B-A-201:STA, type=DBR_SHORT, count=1, ctx="nndhcp069.dl.ac.uk:5064""
    Source File: ../getCopy.cpp line 86
    Current Time: Wed Jan 16 2008 14:50:41.049255811
..................................................................
ca_pend_io failed User specified timeout on IO operation expired
ca_get failed Virtual circuit disconnect
EPICS server failure - resetting
Attempting start....
calling ca_context_create
calling ca_create_channel
calling ca_pend_io
ca_pend_io failed User specified timeout on IO operation expired
Attempting start....
calling ca_context_create
calling ca_create_channel
calling ca_pend_io
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0
ca_get returned with rc=0x1
ca_pend_io returned with rc=0x1: value=0x0

######################

This was not an exhaustive trial. I tried each case only a couple of times.
However, the assert failure and the thread suspend are typical of what we see.

Vic

Additional information:

#include <sys/types.h>
#include <stdlib.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h> /* sleep() */

#include "cadef.h"

#define EPICS_TIMEOUT 5.0

int main(int argc, char *argv[])
{

char PVN[17] = "MEIS-B-A-201:STA\0";

int rc;
chid CID;
double timeout = EPICS_TIMEOUT;
unsigned short value;

START:

  printf("Attempting start....\n");

    sleep(5);

 printf("calling ca_context_create\n");

    rc = ca_context_create(ca_disable_preemptive_callback);
    if (rc != ECA_NORMAL) {
        printf("ca_context_create failed %s\n", ca_message(rc));
        goto START;
    }

 printf("calling ca_create_channel\n");

    rc = ca_create_channel(&PVN[0], NULL, NULL, 10, &CID);
    if (rc != ECA_NORMAL) {
        printf("ca_create_channel failed %s\n", ca_message(rc));
        (void) ca_context_destroy();
        goto START;
    }

 printf("calling ca_pend_io\n");

    rc = ca_pend_io(timeout);
    if (rc != ECA_NORMAL) {
        printf("ca_pend_io failed %s\n", ca_message(rc));
        goto START;
    }

    for(;;) {

        rc = ca_get(DBR_SHORT, CID, &value);
        if (rc != ECA_NORMAL) {
            printf("ca_get failed %s\n", ca_message(rc));
// if (CA_EXTRACT_SEVERITY(rc) >= CA_K_SEVERE) {
            printf("EPICS server failure - resetting\n");
            (void) ca_context_destroy();
// }
            goto START;
        }

        printf("ca_get returned with rc=0x%x\n", rc);

        rc = ca_pend_io(timeout);
        if (rc != ECA_NORMAL) {
            printf("ca_pend_io failed %s\n", ca_message(rc));
        } else {
            printf("ca_pend_io returned with rc=0x%x: value=0x%x\n", rc, value);
        }

        sleep(2);

    }

    exit(0);
}
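
The test program prints raw status codes (rc=0x1 for ECA_NORMAL) and has a commented-out CA_EXTRACT_SEVERITY check. As a rough illustration of how such a code splits into a severity and a message number, here is a standalone sketch that does not include cadef.h. The mask and shift values are assumptions written from memory of caerr.h, and decodeCaStatus and isSevere are hypothetical helper names, so check them against the headers of your own EPICS release before relying on them.

```cpp
#include <cassert>

// ASSUMED values, recalled from EPICS caerr.h; verify against your headers.
constexpr unsigned kSeverityMask = 0x7;     // assumed CA_M_SEVERITY
constexpr unsigned kMsgNoMask    = 0xFFF8;  // assumed CA_M_MSG_NO
constexpr unsigned kMsgNoShift   = 3;       // assumed CA_V_MSG_NO
constexpr unsigned kSevere       = 4;       // assumed CA_K_SEVERE

// Hypothetical helper type: the two fields packed into a CA status code.
struct CaStatus {
    unsigned severity;  // low bits of the status code
    unsigned msgNo;     // message number stored above the severity bits
};

// Split a raw status code into severity and message number.
constexpr CaStatus decodeCaStatus(unsigned rc) {
    return CaStatus{ rc & kSeverityMask, (rc & kMsgNoMask) >> kMsgNoShift };
}

// Same comparison as the commented-out check in the test program.
constexpr bool isSevere(unsigned rc) {
    return decodeCaStatus(rc).severity >= kSevere;
}
```

Under these assumed masks, the rc=0x1 seen throughout the logs decodes to severity 1 and message number 0, and isSevere(0x1) is false.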

Original Mantis Bug: mantis-306
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=306

Tags: ca 3.14
Jeff Hill (johill-lanl) wrote :

I snagged a stack trace this morning

#0 0x001183ad in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#1 0x0036a006 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#2 0x006471a2 in epicsEventWait (pevent=0x99850a0) at ../../../src/libCom/osi/os/posix/osdEvent.c:77
#3 0x00644f30 in epicsThreadSuspendSelf () at ../../../src/libCom/osi/os/posix/osdThread.c:486
#4 0x00643967 in epicsAssert (pFile=0x48b2f0 "../../../include/epicsGuard.h", line=84, pExp=0x48b4a4 "_pTargetMutex == & mutexToVerify", pAuthorName=0x653e90 "Calling epicsThreadSuspendSelf()\n") at ../../../src/libCom/osi/os/default/osdAssert.c:71
#5 0x0046fb94 in nciu::setServerAddressUnknown (this=0x99a6198, newiiu=@0xfffffffc, guard=@0x1) at ../../../include/epicsGuard.h:84
#6 0x00470828 in nciu::serviceShutdownNotify (this=0x99a6198, callbackControlGuard=@0xbfffadc0, mutualExclusionGuard=@0xbfffadb0) at ../nciu.cpp:577
#7 0x0046ea3a in disconnectGovernorTimer::shutdown (this=0x99a48d8, cbGuard=@0xbfffadc0, guard=@0xbfffadb0) at ../disconnectGovernorTimer.cpp:61
#8 0x004732c1 in udpiiu::shutdown (this=0x9994470, cbGuard=@0xbfffadc0, guard=@0xbfffadb0) at ../udpiiu.cpp:286
#9 0x00460a6a in ~cac (this=0x998a790) at ../cac.cpp:243
#10 0x0047ffc6 in ~ca_client_context (this=0x998af90) at ../../../include/epicsMemory.h:55
#11 0x004666c1 in ca_context_destroy () at ../access.cpp:251
#12 0x08048812 in main (argc=1, argv=0xbfffaf04) at ../test.c:41
(gdb)

edited on: 2008-09-25 09:43

Jeff Hill (johill-lanl) wrote :

I committed this fix

cvs diff -wb -- disconnectGovernorTimer.cpp (in directory C:\hill\R3.14.dll_hell_fix\epics\base\src\ca\)
Index: disconnectGovernorTimer.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/disconnectGovernorTimer.cpp,v
retrieving revision 1.1.2.6
diff -u -b -w -b -r1.1.2.6 disconnectGovernorTimer.cpp
--- disconnectGovernorTimer.cpp 4 Nov 2005 15:54:34 -0000 1.1.2.6
+++ disconnectGovernorTimer.cpp 25 Sep 2008 15:53:05 -0000
@@ -50,11 +50,13 @@
     epicsGuard < epicsMutex > & cbGuard,
     epicsGuard < epicsMutex > & guard )
 {
+ {
     epicsGuardRelease < epicsMutex > unguard ( guard );
     {
         epicsGuardRelease < epicsMutex > cbUnguard ( cbGuard );
         this->timer.cancel ();
     }
+ }
     while ( nciu * pChan = this->chanList.get () ) {
         pChan->channelNode::listMember =
             channelNode::cs_none;
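
Reading the diff together with the stack trace, the change appears to limit the lifetime of the guard-release object so that the mutex is held again before the channel-list loop runs. The following standalone sketch illustrates that scoping effect with std::mutex. Guard and GuardRelease are hypothetical stand-ins, not the real epicsGuard and epicsGuardRelease classes, and the assumption that a released guard stops reporting its mutex is inferred from the failed assert, not taken from the EPICS source.

```cpp
#include <cassert>
#include <mutex>

class GuardRelease;

// Hypothetical stand-in for epicsGuard: locks on construction, unlocks on destruction.
class Guard {
public:
    explicit Guard(std::mutex &m) : target_(&m) { target_->lock(); }
    ~Guard() { if (target_) target_->unlock(); }
    Guard(const Guard &) = delete;
    Guard &operator=(const Guard &) = delete;
    // Analogue of the failing check: does this guard currently hold `m`?
    bool holds(const std::mutex &m) const { return target_ == &m; }
private:
    friend class GuardRelease;
    std::mutex *target_;
};

// Hypothetical stand-in for epicsGuardRelease: unlocks for its own lifetime,
// then re-locks and hands the mutex back to the guard.
class GuardRelease {
public:
    explicit GuardRelease(Guard &g) : guard_(g), target_(g.target_) {
        guard_.target_ = nullptr;  // assumption: a released guard reports no mutex
        target_->unlock();
    }
    ~GuardRelease() {
        target_->lock();
        guard_.target_ = target_;
    }
    GuardRelease(const GuardRelease &) = delete;
    GuardRelease &operator=(const GuardRelease &) = delete;
private:
    Guard &guard_;
    std::mutex *target_;
};

// Shape before the fix: the release object lives until the function returns,
// so code after the cancel step still runs with the guard released.
bool holdsDuringLoopBeforeFix(std::mutex &m) {
    Guard guard(m);
    GuardRelease unguard(guard);
    // ... the timer cancel would happen here ...
    return guard.holds(m);  // the channel-list loop would run at this point
}

// Shape after the fix: the extra braces end the release before the loop.
bool holdsDuringLoopAfterFix(std::mutex &m) {
    Guard guard(m);
    {
        GuardRelease unguard(guard);
        // ... the timer cancel would happen here ...
    }
    return guard.holds(m);  // the channel-list loop would run at this point
}
```

In this sketch holdsDuringLoopBeforeFix returns false and holdsDuringLoopAfterFix returns true, which is consistent with the mutex-identity assert firing in nciu::setServerAddressUnknown only in the unpatched code.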

Jeff Hill (johill-lanl) wrote :

fixed in R3.14.10

Andrew Johnson (anj) wrote :

R3.14.10 released.
