ca_context_destroy() slow when circuit unresponsive

Bug #541169 reported by Jeff Hill
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
EPICS Base
Won't Fix
Medium
Jeff Hill

Bug Description

If the application calls ca_context_destroy (or
ca_task_exit) when a TCP circuit is unresponsive,
the library could wait a long time on a clean TCP
shutdown until the TCP circuit is internally aborted
by the IP kernel.

Additional information:
Many thanks to Yevgeny A. Gusev for bringing this
to our attention. He also was kind enough to supply the
following test code that reproduces the problem.

#include <stdio.h>
#include <stdlib.h>
#include <cadef.h>

static const char *progname;
static int disconnected = 0;

static void exception_handler( struct exception_handler_args args )
{
    if ( args.chid )
 fprintf( stderr, "%s: exception handler for \\"%s\\": %s (%s)\\n",
   progname, ca_name( args.chid ), ca_message( args.stat ),
   args.ctx );
    else
 fprintf( stderr, "%s: exception handler message: %s (%s)\\n",
   progname, ca_message( args.stat ), args.ctx );

    if ( ! ( args.stat & CA_M_SUCCESS ) &&
      CA_EXTRACT_SEVERITY( args.stat ) != CA_K_WARNING )
 ca_signal( args.stat, "CA exception" );
}

static void event_handler( struct event_handler_args args )
{
    if ( args.status == ECA_NORMAL )
 fprintf( stdout, "%s: event handler called for \\"%s\\"\\n", progname,
   ca_name( args.chid ) );
    else
 fprintf( stderr, "%s: event handler error for \\"%s\\": %s\\n", progname,
   ca_name( args.chid ), ca_message( args.status ) );
}

static void connect_handler( struct connection_handler_args args )
{
    chid chId = args.chid;

    fprintf( stderr, "%s: connection handler for \\"%s\\": %s\\n",
      progname, ca_name( chId ), ( ca_state( chId ) == cs_conn ) ?
     "CONNECTED" : "DISCONNECTED" );

    if ( ca_state( chId ) != cs_conn )
 disconnected = 1;
}

int main( int argc, char *argv[] )
{
    char *name;
    chid chId;
    evid evId;
    int status;

    if ( argc < 2 ) {
 fprintf( stdout, "%s: Usage: %s PV-name\\n", argv[0], argv[0] );
 exit( 0 );
    }

    progname = argv[0];

    status = ca_context_create( ca_disable_preemptive_callback );
    if ( status != ECA_NORMAL ) {
 fprintf( stderr, "%s can't create CA context: %s\\n",
   progname, ca_message( status ) );
 exit( 0 );
    }

    status = ca_add_exception_event( exception_handler, NULL );
    if ( status != ECA_NORMAL ) {
 ca_context_destroy();
 fprintf( stderr, "%s can't add CA exception handler: %s\\n",
   progname, ca_message( status ) );
 exit( 0 );
    }

    name = argv[1];

    status = ca_create_channel( name, ( caCh * ) connect_handler, NULL,
         ( capri ) 50, &chId );
    if ( ( status != ECA_NORMAL ) ||
       ( ( status = ca_pend_io( 5.0 ) ) != ECA_NORMAL ) ) {
 fprintf( stderr, "%s can't create channel \\"%s\\": %s\\n",
   progname, name, ca_message( status ) );
 exit( 0 );
    }

    status = ca_create_subscription( DBR_DOUBLE, 1, chId, DBE_VALUE | DBE_ALARM,
         event_handler, NULL, &evId );
    if ( ( status != ECA_NORMAL ) ||
       ( ( status = ca_pend_io( 5.0 ) ) != ECA_NORMAL ) ) {
 fprintf( stderr, "%s can't add event handler for \\"%s\\": %s\\n",
   progname, name, ca_message( status ) );
 exit( 0 );
    }

    while ( 1 ) {
    /*
 To see how ca_context_destroy() will hang up it must be called
 before ca_pend_event() when some network activity probably isn't
 completed.
    */
 if ( disconnected ) {
     fprintf( stdout, "%s: ca_context_destroy start\\n", progname );
     ca_context_destroy();
     fprintf( stdout, "%s: ca_context_destroy end\\n", progname );
     exit( 0 );
 }

 ca_pend_event( 5.0 );
    }

    exit( 1 );
}

Original Mantis Bug: mantis-95
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=95

Tags: ca 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

fixed the following problems:
o application could hang for too long in ca_context_destroy() attempting a
clean shutdown if circuit is known to be unresponsive
o application could hang for too long in ca_context_destroy() attempting a
clean shutdown if circuit is not known to be unresponsive, but is
o unexpected errno from recv diagnostic was improperrly supressed
for locally initiated circuit abort

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.7 Released

Revision history for this message
Matthieu Bec (mbec) wrote :

I found that in this scenario and using R3-14-11, ca_context_destroy() does not hang but takes up to 30 seconds (ca/tcpiiu.cpp). To the end users, the experience is similar to a hang.

Here are some easy steps to reproduce the issue

- start a Striptool (I use version 2.5.12.0)
- start a monitor on a connected channel
- unplug your network cable
- select file-> exit from the Striptool interface
- the GUI exits after 30s.

Note this scenario actually tends to happen quite often if you work from a laptop or on the wireless.

conversely, edm (1-11-0zi) does not attempt to gracefully shutdown channel access, produces some exception on stderr but does exit immediately.

It would be great to provide some ways in the API so well behaved clients some control over the timeout value. Maybe using a separate environment, and intuitively the value should default to EPICS_CA_CONN_TMO.

-- Thanks

Revision history for this message
Matthieu Bec (mbec) wrote :

I'm re-opening the bug in case my comment does not show on anyone's dashboard (it does not for me).

Changed in epics-base:
status: Fix Released → Incomplete
Revision history for this message
Jeff Hill (johill-lanl) wrote :

I think I understand that the complaint is that EPICS_CA_CONN_TMO is too long to wait to find out that a circuit is unresponsive, and needs to be abruptly shutdown, during CA client library shutdown.

Some observations that can be discussed:

o There are risks with a short timeout; we might not finish with pushing out a large transfer to a slow server when shutting down.

o Users _do_ have an option to change EPICS_CA_CONN_TMO.

o I'm not at this time 100% certain if we are blocked attempting to push data into the socket, or blocked when closing the socket in the above scenario.

o The linger socket options are complex so perhaps something can be done if the issue is with being blocked when closing the socket.

Revision history for this message
Matthieu Bec (mbec) wrote :

If I am willing to let some clients give up earlier and set EPICS_CA_CONN_TMO=10, the effective time to shut-down (...in that somewhat unusual scenario...) still takes 30s.
I think it's in the last step of tcpSendThread::run - that uses 30s no matter what I have in my environment:

    while ( ! this->iiu.recvThread.exitWait ( 30.0 ) ) {
        // it is possible to get stuck here if the user calls
        // ca_context_destroy() when a circuit isnt known to
        // be unresponsive, but is. That situation is probably
        // rare, and the IP kernel might have a timeout for
        // such situations, nevertheless we will attempt to deal
        // with it here after waiting a reasonable amount of time
        // for a clean shutdown to finish.
        epicsGuard < epicsMutex > guard ( this->iiu.mutex );
        this->iiu.initiateAbortShutdown ( guard );
    }

I don't know if those 30s should be EPICS_CA_CONN_TMO or a separate environment for more granularity. but it seems to me the value should default to EPICS_CA_CONN_TMO

Thank you

Revision history for this message
Matthieu Bec (mbec) wrote :

Reading your last point, I hope this can be fixed differently.
Back to my previous entry, based on experiment (so does not come with any kind of warranty!)
- if I set EPICS_CA_CONN_TMO=10, lose the network, close my client within those 10s - it will take 30s to exit
- if I set EPICS_CA_CONN_TMO=10, lose the network, close my client *after* those 10s, it will close immediately

As an end user application programmer/user, I would envision my app. to:
- quit reasonably promptly
- be reasonably tolerant to slow servers or network (meaning I am happy with the default EPICS_CA_CONN_TMO=30)
- try to have my app. reasonably well behaved (i.e, call context_destroy)
- ... but not so well behaved I have to wait a long time to exit, so maybe govern the final attempts to leave gracefully with a smaller value or a configurable environment like EPICS_CA_SHTDWN_TMO

--- Thanks

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Set the milestone back to 3.14-branch since the issue seems to be still unresolved.

Changed in epics-base:
milestone: 3.14.7 → 3.14.branch
Andrew Johnson (anj)
summary: - hang in ca_context_destroy() when circuit is unresponsive
+ ca_context_destroy() slow when circuit unresponsive
Changed in epics-base:
status: Incomplete → Confirmed
importance: Wishlist → Medium
Revision history for this message
Andrew Johnson (anj) wrote :

Nobody's worked on this in 7 years, it just ain't going to get fixed.

Changed in epics-base:
status: Confirmed → Incomplete
status: Incomplete → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.