Shutdown order problem ("epicsMutexLock failed" on client exit)

Bug #541249 reported by Jeff Hill
This bug report is a duplicate of: Bug #541362: crash while exiting ca client.
This bug affects 1 person
Affects: EPICS Base
Status: Fix Released
Importance: Wishlist
Assigned to: Andrew Johnson
Milestone: (none)

Bug Description

From Chris Slominski

I have downloaded EPICS base 3.14.7 and built it under the provided 'hpux-parisc' setting of $EPICS_HOST_ARCH. All built fine. I built a test client program that performs a couple of channel access fetches and prints the results. The test program works fine when I build against our existing EPICS 3.13.2 installation, but there is a problem running with the newer EPICS version. The program performs the channel access operations correctly, but the signal value prints are sometimes followed by "epicsThreadOnceOsd epicsMutexLock failed" written to the command shell. I suspect the message is generated when I close the channel and exit the program. The number of times the message is written varies from zero to two. More often it is zero, and only once have I seen two. Can anyone explain to me what this message means, and what I need to do to correct it?

Additional information:
This occurs in this code

void epicsThreadOnceOsd(epicsThreadOnceId *id, void (*func)(void *), void *arg)
{
    int status;
    epicsThreadInit();
    status = mutexLock(&onceLock);
    if(status) {
        fprintf(stderr,"epicsThreadOnceOsd epicsMutexLock failed.\n");
        exit(-1);
    }
    if (*id == 0) { /* 0 => first call */
        *id = -1; /* -1 => func() active */
        /* avoid recursive locking */
        status = pthread_mutex_unlock(&onceLock);
        checkStatusQuit(status,"pthread_mutex_unlock","epicsThreadOnceOsd");
        func(arg);
        status = mutexLock(&onceLock);
        checkStatusQuit(status,"pthread_mutex_lock","epicsThreadOnceOsd");
        *id = +1; /* +1 => func() done (see epicsThreadOnce() macro defn) */
    }
    status = pthread_mutex_unlock(&onceLock);
    checkStatusQuit(status,"pthread_mutex_unlock","epicsThreadOnceOsd");
}

Original Mantis Bug: mantis-207
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=207

Tags: libcom 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

The failure is reported in the POSIX version of epicsThreadOnceOsd. I suspect that there are orderly shutdown issues (some facility has shutdown prior to when a facility that is using it has shutdown). If you know how to use debuggers and can obtain a stack trace you might be able to speed up the debugging process.

Jeff Hill (johill-lanl) wrote :

From Chris Slominski

    With regards to epics mantis-207 on version 3.14.7 (hpux-parisc), I have now seen the same diagnostic when using the caget that comes bundled with the distribution. This eliminates the suspicion I had about my test client having a bug.

Jeff Hill (johill-lanl) wrote :

I am attempting to wrap up EPICS R3.14.8. I see that we
still have mantis 207 outstanding - a problem detected on HPUX.

Here is the initial bug report.

> The test program works fine when I build against our
> existing EPICS 3.13.2 installation, but there is a problem
> running with the newer EPICS version. The program performs the
> channel access operations correctly, but the signal value
> prints are sometimes followed by "epicsThreadOnceOsd
> epicsMutexLock failed" written to the command shell.
> I suspect the message is generated when I close the channel
> and exit the program. The number of times the message is written
> varies from zero to two. More often it is zero, and only once have
> I seen two. Can anyone explain to me what this message means, and
> what I need to do to correct it?

And another update.

> With regards to epics mantis-207 on version 3.14.7 (hpux-parisc),
> I have now seen the same diagnostic when using the caget that
> comes bundled with the distribution. This eliminates the suspicion
> I had about my test client having a bug.

I see that caget does not call ca_context_destroy (nor the legacy ca_task_exit).

I suspect that what has occurred is that CA threads are still
running when your program exits. These problems probably occur when
these threads continue to run after the C++ destructors for file scope
objects run. That could cause the carpet to be yanked out
from under these auxiliary threads.
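
A minimal sketch of the ordering issue described above, using plain POSIX threads rather than the CA library. The `facility` type and the `facility_start` / `facility_shutdown` functions are hypothetical names invented for this illustration; they are not part of EPICS. The only point is that the auxiliary thread is stopped and joined before the mutex it depends on is destroyed:

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Hypothetical stand-in for a library facility that owns an auxiliary thread. */
typedef struct {
    pthread_mutex_t lock;
    pthread_t thread;
    int stop;     /* guarded by lock */
    long ticks;   /* guarded by lock */
    int running;  /* touched only by the owning thread */
} facility;

/* Auxiliary thread: repeatedly takes the lock until asked to stop. */
static void *facility_worker(void *arg)
{
    facility *f = arg;
    for (;;) {
        int stop;
        if (pthread_mutex_lock(&f->lock) != 0) {
            /* The kind of failure seen if the lock is torn down too early. */
            fprintf(stderr, "facility_worker: mutex lock failed\n");
            return NULL;
        }
        stop = f->stop;
        f->ticks++;
        pthread_mutex_unlock(&f->lock);
        if (stop)
            return NULL;
        sched_yield();
    }
}

/* Create the mutex first, then the thread that uses it. Returns 0 on success. */
int facility_start(facility *f)
{
    assert(f != NULL);
    f->stop = 0;
    f->ticks = 0;
    f->running = 0;
    if (pthread_mutex_init(&f->lock, NULL) != 0)
        return -1;
    if (pthread_create(&f->thread, NULL, facility_worker, f) != 0) {
        pthread_mutex_destroy(&f->lock);
        return -1;
    }
    f->running = 1;
    return 0;
}

/* Orderly shutdown: signal the thread, join it, and only then destroy the mutex. */
int facility_shutdown(facility *f)
{
    assert(f != NULL);
    if (!f->running)
        return -1;
    pthread_mutex_lock(&f->lock);
    f->stop = 1;
    pthread_mutex_unlock(&f->lock);
    if (pthread_join(f->thread, NULL) != 0)
        return -1;
    f->running = 0;
    return pthread_mutex_destroy(&f->lock);
}
```

A client would call `facility_start` early in main and `facility_shutdown` before returning from main or calling exit. Skipping the shutdown call leaves the worker running while process teardown proceeds, which is the situation suspected here.
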

So at this point I see these possible options:

1) Call ca_context_destroy from all such applications prior to
calling exit (or returning from main).

2) Obtain a stack trace from within your debugger for the failure
occurring from inside exit() - that can be hard to do depending on the
quality of HPUX debuggers. With the stack trace I might be able to
see better what is occurring and install a workaround. Given that
this is occurring on HPUX and not on other systems, I doubt
that I can make progress on a workaround w/o a stack trace.

No doubt that some developers would not list (2) as an option and
claim that lack of (1) was the cause (pass the buck to the
application).

Jeff Hill (johill-lanl) wrote :

From Chris,

> The software I have used when working with epics 3.14.7
> does call the legacy ca_task_exit() function. In fact, if I
> remember correctly, that function is precisely where it
> hangs. Are you saying that call needs to be replaced by
> ca_context_destroy() ?

To proceed we will need a stack trace as I don't have HPUX here.

Andrew Johnson (anj) wrote :

No response in > 12 months.

Andrew Johnson (anj) wrote :

R3.14.9 cleanup.

Nick Rees (nick-rees) wrote :

This isn't just HP-UX. If I do:

for (( i=0; i<100; i++ )) ; do caget CS-CS-MSTAT-01:MODE; echo $?; done

On RHEL5/R3.14.8.2 this generates the message

epicsThreadOnceOsd epicsMutexLock failed.

(usually, but not always, with exit status 255) about once every 20 caget runs.
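
The exit status of 255 is consistent with the exit(-1) call in the epicsThreadOnceOsd listing in the bug description, since only the low 8 bits of the exit code reach the parent process. A small sketch that checks this on a POSIX system (the helper name `observed_exit_status` is made up for this example):

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code` and return the
 * exit status the parent observes (0..255), or -1 if something went wrong. */
int observed_exit_status(int code)
{
    int wstatus = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;              /* fork failed */
    if (pid == 0)
        _exit(code);            /* child: exit immediately with the given code */

    assert(pid > 0);            /* parent from here on */
    if (waitpid(pid, &wstatus, 0) != pid)
        return -1;
    if (!WIFEXITED(wstatus))
        return -1;              /* child was killed by a signal instead */
    return WEXITSTATUS(wstatus);
}
```

Calling `observed_exit_status(-1)` reports 255, matching what `echo $?` shows in the loop above when the message appears.
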

If I do the same thing on RHEL5/R3.14.11 it doesn't seem to happen (at least not in 500 tries).

Whatever it is, it seems to have got better with R3.14.11. If someone still wants me to generate a stack trace I can, but maybe we can close it with a message.

Ralph Lange (ralph-lange) wrote :

Bob Soliday wrote:
I believe this was solved with entry 334 in Mantis. R3.14.11 was the first release with the fix. The problem occurred when I connected to PVs from 2 or more IOCs and it happened when I was exiting the program. It only happened rarely, but enough to be a big problem for us.

mantis-334 = bug #541362

summary: - shutdown order problem on HPUX
+ Shutdown order problem ("epicsMutexLock failed" on client exit)
Changed in epics-base:
status: Invalid → Fix Released