crashes in ipAddrToAsciiAsynchronous (pure virtual method called)

Bug #541244 reported by Jeff Hill
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Invalid
High
Andrew Johnson

Bug Description

From Kay Kasemir,

One of the ArchiveEngines that Johnny uses under R3.14.7 tends to crash
quite often.
The only special thing about it might be that is currently has several channels that point to a soft IOC which is down, i.e. a few unresolved channels. On the console, CAC keeps printing things like "disconnected",
"connection refused", ...

I now got a core file which shows ipAddrToAsciiAsynchronous
run into a NULL callback pointer. Last words of the engine:
   Virtual circuit disconnect
   pure virtual method called
   core dumped

gdb backtrace:
#0 0xb6eb1c0f in raise () from /lib/tls/libc.so.6
#1 0xb6eb3415 in abort () from /lib/tls/libc.so.6
#2 0xb7079527 in __cxa_call_unexpected () from /usr/lib/libstdc++.so.5 #3 0xb7079574 in std::terminate () from /usr/lib/libstdc++.so.5 #4 0xb7079a77 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.5 #5 0xb716b358 in ipAddrToAsciiEnginePrivate::run (this=0xb6fbea98) at
../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:278
#6 0xb716c158 in epicsThreadCallEntryPoint (pPvt=0x80897e4) at
../../../src/libCom/osi/epicsThread.cpp:59
#7 0xb7172ec3 in start_routine (arg=0x80899c0) at
../../../src/libCom/osi/os/posix/osdThread.c:301
#8 0xb7126dec in start_thread () from /lib/tls/libpthread.so.0 #9 0xb6f65e8a in clone () from /lib/tls/libc.so.6
(gdb) fra 5
...
(gdb) list
275 {
276 epicsGuardRelease < epicsMutex > unguard (
guard );
277 // dont call callback with lock applied
278 this->pCurrent->pCB->transactionComplete (
this->nameTmp );
279 }
(gdb) print *this
$2 = {<ipAddrToAsciiEngine> = {_vptr.ipAddrToAsciiEngine = 0x1349a4},
<epicsThreadRunable> = {_vptr.epicsThreadRunable = 0xb6fc2000},
   nameTmp = "...unreadable garbage ...",
   transactionFreeList = {
     mutex = {<No data fields>}, pFreeList = 0x0, pChunkList = 0x0},
labor = {pFirst = 0x0, pLast = 0x0, itemCount = 0}, mutex = {id = 0x0},
laborEvent = {id = 0x0},
   destructorBlockEvent = {id = 0x0}, thread = {runable = @0x0, id =
0x0, mutex = {id = 0x0}, event = {id = 0x0}, exitEvent = {id = 0x0},
pWaitReleaseFlag = 0x0,
     begin = false, cancel = false, terminated = false}, pCurrent = 0x0,
cancelPendingCount = 0, exitFlag = false, callbackInProgress = false,
static globalMutex = {
     id = 0x80870a0}, static pEngine = 0x80893b8, static
numberOfReferences = 1}

Note that pCurrent is 0, hence this->pCurrent->pCB->transactionComplete
( this->nameTmp );
crashes.

Original Mantis Bug: mantis-202
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=202

Tags: libcom 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

Initial indications point to irregularities here.

The dump below says that the crash occurred at ipAddrToAsciiAsynchronous.cpp:278 yet it also indicates that "this-> callbackInProgress" is false. Looking at the source I know that "this-> callbackInProgress" is set to true just prior to ipAddrToAsciiAsynchronous.cpp:278 and set to false just after that line. The only other line in the source that modifies this variable is the constructor for ipAddrToAsciiEnginePrivate.

Therefore, I can conclude that one of two circumstances is occurring.
1) The implementation of "this->thread.exitWait ();" isn't working correctly (and the destructor is allowed to return before the ipAddrToAsciiEnginePrivate thread exits).
2) Some other agent is illegitimately modifying (corrupting) the ipAddrToAsciiEnginePrivate data structure.

I am inclined to conclude that number (2) is probably the correct deduction because every non-static field in ipAddrToAsciiEnginePrivate is zero. The fields "laborEvent = {id = 0x0}, destructorBlockEvent = {id = 0x0}" both being zero is particularly suspicious. No doubt this might be allowed to occur if "this->thread.exitWait ();" wasn’t working, but since "epicsThread::exitWait()" is depended upon by many threads in base then I am inclined to conclude that the probability of that outcome would be quite small.

edited on: 2005-05-31 10:06

Revision history for this message
Jeff Hill (johill-lanl) wrote :

> I'm running the engine under valgrind.
> None of this should matter in the real world.

The problem always is forcing the bug to be reproduced. Have you seen it crash while the code was instrumented with valgrind?

I have used this type of tool from two different commercial sources and I notice that each of them excelled in different areas. There were real differences in what was detected. That was a few years ago. The technology might be maturing at this point. However, if valgrind isn’t using some form of object code insertion it may be ineffective compared to tools that do. Object code insertion technologies seem to provide a good combination of strong detection with minimized impact on runtime dynamics.

If your DNS system is (was) temporarily flaky, turned off, or inaccessible that could cause dramatic changes in runtime dynamics, and also open up new opportunities for bugs of course.

> What I don't understand: Sometimes that message uses '172.31.72.95' and
> sometimes 'lin-vac-srv01.ics.sns.gov'

For R3.14 I stripped all direct calls to DNS out of the CA. I converted many instances (the most important ones) to use ipAddrToAsciiAsynchronous, but since this was a hassle I didn’t use ipAddrToAsciiAsynchronous for every single one of them. Also, if a high water mark on ipAddrToAsciiAsynchronous's queue is reached, or if ipAddrToAsciiAsynchronous is being shut down when items are pending in its queue, then raw IP addresses are printed.

> What should I do to get ipAddrToAsciiAsynchronous to run and to see
> Messages with resolved names?

Connect to an soft IOC, and then stop the soft IOC. That ought to get a message printed with a host name in it, but as I mentioned before runtime dynamics can change.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

> Any hint on what I need to prepare in order to get
> the ipAddrToAsciiAsynchronous code to run?

When the CA client library needs to print a diagnostic message we would prefer to include ASCII names, and not raw IP addresses.

That code interposes an asynchronous wrapper around the DNS API which is a dumb synchronous blocking interface. We have seen in the past that the CA client can be come compromised if the DNS server isn’t available on the network. The client library could get hung for a very long time waiting for a IP address to host name conversion call to complete.

> What's supposed to be behind the callback?

A CA client library specific diagnostic message printing function

> What else is waiting on the 'guard' and thus might cause
> the memory corruption as soon as the 'unguard' runs?

Not much. Primarily, the guard is there to protect the queue from other threads that may be adding or deleting items to it. All of the interactions are in that one source file. Fortunately, it’s a small and easily understood source code.

> Does the chid contain the ipAddrToAsciiAsynchronous memory structure?

The client library obtains a reference to the ipAddrToAsciiAsynchronous facility when it starts up. Multiple client contexts can share the same ipAddrToAsciiAsynchronous instance. The instance is reference counted so the last CA context to go way also destroys the ipAddrToAsciiAsynchronous instance.

The entries in the ipAddrToAsciiAsynchronous queue are dynamically allocated from a free list whenever the client library decides that it needs to print a diagnostic message with a host name in it.

> any idea how the engine would obtain a pointer to it so that it could
> zero > it all out?

My best guess is some sort of random corruption.

> Should I start & stop ca servers to provoke CA disconnects or ??

Couldn’t (shouldn’t) hurt (this wont hurt a bit) ;-)

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Kay,

As a side note, I haven't had another archive engine
crash in the CA client's IP-to-name resolution
with - I think - the same IOC and archive config,
so that might have been caused by the behavior
of the DNS server connection which is no longer reproducible.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Perhaps the next step here will be to test on Linux when the DNS entry isnt available.

I can think of three situations:
1) Configured DNS server does not respond
2) Configured DNS server is slow responding
3) Configured DNS server is responding but the specified host name isnt in the db

I suspect that (3) is frequently encountered, but its conceivable that there is a failure related to one of the other two which was not uncovered during regression testing.

Revision history for this message
Andrew Johnson (anj) wrote :

4 years and 4 versions ago, re-submit if its still a problem.

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.11 released.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.