CA client crash while doing the rsrv tests

Bug #1580623 reported by mdavidsaver
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
Critical
Unassigned
3.14
Fix Released
Undecided
Andrew Johnson
3.15
Fix Released
Critical
Ralph Lange
3.16
Fix Released
Critical
Unassigned

Bug Description

Heinz Junkes reported the following when he ran the rsrv tests on an Ubuntu 64 box:

> Tests on (output of ifconfig attached)
> inet addr:10.20.30.1 Bcast:10.20.30.255
> and
> inet addr:141.14.133.59 Bcast:141.14.143.255
>
> All tests ok exept for the the very first “caget" on the client I saw this:
>
>
> epics@kannsein:~/EPICS/base-3.15.4-pre1$ bin/linux-x86_64/caget test
> CA.Client.Exception...............................................
> Warning: "Identical process variable names on multiple servers"
> Context: "Channel: "test", Connecting to: localhost:57241, Ignored: kannsein.rz-berlin.mpg.de:57241"
> Source File: ../cac.cpp line 1298
> Current Time: Sun May 08 2016 15:41:26.945259053
> ..................................................................
> test 18
> pure virtual method called
> terminate called without an active exception
> Aborted (core dumped)

Related branches

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

On 05/08/2016 03:53 PM, Heinz Junkes wrote:
> (gdb) thread apply all backtrace
>
> Thread 3 (Thread 0x7f26831a6780 (LWP 2408)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 ()
> at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x00007f2682afe27b in epicsEventWait (pevent=0x7f267c0012a0)
> at ../../../src/libCom/osi/os/posix/osdEvent.c:103
> #2 0x00007f2682af7269 in epicsEventMustWait (id=<optimized out>)
> at ../../../src/libCom/osi/epicsEvent.cpp:125
> #3 0x00007f2682af234c in epicsExitCallAtExitsPvt (pep=<optimized out>)
> at ../../../src/libCom/misc/epicsExit.c:95
> #4 epicsExitCallAtExits () at ../../../src/libCom/misc/epicsExit.c:113
> #5 0x00007f26827305ea in __cxa_finalize (d=0x7f2682d28460)
> at cxa_finalize.c:56
> #6 0x00007f2682adc283 in __do_global_dtors_aux ()
> from /home/epics/EPICS/base-3.15.4-pre1/lib/linux-x86_64/libCom.so.3.15.4
> #7 0x00007ffe04b8b670 in ?? ()
> #8 0x00007f2682fa673a in _dl_fini () at dl-fini.c:252
> Backtrace stopped: frame did not save the PC
>
> Thread 2 (Thread 0x7f26800fb700 (LWP 2415)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 ()
> at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x00007f2682afe27b in epicsEventWait (pevent=0x7f267c001000)
> at ../../../src/libCom/osi/os/posix/osdEvent.c:103
> ---Type <return> to continue, or q <return> to quit---
> #2 0x00007f2682af7269 in epicsEventMustWait (id=<optimized out>)
> at ../../../src/libCom/osi/epicsEvent.cpp:125
> #3 0x00007f2682ae6765 in errlogThread ()
> at ../../../src/libCom/error/errlog.c:569
> #4 0x00007f2682afbabc in start_routine (arg=0x7f267c0048b0)
> at ../../../src/libCom/osi/os/posix/osdThread.c:403
> #5 0x00007f26824de182 in start_thread (arg=0x7f26800fb700)
> at pthread_create.c:312
> #6 0x00007f26827ee47d in clone ()
> at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> Thread 1 (Thread 0x7f2680c39700 (LWP 2410)):
> #0 0x00007f268272acc9 in __GI_raise (sig=sig@entry=6)
> at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1 0x00007f268272e0d8 in __GI_abort () at abort.c:89
> #2 0x00007f2682232535 in __gnu_cxx::__verbose_terminate_handler() ()
> from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #3 0x00007f26822306d6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #4 0x00007f2682230703 in std::terminate() ()
> from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #5 0x00007f26822311bf in __cxa_pure_virtual ()
> from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #6 0x00007f2682af4529 in ipAddrToAsciiEnginePrivate::run (this=0x1c95640)
> at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:273
> ---Type <return> to continue, or q <return> to quit---
> #7 0x00007f2682af5df9 in epicsThreadCallEntryPoint (pPvt=0x1c95a98)
> at ../../../src/libCom/osi/epicsThread.cpp:85
> #8 0x00007f2682afbabc in start_routine (arg=0x1c95e00)
> at ../../../src/libCom/osi/os/posix/osdThread.c:403
> #9 0x00007f26824de182 in start_thread (arg=0x7f2680c39700)
> at pthread_create.c:312
> #10 0x00007f26827ee47d in clone ()
> at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I think the fact that epicsExitCallAtExitsPvt is running is actually
coincidental. The order of atexit hooks for 'caget' is

> atExit errlogExitHandler((nil))
> atExit cacExitHandler((nil))
> atExit ca_client_exit_handler((nil))
> atExit ClockTime_Shutdown((nil))

Since the errlog thread is still waiting on its event I think
errlogExitHandler() hasn't returned.

Before the changes for lp:1527636, with 'caget' the name lookup daemon
thread was stopped during ca_context_destroy() before main() returned.

The crash is coming from ipAddrToAsciiAsynchronous.cpp:273

> this->pCurrent->pCB->transactionComplete ( this->nameTmp );

My suspicion is that one of these will show corruption as something has
been incorrectly/incompletely delete'd. If not, it should tell us which
of the two sub-classes of ipAddrToAsciiCallBack is involved.

I can think of two possibilities, either there is a logic error in
~ipAddrToAsciiTransactionPrivate allowing the dtor to complete w/o
removing itself from a tsDLList, or that the dtor isn't being called at all.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I am not able to reproduce this crash. However, Ralph also reports seeing it, and this report by hdante on github looks to be the same (https://github.com/epics-base/epics-base/pull/2).

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Valgrind output from hdante. In this case the msgForMultiplyDefinedPV sub-class is involved.

Revision history for this message
Heinz Junkes (junkes) wrote :

I still investigate the (for me) random behaviour of the ipAddrToAsciiAsynchronous.cpp class.
I assume dns-delays etc. As I am new I will need some more days to learn which thread gets fired when and why. It seems to depend on msgForMultiplyDefinedPV class.
Just a update. I am still working on it.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I'm still not able to reproduce this issue. I can think of two possible source of this crash. Either a logic error ~ipAddrToAsciiTransactionPrivate or this dtor is never called.

The attached patch simplifies the situation by removing the use of some free list and new+delete operator overloading.

Heinz, can you see if this patch changes anything? Also, can you say whether ~ipAddrToAsciiTransactionPrivate has been called via. ~msgForMultiplyDefinedPV before the crash occurs?

Revision history for this message
Ralph Lange (ralph-lange) wrote :

I can easily reproduce it on a CCSv5.3b1 (RHEL 6.5) fast controller (= physical machine)
Run a soft IOC with a single ai record, then do caget:

langer @ dain-fc1.codac.iter.org : ~ $ caget ai
ai nan
pure virtual method called

or

langer @ dain-fc1.codac.iter.org : ~ $ caget ai
ai nan
pure virtual method called
terminate called without an active exception
Aborted (core dumped)

Approx 30% of the calls are failing in one of those ways.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Also on caput, a different failure:

langer @ dain-fc1.codac.iter.org : ~ $ caput ai 3
Old : ai 3
New : ai 3
pure virtual method called
[ 14:01:12 ]
langer @ dain-fc1.codac.iter.org : ~ $ caput ai 3
Old : ai 3
New : ai 3
epicsMutex pthread_mutex_destroy failed: error Device or resource busy
[ 14:01:13 ]
langer @ dain-fc1.codac.iter.org : ~ $ caput ai 3
Old : ai 3
New : ai 3
epicsMutex pthread_mutex_destroy failed: error Device or resource busy
[ 14:01:15 ]

Revision history for this message
Ralph Lange (ralph-lange) wrote :

On a different machine (virtual) with the same software:

langer @ NEXT-2.codac.iter.org : ~ $ caget ai
ai nan
CA.Client.Exception...............................................
    Warning: "Identical process variable names on multiple servers"
    Context: "Channel: "ai", Connecting to: localhost:45814, Ignored: next-2.codac.iter.org:45814"
    Source File: ../cac.cpp line 1298
    Current Time: Wed May 18 2016 14:05:50.704295735
..................................................................
epicsThread: Unexpected C++ exception "epicsMutex::invalidMutex()" with type "N10epicsMutex12invalidMutexE" in thread "ipToAsciiProxy" at Wed May 18 2016 14:05:50.704651075
terminate called after throwing an instance of 'epicsMutex::invalidMutex'

Revision history for this message
Heinz Junkes (junkes) wrote :

Applied Michaels patch from 2016-05-16. No more core dumps. Only one additional interface (beside localhost) will be reported: "Identical process variable named on multiple servers". Before the patch sometimes additional interfaces were reported too (but not reliable).
Before the patch there where some delays in the output. Now it runs as fast as can be. From my point of view perfect and smooth.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Heinz, can you run with valgrind to see if it reports leaks involving msgForMultiplyDefinedPV?

Ralph, can you confirm?

If this patch changes things, it seems likely that ~msgForMultiplyDefinedPV is/was never called due to some bug involving free lists, custom "operator delete", and explicitly calling the dtor.

So what now? Do we take the time to untangle this bug to retain the free-list, apply my patch which does explicit new/delete w/o a free-list, revert the earlier patch which uncovered this bug, or something else?

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Have to leave in 2 minutes ... will try to verify tomorrow am.

If we see a chance to untangle/fix next week, I will just tag -rc1 as it is.

Revision history for this message
Heinz Junkes (junkes) wrote :
Download full text (7.2 KiB)

Running caget with valgrind again the other interface is advertised too:

epics@kannsein:~/EPICS/base-3.15.4-pre1$ valgrind bin/linux-x86_64/caget test
==95974== Memcheck, a memory error detector
==95974== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==95974== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==95974== Command: bin/linux-x86_64/caget test
==95974==
CA.Client.Exception...............................................
    Warning: "Identical process variable names on multiple servers"
    Context: "Channel: "test", Connecting to: 127.0.0.1:51041, Ignored: kannsein.rz-berlin.mpg.de:51041"
    Source File: ../cac.cpp line 1306
    Current Time: Thu May 19 2016 17:27:25.270869640
..................................................................
"
    Source File: ../cac.cpp line 1306
    Current Time: Thu May 19 2016 17:27:25.038468686
..................................................................
CA.Client.Exception...............................................
    Warning: "Identical process variable names on multiple servers"
    Context: "Channel: "test", Connecting to: 127.0.0.1:51041, Ignored: 10.20.30.1:51041test 74
==95974==
==95974== HEAP SUMMARY:
...

Without valgrind:

epics@kannsein:~/EPICS/base-3.15.4-pre1$ bin/linux-x86_64/caget test
CA.Client.Exception...............................................
    Warning: "Identical process variable names on multiple servers"
    Context: "Channel: "test", Connecting to: localhost:51041, Ignored: kannsein.rz-berlin.mpg.de:51041"
    Source File: ../cac.cpp line 1306
    Current Time: Thu May 19 2016 17:28:34.051648189
..................................................................
test 143

There are none reported leaks for msgForMultiplyDefinedPV:

epics@kannsein:~/EPICS/base-3.15.4-pre1$ valgrind --leak-check=full bin/linux-x86_64/caget test
==96056== Memcheck, a memory error detector
==96056== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==96056== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==96056== Command: bin/linux-x86_64/caget test
==96056==
"
    Source File: ../cac.cpp line 1306
    Current Time: Thu May 19 2016 17:32:22.206396366
..................................................................
CA.Client.Exception...............................................
    Warning: "Identical process variable names on multiple servers"
    Context: "Channel: "test", Connecting to: 127.0.0.1:51041, Ignored: kannsein.rz-berlin.mpg.de:51041"
    Source File: ../cac.cpp line 1306
    Current Time: Thu May 19 2016 17:32:22.097905618
..................................................................
CA.Client.Exception...............................................
    Warning: "Identical process variable names on multiple servers"
    Context: "Channel: "test", Connecting to: 127.0.0.1:51041, Ignored: 10.20.30.1:51041test 371
==96056==
==96056== HEAP SUMMARY:
==96056== in use at exit: 25,410 bytes in 264 blocks
==96056== total heap usage: 473 allocs, 209 frees, 1,062,862 bytes allocated
==96056==
==96056== ...

Read more...

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Observations:

- Crashes only seem to happen with multiple interfaces = "multiply defined PV" messages.
  Configuring the client to use only one NIF (localhost) and AUTO_ADDR=NO I cannot reproduce the issue.
  Setting AUTO_ADDR=YES I see segmentation faults and
    epicsThread: Unexpected C++ exception "epicsMutex::invalidMutex()" with type "N10epicsMutex12invalidMutexE" in thread "ipToAsciiProxy" at ...
  events.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

I can confirm:

After applying Michael's patch of 16 May, I cannot reproduce the faulty behavior anymore.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

> it seems likely that ~msgForMultiplyDefinedPV is/was never called due to some bug involving free lists, custom "operator delete", and explicitly calling the dtor.

I think so.

~msgForMultiplyDefinedPV is explicitly called as part of cac::pvMultiplyDefinedNotify, the callback that is executed when additional DNS answers come in.

Since fixing lp:1527636 removed the phase that waited for all pending requests (and callbacks being executed), the new fast exit means cac::pvMultiplyDefinedNotify is not called.

I think this shows in a fast 'caget' client loop: now I get lots of single line executions, while running the destructor in ipAddrToAsciiEnginePrivate::release() (called from the cac destructor) yields exactly one MultiplyDefined message per call, and no segfaults or exceptions.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

FWIW...

To show up, it needs:
* duplicate PVs (two NIFs or adding localhost)
* a reverse DNS lookup that is slow, but not too slow
  ("best" configuration for the virtual box on my laptop: Google DNS over wireless)
* a short lived client (caget in a shell loop)

My impression: when the DNS answer comes in while the client is shutting down, there is a chance that a callback is executed in a callback object that is deleted (segfaults), or that the DNS call inside a destructed ipAddrToAsciiEngine returns, and taking the guard mutex (when leaving the unguarded section around the DNS call) throws because the mutex has been deleted.

Revision history for this message
Ralph Lange (ralph-lange) wrote :

Fix added to 3.15 branch as http://bazaar.launchpad.net/~epics-core/epics-base/3.15/revision/12767
Reason was a race that could lead to dereferencing a zero pointer during an unguarded section in the ipAddrToNameEngine.

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :

I backported revision 12767 to 3.14. It compiles, but I couldn't really try it.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.