thread joinable race

Bug #1866651 reported by mdavidsaver
This bug affects 1 person
Affects: EPICS Base
Status: Fix Released
Importance: Critical
Assigned to: mdavidsaver

Bug Description

https://epics.anl.gov/core-talk/2020/msg00333.php

While following up on a strange CI test failure I noticed during the recent codeathon [1],
I realized I made a mistake when adding epicsThreadMustJoin() [2]. That change
introduced a reference counter to struct epicsThreadOSD. The bug is in
(conditionally) incrementing the ref counter after pthread_create() has returned.
A short-lived thread which attempts to self-join can drop its references before that increment happens, so it races toward a double free().
And it happens that epicsThreadTest does this.
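
A minimal sketch of the ordering that avoids the race, assuming a simplified layout: `thread_info`, `info_release`, `info_self_join`, `short_lived`, `free_count` and `run_demo` are hypothetical names for illustration, not the code in osdThread.c. The point is only that both references (the thread's own and the joiner's) are counted before pthread_create() is called, so a thread that self-joins and exits immediately cannot bring the counter to zero early.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct epicsThreadOSD; not the real layout. */
typedef struct thread_info {
    atomic_int refcnt;    /* one count per owner of this struct */
    atomic_int joinable;  /* 1 until some thread claims the join */
} thread_info;

static atomic_int free_count; /* how many times an info was freed */

/* Drop one reference; whoever drops the last one frees the struct. */
static void info_release(thread_info *info)
{
    int prev = atomic_fetch_sub(&info->refcnt, 1);
    assert(prev >= 1);
    if (prev == 1) {
        free(info);
        atomic_fetch_add(&free_count, 1);
    }
}

/* Self-join from inside the thread: claim the join flag, detach, and
 * drop the reference that was reserved for the joiner. */
static void info_self_join(thread_info *info)
{
    int expected = 1;
    if (atomic_compare_exchange_strong(&info->joinable, &expected, 0)) {
        pthread_detach(pthread_self());
        info_release(info);
    }
}

/* Thread body that self-joins and exits right away. */
static void *short_lived(void *arg)
{
    thread_info *info = arg;
    info_self_join(info); /* drops the joiner's reference */
    info_release(info);   /* drops the thread's own reference and frees */
    return NULL;
}

/* Returns how many times the struct was freed, or -1 on setup failure. */
static int run_demo(void)
{
    pthread_t tid;
    thread_info *info = malloc(sizeof *info);
    if (!info)
        return -1;
    /* Both references exist before the thread can run. Adding the
     * second one after pthread_create() is the racy ordering. */
    atomic_init(&info->refcnt, 2);
    atomic_init(&info->joinable, 1);
    atomic_store(&free_count, 0);
    if (pthread_create(&tid, NULL, short_lived, info) != 0) {
        free(info);
        return -1;
    }
    /* info may already be freed here, so it is not touched again. */
    while (atomic_load(&free_count) == 0)
        sched_yield();
    return atomic_load(&free_count);
}
```

Calling `run_demo()` from a `main()` should report exactly one free. With the increment moved after pthread_create(), the thread's two releases could free the struct before the creator touches the counter.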

The fix is, I think, straightforward [3]. I'm wondering how severe this issue should be considered?
It's a race which can cause a crash at runtime. However, the circumstances seem not so common.

[1] https://travis-ci.org/mdavidsaver/epics-base/jobs/649447749#L6255-L6261

> Dumping a stack trace of thread '_main_':
> [ 0x7f9a9a027ade]: /home/travis/build/mdavidsaver/epics-base/lib/linux-x86_64/libCom.so.3.17.7(epicsStackTrace+0x5e)
> [ 0x7f9a9a017d97]: /home/travis/build/mdavidsaver/epics-base/lib/linux-x86_64/libCom.so.3.17.7(cantProceed+0xb7)
> [ 0x7f9a9a023273]: /home/travis/build/mdavidsaver/epics-base/lib/linux-x86_64/libCom.so.3.17.7(epicsThreadMustJoin+0x93)
> [ 0x401b5a]: ./epicsThreadTest(main+0x40a)
> [ 0x7f9a992d1b97]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)
> [ 0x40168a]: ./epicsThreadTest(_start+0x2a)

Or sometimes locally w/ valgrind

> Process terminating with default action of signal 11 (SIGSEGV)
> Access not within mapped region at address 0x8
> at 0x487C0A7: ellDelete (ellLib.c:81)
> by 0x489B8E5: free_threadInfo (osdThread.c:217)
> by 0x489CF06: epicsThreadMustJoin (osdThread.c:656)
> by 0x10A835: (anonymous namespace)::joinTests(void*) (epicsThreadTest.cpp:118)
> by 0x489C463: start_routine (osdThread.c:411)
> by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
> by 0x4E08FA2: start_thread (pthread_create.c:486)
> by 0x4D394CE: clone (clone.S:95)

Or sometimes locally w/ gdb

> malloc(): unsorted double linked list corrupted

occurring during a subsequent create_threadInfo()

[2] https://code.launchpad.net/~epics-core/epics-base/+git/Com/+merge/361379

[3] https://github.com/mdavidsaver/epics-base/commit/02a24a144d0c062311212c769926c1e2df5a1a52

mdavidsaver (mdavidsaver) wrote :

Fix for posix committed as 02a24a144d0c062311212c769926c1e2df5a1a52. Fix for WIN32 in progress.

mdavidsaver (mdavidsaver) wrote :
Ralph Lange (ralph-lange) wrote :

All at "1 hr" - they're stalling.
At the end of the test result collection phase?

mdavidsaver (mdavidsaver) wrote :

Yup. Looks like epicsExitTest is hanging. I'm wondering if my use of epicsAtomic* in osdThread is triggering an init (or deinit) deadlock. Haven't had time to investigate this yet.

mdavidsaver (mdavidsaver) wrote :

WIN32 fix 46fa31020ed4c5d3e4055eb63e4e34ecd341ba0c

It turns out that I'd forgotten to change the joinable flag from a char to an int.
And thanks to the wonders of implicit casting, passing a char* to epicsAtomicCmpAndSwapIntT()
will compile. There was a warning, which I didn't see. Luckily the joinable flag was at the end of the struct and triggered a fault, or I might not have noticed at all.
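
A minimal sketch of the int-wide flag, assuming C11 atomics as a stand-in: `cas_int` is a hypothetical helper that returns the value seen before the swap, and `join_state`, `join_state_init` and `claim_join` are made-up names, not the EPICS sources. An int compare-and-swap reads and writes sizeof(int) bytes, so a char flag would let it touch memory beyond the flag itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an int-wide compare-and-swap.
 * Returns the value that was in *target before the call. */
static int cas_int(atomic_int *target, int expected, int desired)
{
    /* On failure, expected is overwritten with the current value;
     * on success it still holds the old value. Either way it is
     * the value seen before the swap. */
    atomic_compare_exchange_strong(target, &expected, desired);
    return expected;
}

/* Hypothetical per-thread state with the flag as the last member. */
typedef struct join_state {
    void *owner;          /* placeholder for the other members */
    atomic_int joinable;  /* int-wide, so the CAS stays inside the flag */
} join_state;

/* Compile-time check that the flag is as wide as the CAS operand. */
_Static_assert(sizeof(((join_state *)0)->joinable) == sizeof(int),
               "joinable must be int-sized for an int compare-and-swap");

static void join_state_init(join_state *st)
{
    assert(st != NULL);
    st->owner = NULL;
    atomic_init(&st->joinable, 1);
}

/* Returns 1 for the single caller that wins the join, 0 otherwise. */
static int claim_join(join_state *st)
{
    return cas_int(&st->joinable, 1, 0) == 1;
}
```

The first `claim_join()` on a freshly initialized state returns 1 and every later call returns 0, which is the "join at most once" behaviour the flag is there to enforce.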

Changed in epics-base:
status: In Progress → Fix Committed
Andrew Johnson (anj)
Changed in epics-base:
status: Fix Committed → Fix Released