thread joinable race

Bug #1866651 reported by mdavidsaver on 2020-03-09
Affects: EPICS Base
Importance: Critical
Assigned to: mdavidsaver

Bug Description

https://epics.anl.gov/core-talk/2020/msg00333.php

While following up on a strange CI test failure I noticed during the recent codeathon [1],
I realized I made a mistake when adding epicsThreadMustJoin() [2]. That change
introduced a reference counter to struct epicsThreadOSD. The bug is that the ref counter
is (conditionally) incremented only after pthread_create() has returned.
This lets a short-lived thread which attempts to self-join run before the increment, racing toward a double free().
And it happens that epicsThreadTest does this.
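
For illustration, here is a minimal sketch of the safe ordering, assuming C11 atomics and plain pthreads. It is not the code from the fix commit: thread_info, release(), self_joining_thread() and spawn_self_joining() are hypothetical names, and the struct is only a stand-in for struct epicsThreadOSD. The point it tries to show is that both references (one for the thread, one for whoever joins) exist before pthread_create() is called, so a thread that self-joins and exits immediately releases exactly two references and frees the object once.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct epicsThreadOSD; not the real layout. */
typedef struct thread_info {
    atomic_int refcnt;   /* one reference per owner */
    atomic_int joinable; /* 1 until somebody claims the join */
} thread_info;

static atomic_int frees; /* how many times the object was freed */
static atomic_int done;  /* set by the thread after its last release */

static void release(thread_info *info)
{
    /* atomic_fetch_sub returns the value held before the decrement */
    int prev = atomic_fetch_sub(&info->refcnt, 1);
    assert(prev > 0);
    if (prev == 1) { /* we held the last reference */
        atomic_fetch_add(&frees, 1);
        free(info);
    }
}

/* A short-lived thread which joins itself and exits right away. */
static void *self_joining_thread(void *arg)
{
    thread_info *info = arg;
    int expected = 1;
    if (atomic_compare_exchange_strong(&info->joinable, &expected, 0)) {
        pthread_detach(pthread_self());
        release(info); /* drop the joiner's reference */
    }
    release(info);     /* drop the thread's own reference; frees here */
    atomic_store(&done, 1);
    return NULL;
}

/* Returns the number of frees observed (1 expected), or -1 on failure. */
static int spawn_self_joining(void)
{
    pthread_t tid; /* kept local so pthread_create() never writes into info */
    thread_info *info = malloc(sizeof *info);
    if (!info)
        return -1;
    /* Both references are in place BEFORE the thread can run. */
    atomic_init(&info->refcnt, 2);
    atomic_init(&info->joinable, 1);
    atomic_store(&frees, 0);
    atomic_store(&done, 0);
    if (pthread_create(&tid, NULL, self_joining_thread, info) != 0) {
        free(info);
        return -1;
    }
    /* The creator must not touch info from here on. */
    while (!atomic_load(&done))
        sched_yield();
    return atomic_load(&frees);
}
```

If the counter instead started at 1 and the creator added the second reference after pthread_create() returned, the thread's two release() calls could both run first, which is the double free() described above.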

The fix is, I think, straightforward [3]. How severe should this issue be considered?
It's a race which can cause a crash at runtime. However, the circumstances seem uncommon.

[1] https://travis-ci.org/mdavidsaver/epics-base/jobs/649447749#L6255-L6261

> Dumping a stack trace of thread '_main_':
> [ 0x7f9a9a027ade]: /home/travis/build/mdavidsaver/epics-base/lib/linux-x86_64/libCom.so.3.17.7(epicsStackTrace+0x5e)
> [ 0x7f9a9a017d97]: /home/travis/build/mdavidsaver/epics-base/lib/linux-x86_64/libCom.so.3.17.7(cantProceed+0xb7)
> [ 0x7f9a9a023273]: /home/travis/build/mdavidsaver/epics-base/lib/linux-x86_64/libCom.so.3.17.7(epicsThreadMustJoin+0x93)
> [ 0x401b5a]: ./epicsThreadTest(main+0x40a)
> [ 0x7f9a992d1b97]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)
> [ 0x40168a]: ./epicsThreadTest(_start+0x2a)

Or sometimes locally w/ valgrind

> Process terminating with default action of signal 11 (SIGSEGV)
> Access not within mapped region at address 0x8
> at 0x487C0A7: ellDelete (ellLib.c:81)
> by 0x489B8E5: free_threadInfo (osdThread.c:217)
> by 0x489CF06: epicsThreadMustJoin (osdThread.c:656)
> by 0x10A835: (anonymous namespace)::joinTests(void*) (epicsThreadTest.cpp:118)
> by 0x489C463: start_routine (osdThread.c:411)
> by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
> by 0x4E08FA2: start_thread (pthread_create.c:486)
> by 0x4D394CE: clone (clone.S:95)

Or sometimes locally w/ gdb

> malloc(): unsorted double linked list corrupted

occurring during a subsequent create_threadInfo()

[2] https://code.launchpad.net/~epics-core/epics-base/+git/Com/+merge/361379

[3] https://github.com/mdavidsaver/epics-base/commit/02a24a144d0c062311212c769926c1e2df5a1a52

mdavidsaver (mdavidsaver) wrote :

Fix for posix committed as 02a24a144d0c062311212c769926c1e2df5a1a52. Fix for WIN32 in progress.

Ralph Lange (ralph-lange) wrote :

All at "1 hr" - they're stalling.
At the end of the test result collection phase?

mdavidsaver (mdavidsaver) wrote :

Yup. Looks like epicsExitTest is hanging. I'm wondering if my use of epicsAtomic* in osdThread is triggering an init (or deinit) deadlock. Haven't had time to investigate this yet.

mdavidsaver (mdavidsaver) wrote :

WIN32 fix 46fa31020ed4c5d3e4055eb63e4e34ecd341ba0c

It turns out that I'd forgotten to change the joinable flag from a char to an int.
And thanks to the wonders of implicit casting, passing a char* to epicsAtomicCmpAndSwapIntT()
will compile. There was a warning, which I didn't see. Luckily the joinable flag was at the end of the struct and triggered a fault, or I might not have noticed at all.
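
A small sketch of the int-sized flag, assuming C11 atomics. It does not use the epicsAtomic API: cas_int() is a hypothetical stand-in written to return the previous value, and thread_flags and claim_join() are made-up names for illustration. Because the flag has the same width as the compare-and-swap operand, the swap touches only the flag's own storage.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the relevant part of the thread struct. */
typedef struct thread_flags {
    int isRunning;       /* placeholder for other fields */
    atomic_int joinable; /* int-sized, so an int-wide CAS fits exactly */
} thread_flags;

/* Guard against the flag ever shrinking back to a narrower type. */
_Static_assert(sizeof(((thread_flags *)0)->joinable) == sizeof(int),
               "joinable must be int-sized for an int compare-and-swap");

/* Stand-in compare-and-swap: returns the value found before the call. */
static int cas_int(atomic_int *target, int oldval, int newval)
{
    int expected = oldval;
    /* On failure, expected is overwritten with the current value. */
    atomic_compare_exchange_strong(target, &expected, newval);
    return expected;
}

/* Returns 1 if this caller won the right to join, 0 otherwise. */
static int claim_join(thread_flags *f)
{
    return cas_int(&f->joinable, 1, 0) == 1;
}
```

Only the first caller of claim_join() gets 1; any later caller sees the flag already cleared and gets 0.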

Changed in epics-base:
status: In Progress → Fix Committed
Andrew Johnson (anj) on 2020-05-29
Changed in epics-base:
status: Fix Committed → Fix Released