EPICS Base

thread joinable race

Bug #1866651 reported by mdavidsaver on 2020-03-09

Affects      Status         Importance   Assigned to   Milestone
EPICS Base   Fix Released   Critical     mdavidsaver   EPICS Base 7.0.4

Bug Description

https://epics.anl.gov/core-talk/2020/msg00333.php

While following up on a strange CI test failure seen during the recent codeathon [1],
I found a mistake I made when adding epicsThreadMustJoin() [2]. That change
introduced a reference counter in struct epicsThreadOSD. The bug is that the
counter is (conditionally) incremented only after pthread_create() returns.
A short-lived thread that attempts to self-join can therefore race with that
increment, which can end in a double free().
epicsThreadTest happens to do exactly this.
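
To make the ordering problem concrete, here is a minimal single-threaded replay of one possible interleaving. It is only a model: model_info, model_release() and the replay functions are hypothetical names, no memory is actually freed, and the "free when the count is no longer positive" test is an assumption of this sketch rather than something taken from osdThread.c.

```c
#include <assert.h>

/* Hypothetical model of a ref-counted thread record.
 * Nothing is really freed; we only count would-be free() calls. */
typedef struct {
    int refcnt;
    int frees;
} model_info;

/* Drop one reference. ASSUMPTION: the record is freed whenever the
 * decremented count is no longer positive. */
static void model_release(model_info *m)
{
    assert(m != 0);
    if (--m->refcnt <= 0)
        m->frees++;
}

/* Joiner's reference is taken only after pthread_create() returns,
 * and the new thread runs to completion before that happens. */
static int replay_increment_after_create(void)
{
    model_info m = { 1, 0 };  /* only the thread's own reference so far */
    model_release(&m);        /* self-join drops the joiner's ref: 1 -> 0, free #1 */
    model_release(&m);        /* thread exit drops its own ref: 0 -> -1, free #2 */
    m.refcnt++;               /* creator's late increment hits a dead record */
    return m.frees;
}

/* Both references are in place before the thread can run. */
static int replay_increment_before_create(void)
{
    model_info m = { 2, 0 };  /* thread's ref plus joiner's ref */
    model_release(&m);        /* self-join: 2 -> 1, no free */
    model_release(&m);        /* thread exit: 1 -> 0, single free */
    return m.frees;
}
```

In this model the late increment yields two would-be frees, while taking both references up front yields one.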

I think the fix is straightforward [3]. How severe should this issue be considered?
It is a race that can cause a crash at runtime, but the circumstances that trigger it seem uncommon.
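
One way to close this kind of race is to take every reference before pthread_create() is called, so the count is already correct whenever the new thread starts. The sketch below uses plain pthreads and C11 atomics. It may or may not match the commit in [3], and sketch_info and the sketch_* functions are hypothetical names, not the EPICS API.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct epicsThreadOSD; not the real layout. */
typedef struct {
    atomic_int refcnt;   /* one reference for the thread, one for a joiner */
    int joinable;
} sketch_info;

static atomic_int sketch_frees;  /* number of times a record was freed */
static atomic_int sketch_done;   /* set by the thread just before it returns */

/* Drop one reference and free the record when the last one goes away. */
static void sketch_release(sketch_info *info)
{
    /* atomic_fetch_sub returns the value held before the subtraction. */
    if (atomic_fetch_sub(&info->refcnt, 1) == 1) {
        atomic_fetch_add(&sketch_frees, 1);
        free(info);
    }
}

/* Short-lived thread that self-joins: it detaches and drops the joiner's
 * reference itself, then drops its own reference on the way out. */
static void *sketch_start(void *arg)
{
    sketch_info *info = arg;
    if (info->joinable) {
        pthread_detach(pthread_self());
        sketch_release(info);
    }
    sketch_release(info);
    atomic_store(&sketch_done, 1);
    return NULL;
}

/* Create a joinable thread with both references taken BEFORE
 * pthread_create(), leaving nothing to fix up afterwards. */
static int sketch_create(void)
{
    pthread_t tid;
    sketch_info *info = malloc(sizeof(*info));
    assert(info != NULL);
    info->joinable = 1;
    atomic_init(&info->refcnt, 2);
    if (pthread_create(&tid, NULL, sketch_start, info) != 0) {
        free(info);  /* the thread never started; nobody else holds a ref */
        return -1;
    }
    /* info must not be touched here: the thread may already have freed it. */
    return 0;
}

/* Run the scenario once and report how many times the record was freed. */
static int sketch_run(void)
{
    atomic_store(&sketch_frees, 0);
    atomic_store(&sketch_done, 0);
    if (sketch_create() != 0)
        return -1;
    while (!atomic_load(&sketch_done))
        sched_yield();
    return atomic_load(&sketch_frees);
}
```

Called from a main() in a program built with -pthread, sketch_run() should report exactly one free, however quickly the new thread finishes.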

[1] https://travis-ci.org/mdavidsaver/epics-base/jobs/649447749#L6255-L6261

> Dumping a stack trace of thread '_main_':
> [ 0x7f9a9a027ade]: /home/travis/build/mdavidsaver/epics-base/lib/linux-x86_64/libCom.so.3.17.7(epicsStackTrace+0x5e)
> [ 0x7f9a9a017d97]: /home/travis/build/mdavidsaver/epics-base/lib/linux-x86_64/libCom.so.3.17.7(cantProceed+0xb7)
> [ 0x7f9a9a023273]: /home/travis/build/mdavidsaver/epics-base/lib/linux-x86_64/libCom.so.3.17.7(epicsThreadMustJoin+0x93)
> [ 0x401b5a]: ./epicsThreadTest(main+0x40a)
> [ 0x7f9a992d1b97]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)
> [ 0x40168a]: ./epicsThreadTest(_start+0x2a)

Or, sometimes, locally with valgrind:

> Process terminating with default action of signal 11 (SIGSEGV)
> Access not within mapped region at address 0x8
> at 0x487C0A7: ellDelete (ellLib.c:81)
> by 0x489B8E5: free_threadInfo (osdThread.c:217)
> by 0x489CF06: epicsThreadMustJoin (osdThread.c:656)
> by 0x10A835: (anonymous namespace)::joinTests(void*) (epicsThreadTest.cpp:118)
> by 0x489C463: start_routine (osdThread.c:411)
> by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
> by 0x4E08FA2: start_thread (pthread_create.c:486)
> by 0x4D394CE: clone (clone.S:95)

Or, sometimes, locally with gdb:

> malloc(): unsorted double linked list corrupted

This occurs during a subsequent create_threadInfo() call.

[2] https://code.launchpad.net/~epics-core/epics-base/+git/Com/+merge/361379

[3] https://github.com/mdavidsaver/epics-base/commit/02a24a144d0c062311212c769926c1e2df5a1a52