Some data races found using Helgrind

Bug #1479316 reported by Ambroz Bizjak on 2015-07-29
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Low
mdavidsaver
3.14
Undecided
Unassigned
3.15
Undecided
Unassigned
3.16
Low
mdavidsaver

Bug Description

Hello,

I've been testing my application with Valgrind's Helgrind, a thread debugging tool. I have six patches for various data races. Most of them appear benign, but as far as I know, even the "benign" ones can still theoretically cause problems, possibly due to compiler optimizations. Below are short descriptions of patches and the Helgrind warnings. These patches are generated from a modified version of R3.14.11, but probably apply fine to latest 3.14. I have checked for all issues if they appear in 3.15 branch; all but the dbCaCtl issue do.

Note that I've seen a few other warnings from Helgrind related to Base code but I believe they were false positives for all practical purposes. Notable ones were data races where volatile is already used (I think that's fine when used carefully), and many warnings about inconsistent lock ordering related to dbScanLock (where deadlock is not possible due to use of pthread_mutex_trylock; Helgrind is known to give false positives in these cases).

repeaterSubscribeTimer.patch
I've added a mutex to protect some variables in repeaterSubscribeTimer.
It's not nice to keep the lock while printing text/errors but it's the easiest solution I could find.
This patch stops the Helgrind warning.

==22259== Possible data race during write of size 1 at 0x4DDEB8C by thread #28
==22259== Locks held: none
==22259== at 0x4D5BE0: repeaterSubscribeTimer::confirmNotify() (repeaterSubscribeTimer.cpp:101)
==22259== by 0x4C5B4F: udpiiu::repeaterAckAction(ca_hdr const&, osiSockAddr const&, epicsTime const&) (udpiiu.cpp:752)
==22259== by 0x4C5829: udpiiu::postMsg(osiSockAddr const&, char*, unsigned long, epicsTime const&) (udpiiu.cpp:857)
==22259== by 0x4C77A8: udpRecvThread::run() (udpiiu.cpp:394)
==22259== by 0x4E50F0: epicsThreadCallEntryPoint (epicsThread.cpp:85)
==22259== by 0x4EA696: start_routine (osdThread.c:385)
==22259== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==22259== by 0x310BA079D0: start_thread (pthread_create.c:301)
==22259== by 0x310B6E88FC: clone (clone.S:115)
==22259==
==22259== This conflicts with a previous read of size 1 by thread #26
==22259== Locks held: none
==22259== at 0x4D5E15: repeaterSubscribeTimer::expire(epicsTime const&) (repeaterSubscribeTimer.cpp:85)
==22259== by 0x4F0232: timerQueue::process(epicsTime const&) (timerQueue.cpp:140)
==22259== by 0x4EEB1A: timerQueueActive::run() (timerQueueActive.cpp:91)
==22259== by 0x4E50F0: epicsThreadCallEntryPoint (epicsThread.cpp:85)
==22259== by 0x4EA696: start_routine (osdThread.c:385)
==22259== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==22259== by 0x310BA079D0: start_thread (pthread_create.c:301)
==22259== by 0x310B6E88FC: clone (clone.S:115)
==22259==
==22259== Address 0x4DDEB8C is 66748 bytes inside a block of size 66920 alloc'd
==22259== at 0x4A0758C: operator new(unsigned long) (vg_replace_malloc.c:298)
==22259== by 0x4BD7D9: cac::createChannel(epicsGuard<epicsMutex>&, char const*, cacChannelNotify&, unsigned int) (cac.cpp:509)
==22259== by 0x4B9460: oldChannelNotify::oldChannelNotify(epicsGuard<epicsMutex>&, ca_client_context&, char const*, void (*)(connection_handler_args), void*, unsigned int) (oldChannelNotify.cpp:54)
==22259== by 0x4AF4AC: ca_create_channel (access.cpp:333)
==22259== - output omitted -
==22259== by 0x4EA696: start_routine (osdThread.c:385)
==22259== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==22259== by 0x310BA079D0: start_thread (pthread_create.c:301)
==22259== by 0x310B6E88FC: clone (clone.S:115)
==22259==

dbCaCtl.patch
I believe dbCaCtl must be volatile. I've seen similar "ctl" cases where volatile is already used.
The Helgrind warning remains though.

==10904== Possible data race during write of size 4 at 0x756B08 by thread #1
==10904== Locks held: none
==10904== at 0x498C87: dbCaRun (dbCa.c:202)
==10904== by 0x471046: iocRun (iocInit.c:169)
==10904== by 0x471D68: iocInit (iocInit.c:87)
==10904== by 0x4DDE60: iocshBody (iocsh.cpp:765)
==10904== by 0x40E521: main
==10904==
==10904== This conflicts with a previous read of size 4 by thread #36
==10904== Locks held: none
==10904== at 0x499168: dbCaTask (dbCa.c:845)
==10904== by 0x4EA776: start_routine (osdThread.c:385)
==10904== by 0x4A0B819: mythread_wrapper (in /nix/store/60k21lfmxi1z8x1wmhawg24q6n4rba5n-valgrind-3.10.1/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==10904== by 0x310BA079D0: start_thread (pthread_create.c:301)
==10904== by 0x310B6E88FC: clone (clone.S:115)
==10904== Address 0x756b08 is 0 bytes inside data symbol "dbCaCtl"

epicsSingletonMutex.patch
The assert needs to be moved inside the lock.
This patch stops the Helgrind warning.

==3431== Possible data race during read of size 8 at 0x758F08 by thread #27
==3431== Locks held: 1, at address 0x4DC07D8
==3431== at 0x4D92F6: SingletonUntyped::decrRefCount(void (*)(void*)) (epicsSingletonMutex.cpp:55)
==3431== by 0x4EEFD1: ~reference (epicsSingleton.h:104)
==3431== by 0x4EEFD1: epicsTimerQueueActive::allocate(bool, unsigned int) (timerQueueActive.cpp:42)
==3431== by 0x4BFBB2: cac::cac(epicsMutex&, epicsMutex&, cacContextNotify&) (cac.cpp:150)
==3431== by 0x4B5D2D: ca_client_context::ca_client_context(bool) (ca_client_context.cpp:92)
==3431== by 0x4AEF35: ca_context_create (access.cpp:206)
==3431== by 0x41370A: - output omitted -
==3431== by 0x4EA786: start_routine (osdThread.c:385)
==3431== by 0x4A0B819: mythread_wrapper (in /nix/store/60k21lfmxi1z8x1wmhawg24q6n4rba5n-valgrind-3.10.1/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==3431== by 0x310BA079D0: start_thread (pthread_create.c:301)
==3431== by 0x310B6E88FC: clone (clone.S:115)
==3431==
==3431== This conflicts with a previous write of size 8 by thread #1
==3431== Locks held: 1, at address 0x4D7B918
==3431== at 0x4D93E1: SingletonUntyped::incrRefCount(void* (*)()) (epicsSingletonMutex.cpp:50)
==3431== by 0x4ED7F0: reference (epicsSingleton.h:85)
==3431== by 0x4ED7F0: getReference (epicsSingleton.h:205)
==3431== by 0x4ED7F0: epicsTimerQueueAllocate (epicsTimer.cpp:215)
==3431== by 0x4971DD: callbackInitOnce (callback.c:121)
==3431== by 0x4EB230: epicsThreadOnce (osdThread.c:445)
==3431== by 0x47174D: iocBuild (iocInit.c:115)
==3431== by 0x471D58: iocInit (iocInit.c:87)
==3431== by 0x4DDE70: iocshBody (iocsh.cpp:765)
==3431== by 0x40E521: main
==3431== Address 0x758f08 is 8 bytes inside data symbol "timerQueueMgrEPICS"

timerQueueActive.patch
There seems to be a problem with starting a thread in the constructor which calls back a virtual function.
I've changed the code to finish constructing timerQueueActive before starting the thread. Note, epicsTimerQueueActiveForC seems to be the only direct use of timerQueueActive.
This patch stops the Helgrind warning. But I'm not so sure if it really is an issue formally.

==22259== Possible data race during read of size 8 at 0x4D74D18 by thread #5
==22259== Locks held: none
==22259== at 0x4E50EB: epicsThreadCallEntryPoint (epicsThread.cpp:85)
==22259== by 0x4EA696: start_routine (osdThread.c:385)
==22259== by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==22259== by 0x310BA079D0: start_thread (pthread_create.c:301)
==22259== by 0x310B6E88FC: clone (clone.S:115)
==22259==
==22259== This conflicts with a previous write of size 8 by thread #1
==22259== Locks held: 1, at address 0x4D74C38
==22259== at 0x4ED584: epicsTimerQueueActiveForC::epicsTimerQueueActiveForC(epicsSingleton<timerQueueActiveMgr>::reference&, bool, unsigned int) (epicsTimer.cpp:72)
==22259== by 0x4EF5E8: timerQueueActiveMgr::allocate(epicsSingleton<timerQueueActiveMgr>::reference&, bool, unsigned int) (timerQueueActiveMgr.cpp:50)
==22259== by 0x4ED6ED: epicsTimerQueueAllocate (epicsTimer.cpp:216)
==22259== by 0x440882: asynInit (asynManager.c:461)
==22259== by 0x445533: registerPort (asynManager.c:519)
==22259== by 0x4258B1: asynPortDriver::asynPortDriver(char const*, int, int, int, int, int, int, int, int) (asynPortDriver.cpp:2949)
==22259== - output omitted -

osiLocalAddr.patch
The local address determination seems to be unprotected.
To fix it, I have used the same approach with epicsThreadOnce as is used in osdInterrupt.c.
This patch stops the Helgrind warning.

==14531== Possible data race during read of size 1 at 0x758E60 by thread #29
==14531== Locks held: 2, at addresses 0x4E80D38 0x4DC5C08
==14531== at 0x4EBFAE: osiLocalAddr (osdNetIntf.c:252)
==14531== by 0x4C5EA6: caRepeaterRegistrationMessage(int, unsigned int, unsigned int) (udpiiu.cpp:451)
==14531== by 0x4C5F44: udpiiu::M_repeaterTimerNotify::repeaterRegistrationMessage(unsigned int) (udpiiu.cpp:413)
==14531== by 0x4D5EC9: repeaterSubscribeTimer::expire(epicsTime const&) (repeaterSubscribeTimer.cpp:84)
==14531== by 0x4F0352: timerQueue::process(epicsTime const&) (timerQueue.cpp:140)
==14531== by 0x4EEC3A: timerQueueActive::run() (timerQueueActive.cpp:95)
==14531== by 0x4E51E0: epicsThreadCallEntryPoint (epicsThread.cpp:85)
==14531== by 0x4EA786: start_routine (osdThread.c:385)
==14531== by 0x4A0B819: mythread_wrapper (in /nix/store/60k21lfmxi1z8x1wmhawg24q6n4rba5n-valgrind-3.10.1/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==14531== by 0x310BA079D0: start_thread (pthread_create.c:301)
==14531== by 0x310B6E88FC: clone (clone.S:115)
==14531==
==14531== This conflicts with a previous write of size 1 by thread #26
==14531== Locks held: 2, at addresses 0x4DC0438 0x4E03B98
==14531== at 0x4EC0C9: osiLocalAddr (osdNetIntf.c:318)
==14531== by 0x4C5EA6: caRepeaterRegistrationMessage(int, unsigned int, unsigned int) (udpiiu.cpp:451)
==14531== by 0x4C5F44: udpiiu::M_repeaterTimerNotify::repeaterRegistrationMessage(unsigned int) (udpiiu.cpp:413)
==14531== by 0x4D5EC9: repeaterSubscribeTimer::expire(epicsTime const&) (repeaterSubscribeTimer.cpp:84)
==14531== by 0x4F0352: timerQueue::process(epicsTime const&) (timerQueue.cpp:140)
==14531== by 0x4EEC3A: timerQueueActive::run() (timerQueueActive.cpp:95)
==14531== by 0x4E51E0: epicsThreadCallEntryPoint (epicsThread.cpp:85)
==14531== by 0x4EA786: start_routine (osdThread.c:385)
==14531== Address 0x758e60 is 0 bytes inside data symbol "init.4944"

freeListItemsAvail.patch
I've added a mutex when reading the number of free items.
This function is used to check if there are enough buffers when a CA client connects. I believe that this check is otherwise okay, that there is no risk of buffers no longer being available when they subsequently allocated, because only a single thread accepts connections.

==25096== Possible data race during read of size 8 at 0x7BBC068 by thread #45
==25096== Locks held: none
==25096== at 0x4DBF80: freeListItemsAvail (freeListLib.c:149)
==25096== by 0x4729DC: create_client (caservertask.c:711)
==25096== by 0x472B9B: create_tcp_client (caservertask.c:843)
==25096== by 0x4733A2: req_server (caservertask.c:207)
==25096== by 0x4EA786: start_routine (osdThread.c:385)
==25096== by 0x4A0B819: mythread_wrapper (in /nix/store/60k21lfmxi1z8x1wmhawg24q6n4rba5n-valgrind-3.10.1/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==25096== by 0x310BA079D0: start_thread (pthread_create.c:301)
==25096== by 0x310B6E88FC: clone (clone.S:115)
==25096==
==25096== This conflicts with a previous write of size 8 by thread #58
==25096== Locks held: 1, at address 0x7BBC0C8
==25096== at 0x4DC01F: freeListFree (freeListLib.c:126)
==25096== by 0x47277E: destroy_client (caservertask.c:560)
==25096== by 0x473D89: camsgtask (camsgtask.c:164)
==25096== by 0x4EA786: start_routine (osdThread.c:385)
==25096== by 0x4A0B819: mythread_wrapper (in /nix/store/60k21lfmxi1z8x1wmhawg24q6n4rba5n-valgrind-3.10.1/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==25096== by 0x310BA079D0: start_thread (pthread_create.c:301)
==25096== by 0x310B6E88FC: clone (clone.S:115)
==25096== Address 0x7bbc068 is 24 bytes inside a block of size 40 alloc'd
==25096== at 0x4A08D16: calloc (in /nix/store/60k21lfmxi1z8x1wmhawg24q6n4rba5n-valgrind-3.10.1/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==25096== by 0x4E22DC: callocMustSucceed (cantProceed.c:27)
==25096== by 0x4DC205: freeListInitPvt (freeListLib.c:42)
==25096== by 0x472EE5: rsrv_init (caservertask.c:250)
==25096== by 0x471BB5: iocBuild (iocInit.c:151)
==25096== by 0x471D58: iocInit (iocInit.c:87)
==25096== by 0x4DDE70: iocshBody (iocsh.cpp:765)
==25096== by 0x40E521: main
==25096== Block was alloc'd by thread #1

Ambroz Bizjak (ambroz-bizjak) wrote :
Ambroz Bizjak (ambroz-bizjak) wrote :
Ambroz Bizjak (ambroz-bizjak) wrote :
Ambroz Bizjak (ambroz-bizjak) wrote :
Ambroz Bizjak (ambroz-bizjak) wrote :
Ambroz Bizjak (ambroz-bizjak) wrote :
Ambroz Bizjak (ambroz-bizjak) wrote :

Ah, never mind the dbCaCtl issue, it is already fixed in latest 3.14 :)

mdavidsaver (mdavidsaver) wrote :

From a quick glance these all look like valid issues. The repeaterSubscribeTimer patch needs a closer look as I don't think removing const is the best approach.

Changed in epics-base:
importance: Undecided → Low
assignee: nobody → mdavidsaver (mdavidsaver)
mdavidsaver (mdavidsaver) wrote :

Directly applied epicsSingletonMutex.patch, timerQueueActive.patch, and freeListItemsAvail.patch. repeaterSubscribeTimer.patch applied with modifications. For osiLocalAddr.patch simplified just use thread once.

Changed in epics-base:
milestone: none → 3.16.branch
status: New → Fix Committed
Ralph Lange (ralph-lange) wrote :

These patches were aimed at 3.14, so IMHO they should be applied there and then be merged into 3.15 and 3.16, as usual.

Andrew Johnson (anj) wrote :

Thank-you for noticing that Ralph, I agree. We normally fix bugs on the oldest supported branch on which they appear.

Ambroz Bizjak (ambroz-bizjak) wrote :

Thanks for fixing these. I have looked in the 3.16 changes and have just one comment. It seems like in WIN32/osdNetIntf.c, free (pIfinfoList); was forgotten in the successful case.

mdavidsaver (mdavidsaver) wrote :

> free (pIfinfoList); was forgotten in the successful case.

Good catch. Corrected.

> These patches were aimed at 3.14

Wasn't thinking about this, sorry. That said, I don't think backporting is necessary for all. As an example, at present I think that osiLocalAddr() is only ever called from a single thread (repeater's timer queue).

To me, timerQueueActive.patch, epicsSingletonMutex.patch, and freeListItemsAvail.patch should be applied to 3.14.

Andrew Johnson (anj) wrote :

I have just applied timerQueueActive.patch, epicsSingletonMutex.patch and freeListItemsAvail.patch to the 3.14 branch.

Should we do anything with repeaterSubscribeTimer.patch or osiLocalAddr.patch on the 3.14 branch?

Ambroz Bizjak (ambroz-bizjak) wrote :

> As an example, at present I think that osiLocalAddr() is only ever called from a single thread (repeater's timer queue).
I believe I was seeing this in 3.14 when the calls were done on behalf of different CA contexts. In addition to dbCA, I suspect the Sequencer uses its own context, and custom modules may as well.

mdavidsaver (mdavidsaver) wrote :

The three you applied all seem to me straight forward fixes without much chance of side-effects.

With repeaterSubscribeTimer.patch I'm a little concerned about adding a lock to repeaterSubscribeTimer::expire as there are other locks and callbacks involved. I couldn't find anything which deadlocks, but I don't claim to understand this code all that well.

With osiLocalAddr.patch I made some changes to the WIN32 implementation which I don't think are adaquately tested.

IMO more eyes are needed before applying these to stable branches.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers