Thread synchronization issue in libCom/osi/os/WIN32/osdTime.cpp
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
| EPICS Base |
Medium
|
Unassigned | ||
| 3.14 |
Medium
|
Jeff Hill |
Bug Description
EPICS Base version 3.14.9
File: /src/libCom/
Function: epicsTimerNotif
Line 444: if ( curPerfCounter.
Problem: There is a thread synchronization issue which causes the 'curPerfCounter
How to fix:
We can move line 441 ' EnterCriticalSe
description: | updated |
Andrew Johnson (anj) wrote : | #1 |
Jeff Hill (johill-lanl) wrote : | #2 |
if we move the enter critical section up in the function it doesn't help because that part of the code only accesses the auto (function local) variable.
Nevertheless, I do see a bug. This is hopefully a proper fix.
>> bzr diff
=== modified file 'src/libCom/
--- src/libCom/
+++ src/libCom/
@@ -409,8 +409,8 @@
EnterCriti
- LONGLONG perfCounterDiff = curPerfCounter.
- if ( curPerfCounter.
+ LONGLONG perfCounterDiff;
+ if ( curPerfCounter.
}
else {
Jeff Hill (johill-lanl) wrote : | #3 |
the test results after making that change (I notice no change in the test output before or after this change)
C:\hill\
1..195
ok 1 - epicsEpoch.
ok 2 - epicsEpoch.
ok 3 - epicsEpoch.
ok 4 - epicsEpoch.
ok 5 - epicsEpoch.
ok 6 - diff <= precisionEPICS + precisionNTP
ok 7 - nanosecond overflow throws
ok 8 - undefined => '<undefined>'
ok 9 - '%Y-%m-%d %S.%09f' => '1990-01-01 00.098765432'
ok 10 - '%S.%04f' => '00.0988'
ok 11 - '%S.%05f' => '00.09877'
ok 12 - '%S.%05f %S.%05f' => '00.09877 00.09877'
ok 13 - '%S.%05f' => '00.*'
ok 14 - '%%S.%%05f' => '%S.%05f'
ok 15 - bad format => '<invalid format>'
# Running 10 loops
# 100000 calls to epicsTime:
ok 16 - copy == end
ok 17 - copy <= end
ok 18 - copy >= end
ok 19 - end > begin
ok 20 - end >= begin
ok 21 - begin != end
ok 22 - begin < end
ok 23 - begin <= end
ok 24 - end - end == 0
ok 25 - end - begin ~= diff
ok 26 - begin + 0 == begin
ok 27 - begin + diff == end
ok 28 - end - 0 == end
ok 29 - end - diff == begin
ok 30 - (begin += diff) == end
ok 31 - (end -= diff) == begin
ok 32 - beginANSI + diff == end
ok 33 - beginTS + diff == end
ok 34 - copy == end
ok 35 - copy <= end
ok 36 - copy >= end
ok 37 - end > begin
ok 38 - end >= begin
ok 39 - begin != end
ok 40 - begin < end
ok 41 - begin <= end
ok 42 - end - end == 0
ok 43 - end - begin ~= diff
ok 44 - begin + 0 == begin
ok 45 - begin + diff == end
ok 46 - end - 0 == end
ok 47 - end - diff == begin
ok 48 - (begin += diff) == end
ok 49 - (end -= diff) == begin
ok 50 - beginANSI + diff == end
ok 51 - beginTS + diff == end
ok 52 - copy == end
ok 53 - copy <= end
ok 54 - copy >= end
ok 55 - end > begin
ok 56 - end >= begin
ok 57 - begin != end
ok 58 - begin < end
ok 59 - begin <= end
ok 60 - end - end == 0
ok 61 - end - begin ~= diff
ok 62 - begin + 0 == begin
ok 63 - begin + diff == end
ok 64 - end - 0 == end
ok 65 - end - diff == begin
ok 66 - (begin += diff) == end
ok 67 - (end -= diff) == begin
ok 68 - beginANSI + diff == end
ok 69 - beginTS + diff == end
ok 70 - copy == end
ok 71 - copy <= end
ok 72 - copy >= end
ok 73 - end > begin
ok 74 - end >= begin
ok 75 - begin != end
ok 76 - begin < end
ok 77 - begin <= end
ok 78 - end - end == 0
ok 79 - end - begin ~= diff
ok 80 - begin + 0 == begin
ok 81 - begin + diff == end
ok 82 - end - 0 == end
ok 83 - end - diff == begin
ok 84 - (begin += diff) == end
ok 85 - (end -= diff) == begin
ok 86 - beginANSI + diff == end
ok 87 - beginTS + diff == end
ok 88 - copy == end
ok 89 - copy <= end
ok 90 - copy >= end
ok 91 - end > begin
ok 92 - end >= begin
ok 93 - begin != end
ok 94 - begin < end
ok 95 - begin <= end
ok 96 - end - end == 0
ok 97 - end - begin ~= diff
ok 98 - begin + 0 == begin
ok 99 - begin + diff == end
ok 100 - end - 0 == end
ok 101 - end - diff == begin
ok 102 - (begin += diff) == end
ok 103 - (end -= diff) == begin
ok 104 - beginANSI + diff == end
ok 105 - beginTS + diff == end
ok 106 - copy == end
ok 107 - copy <= end
ok 108 - copy >= end
ok 109 - end > begin
ok 110 - end >= begi...
Jeff Hill (johill-lanl) wrote : | #4 |
BTW: how did you detect that we had an issue
Alex Chen (alex-chen) wrote : Re: [Bug 697509] Re: Thread synchronization issue in libCom/osi/os/WIN32/osdTime.cpp | #5 |
Hey Jeff,
Yes, they are all local variables, but the reason to move up the
'EnterCriticalS
is no race conditions for local variables. The reason is to preserve the
right order to update the 'curPerfCounter' and 'lastPerfCounter', we must
guarantee that the 'lastPerfCounter' is always updated at the end of the
last iteration, and 'curPerfCounter' is updated at the beginning of a new
iteration. But I found this issue:
1. Thread A executes currentTime:
has already filled the local variable 'curPerfCounter' (say 10001) and
wait to enter the critical section which is currently obtained by Thread
B.
2. Thread B executes currentTime:
'lastPerfCounter' (say 10009) at line 402 within the critical section,
then thread B exits the critical section.
3. Thread A gets the chance to enter the critial section just released by
thread B, but it finds the 'lastPerfCounter' is bigger than the
'curPerfCounter', so it will handle it as a roll over case, but it's
actually not a real roll over case.
I found this issue on NI's PharLap platform(a sort of realtime Windows),
but I think it could show up on other platforms as well during concurrency
situations.
BTW, the big headache for me is actually not this bug, but Bug 697516
which throws a std::logic_error exception and causes system to crash.
Thanks,
Alex
From:
Jeff Hill <email address hidden>
To:
<email address hidden>
Date:
01/06/2011 09:45 AM
Subject:
[Bug 697509] Re: Thread synchronization issue in
libCom/
Sent by:
<email address hidden>
if we move the enter critical section up in the function it doesn't help
because that part of the code only accesses the auto (function local)
variable.
Nevertheless, I do see a bug. This is hopefully a proper fix.
>> bzr diff
=== modified file 'src/libCom/
--- src/libCom/
+++ src/libCom/
@@ -409,8 +409,8 @@
EnterCriti
- LONGLONG perfCounterDiff = curPerfCounter.
this->lastPerfC
- if ( curPerfCounter.
+ LONGLONG perfCounterDiff;
+ if ( curPerfCounter.
this->lastPerfC
}
else {
--
You received this bug notification because you are a direct subscriber
of the bug.
https:/
Title:
Thread synchronization issue in libCom/
Status in EPICS Base:
New
Bug description:
EPICS Base version 3.14.9
File: /src/libCom/
Function: epicsTimerNotif
epicsTime & )
Line 444: if ( curPerfCounter.
Problem: There is a thread synchronization issue which causes the
'curPerfCounter
we will handle it as a roll over case. But actually this is not a roll
over case, the ...
Alex Chen (alex-chen) wrote : | #6 |
Did this test run on a multicore machine?
Thanks,
Alex
From:
Jeff Hill <email address hidden>
To:
<email address hidden>
Date:
01/06/2011 09:45 AM
Subject:
[Bug 697509] Re: Thread synchronization issue in
libCom/
Sent by:
<email address hidden>
the test results after making that change (I notice no change in the
test output before or after this change)
C:\hill\
1..195
ok 1 - epicsEpoch.
ok 2 - epicsEpoch.
ok 3 - epicsEpoch.
ok 4 - epicsEpoch.
ok 5 - epicsEpoch.
ok 6 - diff <= precisionEPICS + precisionNTP
ok 7 - nanosecond overflow throws
ok 8 - undefined => '<undefined>'
ok 9 - '%Y-%m-%d %S.%09f' => '1990-01-01 00.098765432'
ok 10 - '%S.%04f' => '00.0988'
ok 11 - '%S.%05f' => '00.09877'
ok 12 - '%S.%05f %S.%05f' => '00.09877 00.09877'
ok 13 - '%S.%05f' => '00.*'
ok 14 - '%%S.%%05f' => '%S.%05f'
ok 15 - bad format => '<invalid format>'
# Running 10 loops
# 100000 calls to epicsTime:
ok 16 - copy == end
ok 17 - copy <= end
ok 18 - copy >= end
ok 19 - end > begin
ok 20 - end >= begin
ok 21 - begin != end
ok 22 - begin < end
ok 23 - begin <= end
ok 24 - end - end == 0
ok 25 - end - begin ~= diff
ok 26 - begin + 0 == begin
ok 27 - begin + diff == end
ok 28 - end - 0 == end
ok 29 - end - diff == begin
ok 30 - (begin += diff) == end
ok 31 - (end -= diff) == begin
ok 32 - beginANSI + diff == end
ok 33 - beginTS + diff == end
ok 34 - copy == end
ok 35 - copy <= end
ok 36 - copy >= end
ok 37 - end > begin
ok 38 - end >= begin
ok 39 - begin != end
ok 40 - begin < end
ok 41 - begin <= end
ok 42 - end - end == 0
ok 43 - end - begin ~= diff
ok 44 - begin + 0 == begin
ok 45 - begin + diff == end
ok 46 - end - 0 == end
ok 47 - end - diff == begin
ok 48 - (begin += diff) == end
ok 49 - (end -= diff) == begin
ok 50 - beginANSI + diff == end
ok 51 - beginTS + diff == end
ok 52 - copy == end
ok 53 - copy <= end
ok 54 - copy >= end
ok 55 - end > begin
ok 56 - end >= begin
ok 57 - begin != end
ok 58 - begin < end
ok 59 - begin <= end
ok 60 - end - end == 0
ok 61 - end - begin ~= diff
ok 62 - begin + 0 == begin
ok 63 - begin + diff == end
ok 64 - end - 0 == end
ok 65 - end - diff == begin
ok 66 - (begin += diff) == end
ok 67 - (end -= diff) == begin
ok 68 - beginANSI + diff == end
ok 69 - beginTS + diff == end
ok 70 - copy == end
ok 71 - copy <= end
ok 72 - copy >= end
ok 73 - end > begin
ok 74 - end >= begin
ok 75 - begin != end
ok 76 - begin < end
ok 77 - begin <= end
ok 78 - end - end == 0
ok 79 - end - begin ~= diff
ok 80 - begin + 0 == begin
ok 81 - begin + diff == end
ok 82 - end - 0 == end
ok 83 - end - diff == begin
ok 84 - (begin += diff) == end
ok 85 - (end -= diff) == begin
ok 86 - beginANSI + diff == end
ok 87 - beginTS + diff == end
ok 88 - copy == end
ok 89 - copy <= end
ok 90 - copy >= end
ok 91 - end > begin
ok 92 - end >= begin
ok 93 - begin != end
ok 94 - begin < end
ok 95 - begin <= end
ok 96 - end - end == 0
ok 97 - end - begin ~= diff
ok 98 - begin + 0 == begin
ok 99 - begin + diff == end
ok 100 - end - 0 =...
Jeff Hill (johill-lanl) wrote : RE: [Bug 697509] Re: Thread synchronization issue in libCom/osi/os/WIN32/osdTime.cpp | #7 |
Hi Alex,
Yes, it did run on 4 core (8 virtual core) machine, an
Intel windows 7 workstation.
How did you detect a failure?
Jeff
_______
Jeffrey O. Hill Email <email address hidden>
LANL MS H820 Voice 505 665 1831
Los Alamos NM 87545 USA FAX 505 665 5107
Message content: TSPA
With sufficient thrust, pigs fly just fine. However, this is
not necessarily a good idea. It is hard to be sure where they
are going to land, and it could be dangerous sitting under them
as they fly overhead. -- RFC 1925
> -----Original Message-----
> From: <email address hidden> [mailto:<email address hidden>] On Behalf Of
> Alex Chen
> Sent: Wednesday, January 05, 2011 11:23 PM
> To: <email address hidden>
> Subject: Re: [Bug 697509] Re: Thread synchronization issue in
> libCom/
>
> Did this test run on a multicore machine?
>
> Thanks,
>
> Alex
>
>
>
> From:
> Jeff Hill <email address hidden>
> To:
> <email address hidden>
> Date:
> 01/06/2011 09:45 AM
> Subject:
> [Bug 697509] Re: Thread synchronization issue in
> libCom/
> Sent by:
> <email address hidden>
>
>
> the test results after making that change (I notice no change in the
> test output before or after this change)
>
> C:\hill\
> x86>epicsTimeTe
> 1..195
> ok 1 - epicsEpoch.
> ok 2 - epicsEpoch.
> ok 3 - epicsEpoch.
> ok 4 - epicsEpoch.
> ok 5 - epicsEpoch.
> ok 6 - diff <= precisionEPICS + precisionNTP
> ok 7 - nanosecond overflow throws
> ok 8 - undefined => '<undefined>'
> ok 9 - '%Y-%m-%d %S.%09f' => '1990-01-01 00.098765432'
> ok 10 - '%S.%04f' => '00.0988'
> ok 11 - '%S.%05f' => '00.09877'
> ok 12 - '%S.%05f %S.%05f' => '00.09877 00.09877'
> ok 13 - '%S.%05f' => '00.*'
> ok 14 - '%%S.%%05f' => '%S.%05f'
> ok 15 - bad format => '<invalid format>'
> # Running 10 loops
> # 100000 calls to epicsTime:
> ok 16 - copy == end
> ok 17 - copy <= end
> ok 18 - copy >= end
> ok 19 - end > begin
> ok 20 - end >= begin
> ok 21 - begin != end
> ok 22 - begin < end
> ok 23 - begin <= end
> ok 24 - end - end == 0
> ok 25 - end - begin ~= diff
> ok 26 - begin + 0 == begin
> ok 27 - begin + diff == end
> ok 28 - end - 0 == end
> ok 29 - end - diff == begin
> ok 30 - (begin += diff) == end
> ok 31 - (end -= diff) == begin
> ok 32 - beginANSI + diff == end
> ok 33 - beginTS + diff == end
> ok 34 - copy == end
> ok 35 - copy <= end
> ok 36 - copy >= end
> ok 37 - end > begin
> ok 38 - end >= begin
> ok 39 - begin != end
> ok 40 - begin < end
> ok 41 - begin <= end
> ok 42 - end - end == 0
> ok 43 - end - begin ~= diff
> ok 44 - begin + 0 == begin
> ok 45 - begin + diff == end
> ok 46 - end - 0 == end
> ok 47 - end - diff == begin
> ok 48 - (begin += diff) == end
> ok 49 - (end -= diff) == begin
> ok 50 - beginANSI + diff == end
> ok 51 - beginTS + diff == end
> ok 52 - copy == end
> ok 53 - copy <= end
> ok 54 - copy >= end
> ok 55 - end > begin
> ok 56 - end >= begin
> ok 57 - begin != end
> ok 58 ...
Changed in epics-base: | |
importance: | Undecided → Medium |
assignee: | nobody → Jeff Hill (johill-lanl) |
milestone: | none → 3.14.branch |
status: | New → Confirmed |
Alex Chen (alex-chen) wrote : RE: [Bug 697509] Re: Thread synchronization issue in libCom/osi/os/WIN32/osdTime.cpp | #8 |
Jeff,
We're currently using the portable CA server on PharLap(one realtime
Windows). We're actually using only 4 components: cas.dll, gdd.dll,
ca.dll, Com.dll. We found this thread synchronization during the debugging
of another Bug 697516. We simply put two lines of code like this:
printf(
if ( curPerfCounter.
before line 444 at file base-3.
The currentTime:
could catch one abnormal 'curPerfCounter
case.
Thanks,
Alex
From:
Jeff Hill <email address hidden>
To:
<email address hidden>
Date:
01/07/2011 12:11 AM
Subject:
RE: [Bug 697509] Re: Thread synchronization issue in
libCom/
Sent by:
<email address hidden>
Hi Alex,
Yes, it did run on 4 core (8 virtual core) machine, an
Intel windows 7 workstation.
How did you detect a failure?
Jeff
_______
Jeffrey O. Hill Email <email address hidden>
LANL MS H820 Voice 505 665 1831
Los Alamos NM 87545 USA FAX 505 665 5107
Message content: TSPA
With sufficient thrust, pigs fly just fine. However, this is
not necessarily a good idea. It is hard to be sure where they
are going to land, and it could be dangerous sitting under them
as they fly overhead. -- RFC 1925
> -----Original Message-----
> From: <email address hidden> [mailto:<email address hidden>] On Behalf Of
> Alex Chen
> Sent: Wednesday, January 05, 2011 11:23 PM
> To: <email address hidden>
> Subject: Re: [Bug 697509] Re: Thread synchronization issue in
> libCom/
>
> Did this test run on a multicore machine?
>
> Thanks,
>
> Alex
>
>
>
> From:
> Jeff Hill <email address hidden>
> To:
> <email address hidden>
> Date:
> 01/06/2011 09:45 AM
> Subject:
> [Bug 697509] Re: Thread synchronization issue in
> libCom/
> Sent by:
> <email address hidden>
>
>
> the test results after making that change (I notice no change in the
> test output before or after this change)
>
> C:\hill\
> x86>epicsTimeTe
> 1..195
> ok 1 - epicsEpoch.
> ok 2 - epicsEpoch.
> ok 3 - epicsEpoch.
> ok 4 - epicsEpoch.
> ok 5 - epicsEpoch.
> ok 6 - diff <= precisionEPICS + precisionNTP
> ok 7 - nanosecond overflow throws
> ok 8 - undefined => '<undefined>'
> ok 9 - '%Y-%m-%d %S.%09f' => '1990-01-01 00.098765432'
> ok 10 - '%S.%04f' => '00.0988'
> ok 11 - '%S.%05f' => '00.09877'
> ok 12 - '%S.%05f %S.%05f' => '00.09877 00.09877'
> ok 13 - '%S.%05f' => '00.*'
> ok 14 - '%%S.%%05f' => '%S.%05f'
> ok 15 - bad format => '<invalid format>'
> # Running 10 loops
> # 100000 calls to epicsTime:
> ok 16 - copy == end
> ok 17 - copy <= end
> ok 18 - copy >= end
> ok 19 - end > begin
> ok 20 - end >= begin
> ok 21 - begin != end
> ok 22 - begin < end
> ok 23 - begin <= end
> ok 24 - end - end == 0
> ok...
Jeff Hill (johill-lanl) wrote : | #9 |
agree with your assessment of situation. Committed this fix to the R3.14 branch (revision12291).
=== modified file 'src/libCom/
--- src/libCom/
+++ src/libCom/
@@ -395,6 +395,8 @@
//
epicsTimerNoti
{
+ EnterCriticalSe
+
// avoid interruptions by briefly becoming a time critical thread
LARGE_INTEGER curFileTime;
LARGE_INTEGER curPerfCounter;
@@ -410,8 +412,6 @@
}
- EnterCriticalSe
-
LONGLONG perfCounterDiff;
if ( curPerfCounter.
no longer affects: | epics-base/3.15 |
Changed in epics-base: | |
status: | Fix Committed → Fix Released |
I'm not sure whether it's a good idea to make calls to some of the routines that this change would encompass inside the critical section, but I defer to Jeff for proper understand of what's going on here; I don't really understand how the PLL code is supposed to work on Win32.
Jeff, could you investigate this on the 3.14 branch version please.
Thanks,
- Andrew