Thread synchronization issue in libCom/osi/os/WIN32/osdTime.cpp

Reported by Alex Chen on 2011-01-05
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
EPICS Base
Medium
Unassigned
3.14
Medium
Jeff Hill

Bug Description

EPICS Base version 3.14.9

File: /src/libCom/osi/os/WIN32/osdTime.cpp

Function: epicsTimerNotify::expireStatus currentTime::expire ( const epicsTime & )

Line 444: if ( curPerfCounter.QuadPart >= this->lastPerfCounter )

Problem: There is a thread synchronization issue which causes the 'curPerfCounter.QuadPart' to be less than the 'this->lastPerfCounter', and we will handle it as a roll over case. But actually this is not a roll over case, the 'this->lastPerfCounter' is changed by another thread which executes the 'void currentTime::getCurrentTime ( epicsTimeStamp & dest ) '

How to fix:
We can move line 441 ' EnterCriticalSection ( & this->mutex );' to the beginning of this function.

Alex Chen (alex-chen) on 2011-01-05
description: updated
Andrew Johnson (anj) wrote :

I'm not sure whether it's a good idea to make calls to some of the routines that this change would encompass inside the critical section, but I defer to Jeff for proper understand of what's going on here; I don't really understand how the PLL code is supposed to work on Win32.

Jeff, could you investigate this on the 3.14 branch version please.

Thanks,

- Andrew

Jeff Hill (johill-lanl) wrote :

if we move the enter critical section up in the function it doesn't help because that part of the code only accesses the auto (function local) variable.

Nevertheless, I do see a bug. This is hopefully a proper fix.

>> bzr diff
=== modified file 'src/libCom/osi/os/WIN32/osdTime.cpp'
--- src/libCom/osi/os/WIN32/osdTime.cpp 2010-10-05 19:27:37 +0000
+++ src/libCom/osi/os/WIN32/osdTime.cpp 2011-01-06 01:27:14 +0000
@@ -409,8 +409,8 @@

     EnterCriticalSection ( & this->mutex );

- LONGLONG perfCounterDiff = curPerfCounter.QuadPart - this->lastPerfCounterPLL;
- if ( curPerfCounter.QuadPart >= this->lastPerfCounter ) {
+ LONGLONG perfCounterDiff;
+ if ( curPerfCounter.QuadPart >= this->lastPerfCounterPLL ) {
         perfCounterDiff = curPerfCounter.QuadPart - this->lastPerfCounterPLL;
     }
     else {

Jeff Hill (johill-lanl) wrote :
Download full text (5.3 KiB)

the test results after making that change (I notice no change in the test output before or after this change)

C:\hill\epicsInBazaar\R3.14\trunk\src\libCom\test\O.WIN32-x86>epicsTimeTest.exe
1..195
ok 1 - epicsEpoch.ansi_tm.tm_sec == 0
ok 2 - epicsEpoch.ansi_tm.tm_min == 0
ok 3 - epicsEpoch.ansi_tm.tm_hour == 0
ok 4 - epicsEpoch.ansi_tm.tm_yday == 0
ok 5 - epicsEpoch.ansi_tm.tm_year == 90
ok 6 - diff <= precisionEPICS + precisionNTP
ok 7 - nanosecond overflow throws
ok 8 - undefined => '<undefined>'
ok 9 - '%Y-%m-%d %S.%09f' => '1990-01-01 00.098765432'
ok 10 - '%S.%04f' => '00.0988'
ok 11 - '%S.%05f' => '00.09877'
ok 12 - '%S.%05f %S.%05f' => '00.09877 00.09877'
ok 13 - '%S.%05f' => '00.*'
ok 14 - '%%S.%%05f' => '%S.%05f'
ok 15 - bad format => '<invalid format>'
# Running 10 loops
# 100000 calls to epicsTime::getCurrent() averaged 0.283 usec each
ok 16 - copy == end
ok 17 - copy <= end
ok 18 - copy >= end
ok 19 - end > begin
ok 20 - end >= begin
ok 21 - begin != end
ok 22 - begin < end
ok 23 - begin <= end
ok 24 - end - end == 0
ok 25 - end - begin ~= diff
ok 26 - begin + 0 == begin
ok 27 - begin + diff == end
ok 28 - end - 0 == end
ok 29 - end - diff == begin
ok 30 - (begin += diff) == end
ok 31 - (end -= diff) == begin
ok 32 - beginANSI + diff == end
ok 33 - beginTS + diff == end
ok 34 - copy == end
ok 35 - copy <= end
ok 36 - copy >= end
ok 37 - end > begin
ok 38 - end >= begin
ok 39 - begin != end
ok 40 - begin < end
ok 41 - begin <= end
ok 42 - end - end == 0
ok 43 - end - begin ~= diff
ok 44 - begin + 0 == begin
ok 45 - begin + diff == end
ok 46 - end - 0 == end
ok 47 - end - diff == begin
ok 48 - (begin += diff) == end
ok 49 - (end -= diff) == begin
ok 50 - beginANSI + diff == end
ok 51 - beginTS + diff == end
ok 52 - copy == end
ok 53 - copy <= end
ok 54 - copy >= end
ok 55 - end > begin
ok 56 - end >= begin
ok 57 - begin != end
ok 58 - begin < end
ok 59 - begin <= end
ok 60 - end - end == 0
ok 61 - end - begin ~= diff
ok 62 - begin + 0 == begin
ok 63 - begin + diff == end
ok 64 - end - 0 == end
ok 65 - end - diff == begin
ok 66 - (begin += diff) == end
ok 67 - (end -= diff) == begin
ok 68 - beginANSI + diff == end
ok 69 - beginTS + diff == end
ok 70 - copy == end
ok 71 - copy <= end
ok 72 - copy >= end
ok 73 - end > begin
ok 74 - end >= begin
ok 75 - begin != end
ok 76 - begin < end
ok 77 - begin <= end
ok 78 - end - end == 0
ok 79 - end - begin ~= diff
ok 80 - begin + 0 == begin
ok 81 - begin + diff == end
ok 82 - end - 0 == end
ok 83 - end - diff == begin
ok 84 - (begin += diff) == end
ok 85 - (end -= diff) == begin
ok 86 - beginANSI + diff == end
ok 87 - beginTS + diff == end
ok 88 - copy == end
ok 89 - copy <= end
ok 90 - copy >= end
ok 91 - end > begin
ok 92 - end >= begin
ok 93 - begin != end
ok 94 - begin < end
ok 95 - begin <= end
ok 96 - end - end == 0
ok 97 - end - begin ~= diff
ok 98 - begin + 0 == begin
ok 99 - begin + diff == end
ok 100 - end - 0 == end
ok 101 - end - diff == begin
ok 102 - (begin += diff) == end
ok 103 - (end -= diff) == begin
ok 104 - beginANSI + diff == end
ok 105 - beginTS + diff == end
ok 106 - copy == end
ok 107 - copy <= end
ok 108 - copy >= end
ok 109 - end > begin
ok 110 - end >= begi...

Read more...

Jeff Hill (johill-lanl) wrote :

BTW: how did you detect that we had an issue

Download full text (3.5 KiB)

Hey Jeff,

Yes, they are all local variables, but the reason to move up the
'EnterCriticalSection' is not to protect race conditions, obviously there
is no race conditions for local variables. The reason is to preserve the
right order to update the 'curPerfCounter' and 'lastPerfCounter', we must
guarantee that the 'lastPerfCounter' is always updated at the end of the
last iteration, and 'curPerfCounter' is updated at the beginning of a new
iteration. But I found this issue:

1. Thread A executes currentTime::expire() from line 425 to line 439, it
has already filled the local variable 'curPerfCounter' (say 10001) and
wait to enter the critical section which is currently obtained by Thread
B.
2. Thread B executes currentTime::getCurrentTime(), it updates the
'lastPerfCounter' (say 10009) at line 402 within the critical section,
then thread B exits the critical section.
3. Thread A gets the chance to enter the critial section just released by
thread B, but it finds the 'lastPerfCounter' is bigger than the
'curPerfCounter', so it will handle it as a roll over case, but it's
actually not a real roll over case.

I found this issue on NI's PharLap platform(a sort of realtime Windows),
but I think it could show up on other platforms as well during concurrency
situations.

BTW, the big headache for me is actually not this bug, but Bug 697516
which throws a std::logic_error exception and causes system to crash.

Thanks,

Alex

From:
Jeff Hill <email address hidden>
To:
<email address hidden>
Date:
01/06/2011 09:45 AM
Subject:
[Bug 697509] Re: Thread synchronization issue in
libCom/osi/os/WIN32/osdTime.cpp
Sent by:
<email address hidden>

if we move the enter critical section up in the function it doesn't help
because that part of the code only accesses the auto (function local)
variable.

Nevertheless, I do see a bug. This is hopefully a proper fix.

>> bzr diff
=== modified file 'src/libCom/osi/os/WIN32/osdTime.cpp'
--- src/libCom/osi/os/WIN32/osdTime.cpp 2010-10-05 19:27:37 +0000
+++ src/libCom/osi/os/WIN32/osdTime.cpp 2011-01-06 01:27:14 +0000
@@ -409,8 +409,8 @@

     EnterCriticalSection ( & this->mutex );

- LONGLONG perfCounterDiff = curPerfCounter.QuadPart -
this->lastPerfCounterPLL;
- if ( curPerfCounter.QuadPart >= this->lastPerfCounter ) {
+ LONGLONG perfCounterDiff;
+ if ( curPerfCounter.QuadPart >= this->lastPerfCounterPLL ) {
         perfCounterDiff = curPerfCounter.QuadPart -
this->lastPerfCounterPLL;
     }
     else {

--
You received this bug notification because you are a direct subscriber
of the bug.
https://bugs.launchpad.net/bugs/697509

Title:
  Thread synchronization issue in libCom/osi/os/WIN32/osdTime.cpp

Status in EPICS Base:
  New

Bug description:
  EPICS Base version 3.14.9

File: /src/libCom/osi/os/WIN32/osdTime.cpp

Function: epicsTimerNotify::expireStatus currentTime::expire ( const
epicsTime & )

Line 444: if ( curPerfCounter.QuadPart >= this->lastPerfCounter )

Problem: There is a thread synchronization issue which causes the
'curPerfCounter.QuadPart' to be less than the 'this->lastPerfCounter', and
we will handle it as a roll over case. But actually this is not a roll
over case, the ...

Read more...

Alex Chen (alex-chen) wrote :
Download full text (6.6 KiB)

Did this test run on a multicore machine?

Thanks,

Alex

From:
Jeff Hill <email address hidden>
To:
<email address hidden>
Date:
01/06/2011 09:45 AM
Subject:
[Bug 697509] Re: Thread synchronization issue in
libCom/osi/os/WIN32/osdTime.cpp
Sent by:
<email address hidden>

the test results after making that change (I notice no change in the
test output before or after this change)

C:\hill\epicsInBazaar\R3.14\trunk\src\libCom\test\O.WIN32-x86>epicsTimeTest.exe
1..195
ok 1 - epicsEpoch.ansi_tm.tm_sec == 0
ok 2 - epicsEpoch.ansi_tm.tm_min == 0
ok 3 - epicsEpoch.ansi_tm.tm_hour == 0
ok 4 - epicsEpoch.ansi_tm.tm_yday == 0
ok 5 - epicsEpoch.ansi_tm.tm_year == 90
ok 6 - diff <= precisionEPICS + precisionNTP
ok 7 - nanosecond overflow throws
ok 8 - undefined => '<undefined>'
ok 9 - '%Y-%m-%d %S.%09f' => '1990-01-01 00.098765432'
ok 10 - '%S.%04f' => '00.0988'
ok 11 - '%S.%05f' => '00.09877'
ok 12 - '%S.%05f %S.%05f' => '00.09877 00.09877'
ok 13 - '%S.%05f' => '00.*'
ok 14 - '%%S.%%05f' => '%S.%05f'
ok 15 - bad format => '<invalid format>'
# Running 10 loops
# 100000 calls to epicsTime::getCurrent() averaged 0.283 usec each
ok 16 - copy == end
ok 17 - copy <= end
ok 18 - copy >= end
ok 19 - end > begin
ok 20 - end >= begin
ok 21 - begin != end
ok 22 - begin < end
ok 23 - begin <= end
ok 24 - end - end == 0
ok 25 - end - begin ~= diff
ok 26 - begin + 0 == begin
ok 27 - begin + diff == end
ok 28 - end - 0 == end
ok 29 - end - diff == begin
ok 30 - (begin += diff) == end
ok 31 - (end -= diff) == begin
ok 32 - beginANSI + diff == end
ok 33 - beginTS + diff == end
ok 34 - copy == end
ok 35 - copy <= end
ok 36 - copy >= end
ok 37 - end > begin
ok 38 - end >= begin
ok 39 - begin != end
ok 40 - begin < end
ok 41 - begin <= end
ok 42 - end - end == 0
ok 43 - end - begin ~= diff
ok 44 - begin + 0 == begin
ok 45 - begin + diff == end
ok 46 - end - 0 == end
ok 47 - end - diff == begin
ok 48 - (begin += diff) == end
ok 49 - (end -= diff) == begin
ok 50 - beginANSI + diff == end
ok 51 - beginTS + diff == end
ok 52 - copy == end
ok 53 - copy <= end
ok 54 - copy >= end
ok 55 - end > begin
ok 56 - end >= begin
ok 57 - begin != end
ok 58 - begin < end
ok 59 - begin <= end
ok 60 - end - end == 0
ok 61 - end - begin ~= diff
ok 62 - begin + 0 == begin
ok 63 - begin + diff == end
ok 64 - end - 0 == end
ok 65 - end - diff == begin
ok 66 - (begin += diff) == end
ok 67 - (end -= diff) == begin
ok 68 - beginANSI + diff == end
ok 69 - beginTS + diff == end
ok 70 - copy == end
ok 71 - copy <= end
ok 72 - copy >= end
ok 73 - end > begin
ok 74 - end >= begin
ok 75 - begin != end
ok 76 - begin < end
ok 77 - begin <= end
ok 78 - end - end == 0
ok 79 - end - begin ~= diff
ok 80 - begin + 0 == begin
ok 81 - begin + diff == end
ok 82 - end - 0 == end
ok 83 - end - diff == begin
ok 84 - (begin += diff) == end
ok 85 - (end -= diff) == begin
ok 86 - beginANSI + diff == end
ok 87 - beginTS + diff == end
ok 88 - copy == end
ok 89 - copy <= end
ok 90 - copy >= end
ok 91 - end > begin
ok 92 - end >= begin
ok 93 - begin != end
ok 94 - begin < end
ok 95 - begin <= end
ok 96 - end - end == 0
ok 97 - end - begin ~= diff
ok 98 - begin + 0 == begin
ok 99 - begin + diff == end
ok 100 - end - 0 =...

Read more...

Download full text (9.1 KiB)

Hi Alex,

Yes, it did run on 4 core (8 virtual core) machine, an
Intel windows 7 workstation.

How did you detect a failure?

Jeff
______________________________________________________
Jeffrey O. Hill Email <email address hidden>
LANL MS H820 Voice 505 665 1831
Los Alamos NM 87545 USA FAX 505 665 5107

Message content: TSPA

With sufficient thrust, pigs fly just fine. However, this is
not necessarily a good idea. It is hard to be sure where they
are going to land, and it could be dangerous sitting under them
as they fly overhead. -- RFC 1925

> -----Original Message-----
> From: <email address hidden> [mailto:<email address hidden>] On Behalf Of
> Alex Chen
> Sent: Wednesday, January 05, 2011 11:23 PM
> To: <email address hidden>
> Subject: Re: [Bug 697509] Re: Thread synchronization issue in
> libCom/osi/os/WIN32/osdTime.cpp
>
> Did this test run on a multicore machine?
>
> Thanks,
>
> Alex
>
>
>
> From:
> Jeff Hill <email address hidden>
> To:
> <email address hidden>
> Date:
> 01/06/2011 09:45 AM
> Subject:
> [Bug 697509] Re: Thread synchronization issue in
> libCom/osi/os/WIN32/osdTime.cpp
> Sent by:
> <email address hidden>
>
>
> the test results after making that change (I notice no change in the
> test output before or after this change)
>
> C:\hill\epicsInBazaar\R3.14\trunk\src\libCom\test\O.WIN32-
> x86>epicsTimeTest.exe
> 1..195
> ok 1 - epicsEpoch.ansi_tm.tm_sec == 0
> ok 2 - epicsEpoch.ansi_tm.tm_min == 0
> ok 3 - epicsEpoch.ansi_tm.tm_hour == 0
> ok 4 - epicsEpoch.ansi_tm.tm_yday == 0
> ok 5 - epicsEpoch.ansi_tm.tm_year == 90
> ok 6 - diff <= precisionEPICS + precisionNTP
> ok 7 - nanosecond overflow throws
> ok 8 - undefined => '<undefined>'
> ok 9 - '%Y-%m-%d %S.%09f' => '1990-01-01 00.098765432'
> ok 10 - '%S.%04f' => '00.0988'
> ok 11 - '%S.%05f' => '00.09877'
> ok 12 - '%S.%05f %S.%05f' => '00.09877 00.09877'
> ok 13 - '%S.%05f' => '00.*'
> ok 14 - '%%S.%%05f' => '%S.%05f'
> ok 15 - bad format => '<invalid format>'
> # Running 10 loops
> # 100000 calls to epicsTime::getCurrent() averaged 0.283 usec each
> ok 16 - copy == end
> ok 17 - copy <= end
> ok 18 - copy >= end
> ok 19 - end > begin
> ok 20 - end >= begin
> ok 21 - begin != end
> ok 22 - begin < end
> ok 23 - begin <= end
> ok 24 - end - end == 0
> ok 25 - end - begin ~= diff
> ok 26 - begin + 0 == begin
> ok 27 - begin + diff == end
> ok 28 - end - 0 == end
> ok 29 - end - diff == begin
> ok 30 - (begin += diff) == end
> ok 31 - (end -= diff) == begin
> ok 32 - beginANSI + diff == end
> ok 33 - beginTS + diff == end
> ok 34 - copy == end
> ok 35 - copy <= end
> ok 36 - copy >= end
> ok 37 - end > begin
> ok 38 - end >= begin
> ok 39 - begin != end
> ok 40 - begin < end
> ok 41 - begin <= end
> ok 42 - end - end == 0
> ok 43 - end - begin ~= diff
> ok 44 - begin + 0 == begin
> ok 45 - begin + diff == end
> ok 46 - end - 0 == end
> ok 47 - end - diff == begin
> ok 48 - (begin += diff) == end
> ok 49 - (end -= diff) == begin
> ok 50 - beginANSI + diff == end
> ok 51 - beginTS + diff == end
> ok 52 - copy == end
> ok 53 - copy <= end
> ok 54 - copy >= end
> ok 55 - end > begin
> ok 56 - end >= begin
> ok 57 - begin != end
> ok 58 ...

Read more...

Jeff Hill (johill-lanl) on 2011-01-06
Changed in epics-base:
importance: Undecided → Medium
assignee: nobody → Jeff Hill (johill-lanl)
milestone: none → 3.14.branch
status: New → Confirmed
Download full text (11.0 KiB)

Jeff,

We're currently using the portable CA server on PharLap(one realtime
Windows). We're actually using only 4 components: cas.dll, gdd.dll,
ca.dll, Com.dll. We found this thread synchronization during the debugging
of another Bug 697516. We simply put two lines of code like this:

printf("curPerfCounter.QuadPart(%I64X), this->lastPerfCounter(%I64X)\n");
if ( curPerfCounter.QuadPart < this->lastPerfCounter ) getchar();

before line 444 at file base-3.14.9\src\libCom\osi\os\WIN32\osdTime.cpp.
The currentTime::expire() is called every 1 second. Within 1 minute, we
could catch one abnormal 'curPerfCounter.QuadPart < this->lastPerfCounter'
case.

Thanks,

Alex

From:
Jeff Hill <email address hidden>
To:
<email address hidden>
Date:
01/07/2011 12:11 AM
Subject:
RE: [Bug 697509] Re: Thread synchronization issue in
libCom/osi/os/WIN32/osdTime.cpp
Sent by:
<email address hidden>

Hi Alex,

Yes, it did run on 4 core (8 virtual core) machine, an
Intel windows 7 workstation.

How did you detect a failure?

Jeff
______________________________________________________
Jeffrey O. Hill Email <email address hidden>
LANL MS H820 Voice 505 665 1831
Los Alamos NM 87545 USA FAX 505 665 5107

Message content: TSPA

With sufficient thrust, pigs fly just fine. However, this is
not necessarily a good idea. It is hard to be sure where they
are going to land, and it could be dangerous sitting under them
as they fly overhead. -- RFC 1925

> -----Original Message-----
> From: <email address hidden> [mailto:<email address hidden>] On Behalf Of
> Alex Chen
> Sent: Wednesday, January 05, 2011 11:23 PM
> To: <email address hidden>
> Subject: Re: [Bug 697509] Re: Thread synchronization issue in
> libCom/osi/os/WIN32/osdTime.cpp
>
> Did this test run on a multicore machine?
>
> Thanks,
>
> Alex
>
>
>
> From:
> Jeff Hill <email address hidden>
> To:
> <email address hidden>
> Date:
> 01/06/2011 09:45 AM
> Subject:
> [Bug 697509] Re: Thread synchronization issue in
> libCom/osi/os/WIN32/osdTime.cpp
> Sent by:
> <email address hidden>
>
>
> the test results after making that change (I notice no change in the
> test output before or after this change)
>
> C:\hill\epicsInBazaar\R3.14\trunk\src\libCom\test\O.WIN32-
> x86>epicsTimeTest.exe
> 1..195
> ok 1 - epicsEpoch.ansi_tm.tm_sec == 0
> ok 2 - epicsEpoch.ansi_tm.tm_min == 0
> ok 3 - epicsEpoch.ansi_tm.tm_hour == 0
> ok 4 - epicsEpoch.ansi_tm.tm_yday == 0
> ok 5 - epicsEpoch.ansi_tm.tm_year == 90
> ok 6 - diff <= precisionEPICS + precisionNTP
> ok 7 - nanosecond overflow throws
> ok 8 - undefined => '<undefined>'
> ok 9 - '%Y-%m-%d %S.%09f' => '1990-01-01 00.098765432'
> ok 10 - '%S.%04f' => '00.0988'
> ok 11 - '%S.%05f' => '00.09877'
> ok 12 - '%S.%05f %S.%05f' => '00.09877 00.09877'
> ok 13 - '%S.%05f' => '00.*'
> ok 14 - '%%S.%%05f' => '%S.%05f'
> ok 15 - bad format => '<invalid format>'
> # Running 10 loops
> # 100000 calls to epicsTime::getCurrent() averaged 0.283 usec each
> ok 16 - copy == end
> ok 17 - copy <= end
> ok 18 - copy >= end
> ok 19 - end > begin
> ok 20 - end >= begin
> ok 21 - begin != end
> ok 22 - begin < end
> ok 23 - begin <= end
> ok 24 - end - end == 0
> ok...

Jeff Hill (johill-lanl) wrote :

agree with your assessment of situation. Committed this fix to the R3.14 branch (revision12291).

=== modified file 'src/libCom/osi/os/WIN32/osdTime.cpp'
--- src/libCom/osi/os/WIN32/osdTime.cpp 2011-11-11 00:22:22 +0000
+++ src/libCom/osi/os/WIN32/osdTime.cpp 2011-11-11 00:31:27 +0000
@@ -395,6 +395,8 @@
 //
 epicsTimerNotify::expireStatus currentTime::expire ( const epicsTime & )
 {
+ EnterCriticalSection ( & this->mutex );
+
     // avoid interruptions by briefly becoming a time critical thread
     LARGE_INTEGER curFileTime;
     LARGE_INTEGER curPerfCounter;
@@ -410,8 +412,6 @@
         curFileTime.HighPart = ft.dwHighDateTime;
     }

- EnterCriticalSection ( & this->mutex );
-
     LONGLONG perfCounterDiff;
     if ( curPerfCounter.QuadPart >= this->lastPerfCounterPLL ) {
         perfCounterDiff = curPerfCounter.QuadPart - this->lastPerfCounterPLL;

Andrew Johnson (anj) on 2011-12-12
no longer affects: epics-base/3.15
Andrew Johnson (anj) on 2011-12-12
Changed in epics-base:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers