WIN32: IOC scan loops hang

Bug #1896295 reported by Freddie Akeroyd
This bug affects 1 person
Affects: EPICS Base
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: 7.0.5

Bug Description

The IOC scan loops hung on all IOCs on several computers. On investigation, this looks to be caused by the monotonic clock wrapping. The values of next and now in dbScan.c / periodicTask() are consistent with GetTickCount() wrapping (which happens every 49.7 days), which I believe led to epicsEventWaitWithTimeout() being called with a large delay. I was initially confused by the debugger saying a delay of 0 had been passed to epicsEventWaitWithTimeout(), but I cannot see how this is possible (the minimum delay should be penalty), so I assume the debugger is reporting incorrectly because the code was built in release mode.

The EPICS monotonic time functions should not normally be using GetTickCount(). However, it looks like the logic in the test for whether to use QueryPerformanceCounter() is inverted, so GetTickCount() is always used. I will push a patch to fix this, and also to use GetTickCount64() as a fallback instead (which wraps on a much longer timescale).

rivers (rivers) wrote :

I observed a Windows areaDetector IOC not updating the ArrayRate_RBV records when it should. It appeared that the 1 second scan was not running. Restarting the IOC fixed the problem. This is consistent with this bug.

mdavidsaver (mdavidsaver) wrote :

Use of GetTickCount64() seems reasonable to me, it does close the door on XP compatibility though.

I guess this is a preview of 2038. So it may be worthwhile to investigate if epicsTimeDiffInSeconds() can be made to handle rollover reasonably.

Andrew Johnson (anj) wrote :

Hi Freddie, which version of Base are you using here? I thought the code that used the monotonic clock in the internal timer queues never got into a released version, but I just found that dbScan.c is calling epicsTimeGetMonotonic() in its periodic scan code (from commit 8b9ad212c4eb43d), so it seems I was wrong: that code has been in all releases since 7.0.3.1.

Freddie Akeroyd (freddie-akeroyd) wrote :

I have added GetTickCount64() wrapped in a _WIN32_WINNT check, so it is still possible to compile for Windows XP by defining _WIN32_WINNT as 0x0501 externally.

Freddie Akeroyd (freddie-akeroyd) wrote :

We are running base 7.0.3.1

mdavidsaver (mdavidsaver) wrote :

> ... I thought the code that used the monotonic clock in the internal timer queues never got into a released version ...

dbScan.c was the one usage I still had confidence in, so I left it. sigh...

epicsTimeDiffInSeconds() only tries to handle rollover >= `ULONG_MAX/2`. Using it with a time source which rolls over sooner is wrong, and likely unfixable.

mdavidsaver (mdavidsaver) wrote :

And for that matter, use of `ULONG_MAX` seems wrong on targets where `sizeof(long)==8`. E.g. if:

> epicsTimeStamp A = {0xffffffff, 0};
> epicsTimeStamp B = {0x00000001, 0};

Then epicsTimeDiffInSeconds(&B, &A) should be 2.0, right?

On Linux, I get -4294967294.0

Down the rabbit hole we go...

Freddie Akeroyd (freddie-akeroyd) wrote :

QueryPerformanceCounter() is likely to work on Windows 2000 and above, and guaranteed to work on Windows XP and above, so I have removed use of either GetTickCount() or GetTickCount64() in the PR.

mdavidsaver (mdavidsaver) wrote :

Cross-linking: epicsTime: fix overflow handling

https://github.com/epics-base/epics-base/pull/92

Andrew Johnson (anj) wrote :

The problem of scan threads hanging was fixed by reversion of the use of monotonic time for timers, but the underlying issue with the monotonic time support on Windows has now been fixed by Freddie's branch.

Changed in epics-base:
status: New → Fix Committed
Andrew Johnson (anj) wrote :

My previous comment was incorrect: this bug has been present since 7.0.3.1 and should finally be fixed in the upcoming 7.0.5 release.

Changed in epics-base:
milestone: none → 7.0.5
importance: Undecided → Critical
Andrew Johnson (anj)
Changed in epics-base:
status: Fix Committed → Fix Released