epicsMessageQueue lost messages

Bug #1868486 reported by mdavidsaver
This bug affects 1 person
Affects: EPICS Base (status tracked in 7.0)
  3.15  Fix Released  Importance: Undecided  Assigned to: Andrew Johnson
  7.0   Fix Released  Importance: High       Assigned to: Andrew Johnson

Bug Description

https://epics.anl.gov/core-talk/2020/msg00396.php

Mark Rivers observed epicsMessageQueue losing messages.

https://epics.anl.gov/core-talk/2020/msg00408.php

> I think I see the logic error in how the eventSent flag is handled,
> specifically related to the fact that epicsEvent is a semaphore as
> opposed to a condition variable.
>
> This allows a "race" to occur if the first/only waiting receiver
> times out, and epicsEventWaitWithTimeout() returns, while a sender
> is in epicsMessageQueueSend() preparing to wake up a receiver.
>
> This results in a situation where the sender has set the eventSent
> flag of, and indeed copied a message to the buffer of, a thread
> which has decided to abort.
>
> After timing out, the receiver sees the timeout and returns -1
> even though eventSent has been set. This can be trapped with:
>
> > b osdMessageQueue.cpp:358 if status!=0 && threadNode.eventSent
>
> So here is your lost message.
>
> Now when epicsMessageQueueReceiveWithTimeout() is called again, no message
> is waiting in the queue, so epicsEventWaitWithTimeout() is called.
> Since the semaphore is already set, this returns immediately with status==0,
> but this is a spurious wakeup and the eventSent flag is not set.
>
> And here is the second "timeout".

Line numbers circa 7.0.3.1

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

This issue has been present in all versions of epicsMessageQueue.

I see two fixes which could combine to address this issue in myReceive().
Similar changes will likely be required in mySend() as it also supports timeout.

1. Return success in threadNode.eventSent regardless of timeout status.

2. Handle the case where getEventNode() returns an epicsEvent which
   has already been triggered. Maybe getEventNode() calls epicsEventTryWait() ?

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I'm not planning to take the time to fix this issue.

Given that this is such a long-standing issue, another course of action would be to remove/deprecate the *WithTimeout() variants for send and receive, perhaps leaving stubs which ignore the timeout.

Revision history for this message
Andrew Johnson (anj) wrote :

> This issue has been present in all versions of epicsMessageQueue.

Correction: This issue is present in all of the "default" implementations of osdMessageQueue. We don't know whether it's present in the VxWorks or RTEMS implementations, which are fundamentally different code.

Note: If anyone starts working on this, please do so on the 3.15 branch of Base which is where we'll want to commit any fixes.

I will unfortunately not be able to run tests on VxWorks until Argonne moves back from its current min-safe operating condition, as my office VxWorks systems are turned off and I'm not allowed into the lab to turn them on. It might be worthwhile to try running Mark's test program on the RTEMS-qemu target if someone has the desire to do that (or to merge his code into epicsMessageQueueTest.cpp and run that if it's easier that way).

Revision history for this message
rivers (rivers) wrote :

I understand that Michael does not have the time to fix this problem. But I really think it should be fixed. While there is no code in base 7.0.3.1 that uses the timeout variant of epicsMessageQueue::receive() there are plenty of places in epics-modules and areaDetector that do:

corvette:~/devel>find . -name '*.c*' -type f -exec grep -H ReceiveWithTimeout {} \;
./mca-7-8/mcaApp/CanberraSrc/nmc_comm_subs_1.c: len = epicsMessageQueueReceiveWithTimeout(m->responseQ, pkt, sizeof(*pkt), (double)(i->timeout_time/1000.));
./asyn-4-39/asyn/drvAsynUSBTMC/drvAsynUSBTMC.c: s = epicsMessageQueueReceiveWithTimeout(
./ipUnidig-2-11/ipUnidigApp/src/drvIpUnidig.cpp: status = epicsMessageQueueReceiveWithTimeout(msgQId_,
./autosave-5-10/asApp/src/save_restore.c: while (epicsMessageQueueReceiveWithTimeout(opMsgQueue, (void*) &msg, OP_MSG_SIZE, (double)MIN_DELAY) >= 0) {
./caputRecorder-1-7-2/caputRecorderApp/src/caputRecorder.c: msg_size = epicsMessageQueueReceiveWithTimeout(caputRecorderMsgQueue, (void *)pmsg, MSG_SIZE, 5.0);
./softGlue-2-8-2/softGlueApp/src/drvIP_EP201.c: if (epicsMessageQueueReceiveWithTimeout(pPvt->msgQId,
./areaDetector-3-9/aravisGigE/aravisGigEApp/src/aravisCamera.cpp: if (epicsMessageQueueReceiveWithTimeout(this->msgQId, &buffer, sizeof(&buffer), 0.005) == -1) {
./areaDetector-3-9/ADAravis/aravisApp/src/ADAravis.cpp: if (epicsMessageQueueReceiveWithTimeout(this->msgQId, &buffer, sizeof(&buffer), 0.005) == -1) {

This is where I believe epicsMessageQueue::receive() with a timeout is used:

corvette:~/devel>find . -name '*.c*' -type f -exec grep -H 'receive(' {} \;
./dante-1-0/danteApp/danteSrc/dante.cpp: numRecv = msgQ_->receive(&message, sizeof(message), MESSAGE_TIMEOUT);
./areaDetector-3-9/ADCore/ADApp/pluginSrc/NDPluginDriver.cpp: numBytes = pFromThreadMsgQ_->receive(&fromMsg, sizeof(fromMsg), 2.0);
./areaDetector-3-9/ADCore/ADApp/pluginSrc/NDPluginDriver.cpp: numBytes = pFromThreadMsgQ_->receive(&fromMsg, sizeof(fromMsg), 2.0);

The bug is such that if the time between messages is ever close to the timeout there is a significant probability that a message will be lost. This can lead to unreliable behavior and difficult-to-track-down problems.

I spent 2 days isolating this problem to EPICS base. Since the symptom was that frames were being dropped on a FLIR camera my first suspects were the hardware, the areaDetector driver, and the vendor library respectively. It took a long time to realize the problem was with EPICS base.

Revision history for this message
Andrew Johnson (anj) wrote :

Hi Mark, I agree that this should be fixed, it's just a matter of who can do it and when. That's why I marked this as a High priority bug.

Do you have the ability to run code on VxWorks at the moment? I don't since my office VME crate is turned off. The VxWorks targets use a completely different implementation based on the VxWorks API that I believe Marty used as the basis for the API for epicsMessageQueue, and RTEMS has a different one again, so I suspect your test might succeed on VxWorks. If you wanted to add your tests to epicsMessageQueueTest.c we'd certainly appreciate the help, and that might speed up someone fixing this.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

As an optimization, the (os/default/) epicsMessageQueue code will, when one party (send or recv) finds that the other party is already waiting, copy directly to the waiter's buffer. This avoids copying messages twice, but comes at a price in terms of complexity. This optimization drives the need for each party to be tracked individually, and to have a separate epicsEvent.

Given that the usage I've seen is with small messages, I wonder whether this optimization is worthwhile. Eliminating it would allow the use of fewer epicsEvents (probably two), and avoid the need to track parties individually. (This could be similar to what I've done with epicsThreadPool.)

When fixing this issue, it would be worth thinking about a simplification.

Revision history for this message
rivers (rivers) wrote :

Hi Andrew,

Thanks for your comment, I only just saw it. For some reason when other people add comments here (e.g. Michael) I receive an e-mail about it. But when you write a comment I do not receive an e-mail. I have seen the same thing previously. Any idea why?

> Do you have the ability to run code on VxWorks at the moment?

Yes, I can test on vxWorks.

I just came up with a very simple patch that almost fixes the problem.

diff --git a/modules/libcom/src/osi/os/default/osdMessageQueue.cpp b/modules/libcom/src/osi/os/default/osdMessageQueue.cpp
index c86d8cc..e0d10be 100644
--- a/modules/libcom/src/osi/os/default/osdMessageQueue.cpp
+++ b/modules/libcom/src/osi/os/default/osdMessageQueue.cpp
@@ -339,8 +339,10 @@ myReceive(epicsMessageQueueId pmsg, void *message, unsigned int size,
     epicsMutexUnlock(pmsg->mutex);
 
     epicsEventStatus status;
-    if (timeout > 0)
+    if (timeout > 0) {
         status = epicsEventWaitWithTimeout(threadNode.evp->event, timeout);
+        if (status != epicsEventWaitOK) status = epicsEventTryWait(threadNode.evp->event);
+    }
     else
         status = epicsEventWait(threadNode.evp->event);

This allows for the fact that there is a very brief window when the timeout can occur but an event has actually been signaled. It just calls epicsEventTryWait to detect this case.

Here is the output of running the attached test program 10,000 loops, delay time=0.01 before the patch.
Loops=10000, numSent=10000, numReceived=9770

So it failed 2.3% of the time.

Here is the output of running the test program 10,000 loops after the patch:
Loops=10000, numSent=10000, numReceived=9999

So it failed 0.01% of the time, or 230 times less than before the patch. But unfortunately it did fail, so it is not a solution.

Revision history for this message
Ben Franksen (bfrk) wrote :

I think the bug is to rely on the return value of epicsEventWaitWithTimeout. Try the following patch. A similar fix may be needed in function mySend.

diff --git a/modules/libcom/src/osi/os/default/osdMessageQueue.cpp b/modules/libcom/src/osi/os/default/osdMessageQueue.cpp
index c86d8cc2b..eb615a8d0 100644
--- a/modules/libcom/src/osi/os/default/osdMessageQueue.cpp
+++ b/modules/libcom/src/osi/os/default/osdMessageQueue.cpp
@@ -352,8 +352,9 @@ myReceive(epicsMessageQueueId pmsg, void *message, unsigned int size,
 
     epicsMutexUnlock(pmsg->mutex);
 
-    if (threadNode.eventSent && (threadNode.size <= size) &&
-        status == epicsEventOK)
+    // even if status == epicsEventTimeout it may still be that we
+    // have received a message
+    if (threadNode.eventSent && threadNode.size <= size && status != epicsEventError)
         return threadNode.size;
     return -1;
 }

Revision history for this message
rivers (rivers) wrote :

I have modified libcom/test/epicsMessageQueueTest.cpp to add 2 new tests. Both tests send 1000 messages to a new receiver thread. The receiver thread calls q->receive() with a 0.01 second timeout. In the first test there is an epicsThreadSleep of 0.011 seconds in the q->send() loop. In the second test the sleep is 0.010 seconds.

These are the important results:
# Single receiver with timeout, single sender with sleep tests:
# These tests last 20 seconds ...
ok 57 - sleep=0.011
not ok 58 - sleep=0.010
# numSent should be 1000, actual=1000, numReceived should be 1000, actual=935

Revision history for this message
rivers (rivers) wrote :

This is my patch to epicsMessageQueueTest.cpp. I could not put it in the previous message because it only allows one attachment.

Note that I had to add a second message queue, because my receiver thread would interfere with subsequent tests on the q1 queue.

Revision history for this message
rivers (rivers) wrote :

I think the mail problem is solved. I was not subscribed to receive e-mail about this bug. I have fixed that.

Revision history for this message
Ben Franksen (bfrk) wrote :

I have applied your test patch. It fails on my machine as expected; the results are even worse:

not ok 58 - sleep=0.010
# numSent should be 1000, actual=1000, numReceived should be 1000, actual=246

With my very simple patch the test succeeds.

Revision history for this message
Ben Franksen (bfrk) wrote :

Sorry for the botched pasting.

Before:

# These tests last 20 seconds ...
ok 57 - sleep=0.011
not ok 58 - sleep=0.010
# numSent should be 1000, actual=1000, numReceived should be 1000, actual=246

With my patch:

# These tests last 20 seconds ...
ok 57 - sleep=0.011
ok 58 - sleep=0.010

Revision history for this message
rivers (rivers) wrote :

I also tested your simple patch with the modified epicsMessageQueueTest.cpp and it succeeded for me as well.

I then tested with the test program I attached previously. I tested with 50,000 loops, 0.01 second delay, which takes 500 seconds. This also succeeds with Ben's patch, while it had a 3-5% failure rate with the 7.0.3.1 release.

Loops=50000, numSent=50000, numReceived=50000

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Does Ben's change also eliminate the spurious timeouts?

Revision history for this message
rivers (rivers) wrote :

> Does Ben's change also eliminate the spurious timeouts?

Yes, I believe so. I now only see single timeouts, never two in a row. These single timeouts are normal, because for this test the sleep time of the sender and the receive timeout are both 0.01 seconds, so there is some probability of a real timeout. When receive size is -1 now (a timeout) the logged value is the previous value; it is never the next value.

2020/03/27 08:03:49.266 sent value=964
2020/03/27 08:03:49.266 received value=964
2020/03/27 08:03:49.276 sent value=965
2020/03/27 08:03:49.277 received value=965
2020/03/27 08:03:49.287 sent value=966
2020/03/27 08:03:49.287 received value=966
2020/03/27 08:03:49.297 sent value=967
2020/03/27 08:03:49.297 received value=967
2020/03/27 08:03:49.307 receive size=-1 (value=967)
2020/03/27 08:03:49.307 received value=968
2020/03/27 08:03:49.307 sent value=968
2020/03/27 08:03:49.317 receive size=-1 (value=968)
2020/03/27 08:03:49.317 sent value=969
2020/03/27 08:03:49.317 received value=969
2020/03/27 08:03:49.327 sent value=970
2020/03/27 08:03:49.327 received value=970
2020/03/27 08:03:49.337 receive size=-1 (value=970)
2020/03/27 08:03:49.337 received value=971
2020/03/27 08:03:49.337 sent value=971
2020/03/27 08:03:49.347 receive size=-1 (value=971)
2020/03/27 08:03:49.347 sent value=972
2020/03/27 08:03:49.347 received value=972
2020/03/27 08:03:49.357 receive size=-1 (value=972)
2020/03/27 08:03:49.357 received value=973
2020/03/27 08:03:49.357 sent value=973
2020/03/27 08:03:49.367 receive size=-1 (value=973)
2020/03/27 08:03:49.368 sent value=974
2020/03/27 08:03:49.368 received value=974
2020/03/27 08:03:49.378 sent value=975
2020/03/27 08:03:49.378 received value=975
2020/03/27 08:03:49.388 sent value=976
2020/03/27 08:03:49.388 received value=976
2020/03/27 08:03:49.398 sent value=977
2020/03/27 08:03:49.398 received value=977
2020/03/27 08:03:49.408 sent value=978
2020/03/27 08:03:49.408 received value=978
2020/03/27 08:03:49.418 sent value=979
2020/03/27 08:03:49.418 received value=979
2020/03/27 08:03:49.428 sent value=980
2020/03/27 08:03:49.428 received value=980
2020/03/27 08:03:49.428 receive size=-1 (value=980)
2020/03/27 08:03:49.438 sent value=981
2020/03/27 08:03:49.438 received value=981
2020/03/27 08:03:49.448 sent value=982
2020/03/27 08:03:49.448 received value=982

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Where was Ben's patch posted? I remember reading it, but I don't see a Merge Request or an attachment to this issue. And I can't find it in my old email.

Revision history for this message
rivers (rivers) wrote :

Ben's patch is in message #8 in this bug.

Revision history for this message
Andrew Johnson (anj) wrote :

Mark's tests fail for me as expected when added to the 3.15 branch on both MacOS and Linux, but not on RTEMS-pc386-qemu. RTEMS uses a different implementation of osdMessageQueue, as does VxWorks. Applying a reversed version of Ben's patch then results in some build warnings (status is never used) which I will clean up, but no test failures.

In comment #8 above Ben suggested:

> A similar fix may be needed in function mySend.

My guess is that the equivalent situation for sends would be that the queue is usually full, and the timeout of a send request that is waiting to be queued expires at about the same time that the receiver takes an item off and wakes up the sender. This is probably a much less common scenario, but until someone writes a test that proves we have a bug there I don't particularly want to make the other change.

For now I will not do any more than is described above. Additional tests proving the other change is also required would be welcomed.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

> Ben's patch is in message #8 in this bug.

For some reason I remembered it being larger.

This patch is not a complete solution for the RX side issue as it still
allows triggered/signaled epicsEvents to be returned to the free-list.
So the spurious timeouts will continue. It doesn't address the TX
side issues at all. @Andrew, the TX side has the same design, and thus
the same problems.

Still, this patch is a definite improvement, and I think worth applying.

Revision history for this message
Andrew Johnson (anj) wrote :

IIUC the fix to the TX-side issue should look something like the attached patch. We don't have any tests that express the fault that this should fix, but the result does (still) pass epicsMessageQueueTest.

However I spent yesterday afternoon looking at osi/os/default/osdMessageQueue.cpp properly and I have a more extensive solution that I think fixes the remaining issues too. I wrote some tests to check the TX-side problems but unfortunately they don't seem to trigger any bugs other than possibly in the test code itself. I will put together a PR with what I have so far in case anyone else wants to take a look.

Revision history for this message
Andrew Johnson (anj) wrote :

Merged PR.

Revision history for this message
Andrew Johnson (anj) wrote :

AFAIK the Github PR #73 fixed all the faults we could find in the original code, and added tests. This PR was merged into 3.15.8 and marked as Fix Released there; I just merged the 3.15 branch into 7.0 so this fix has been incorporated and I'm adding the 7.0.3.2 milestone back again.

Unfortunately the RTEMS implementation of osdMessageQueue.cpp seems to be failing the new epicsMessageQueue tests, but that's a different issue since it's a different implementation...
