Comment 4 for bug 1868486

rivers (rivers) wrote :

I understand that Michael does not have time to fix this problem, but I really think it should be fixed. Although no code in base 7.0.3.1 uses the timeout variant of epicsMessageQueue::receive(), plenty of places in epics-modules and areaDetector do a receive with a timeout:

corvette:~/devel>find . -name '*.c*' -type f -exec grep -H ReceiveWithTimeout {} \;
./mca-7-8/mcaApp/CanberraSrc/nmc_comm_subs_1.c: len = epicsMessageQueueReceiveWithTimeout(m->responseQ, pkt, sizeof(*pkt), (double)(i->timeout_time/1000.));
./asyn-4-39/asyn/drvAsynUSBTMC/drvAsynUSBTMC.c: s = epicsMessageQueueReceiveWithTimeout(
./ipUnidig-2-11/ipUnidigApp/src/drvIpUnidig.cpp: status = epicsMessageQueueReceiveWithTimeout(msgQId_,
./autosave-5-10/asApp/src/save_restore.c: while (epicsMessageQueueReceiveWithTimeout(opMsgQueue, (void*) &msg, OP_MSG_SIZE, (double)MIN_DELAY) >= 0) {
./caputRecorder-1-7-2/caputRecorderApp/src/caputRecorder.c: msg_size = epicsMessageQueueReceiveWithTimeout(caputRecorderMsgQueue, (void *)pmsg, MSG_SIZE, 5.0);
./softGlue-2-8-2/softGlueApp/src/drvIP_EP201.c: if (epicsMessageQueueReceiveWithTimeout(pPvt->msgQId,
./areaDetector-3-9/aravisGigE/aravisGigEApp/src/aravisCamera.cpp: if (epicsMessageQueueReceiveWithTimeout(this->msgQId, &buffer, sizeof(&buffer), 0.005) == -1) {
./areaDetector-3-9/ADAravis/aravisApp/src/ADAravis.cpp: if (epicsMessageQueueReceiveWithTimeout(this->msgQId, &buffer, sizeof(&buffer), 0.005) == -1) {

These are the places where I believe epicsMessageQueue::receive() is called with a timeout:

corvette:~/devel>find . -name '*.c*' -type f -exec grep -H 'receive(' {} \;
./dante-1-0/danteApp/danteSrc/dante.cpp: numRecv = msgQ_->receive(&message, sizeof(message), MESSAGE_TIMEOUT);
./areaDetector-3-9/ADCore/ADApp/pluginSrc/NDPluginDriver.cpp: numBytes = pFromThreadMsgQ_->receive(&fromMsg, sizeof(fromMsg), 2.0);
./areaDetector-3-9/ADCore/ADApp/pluginSrc/NDPluginDriver.cpp: numBytes = pFromThreadMsgQ_->receive(&fromMsg, sizeof(fromMsg), 2.0);

The bug is such that if the time between messages is ever the same as the timeout, there is a significant probability that the message will be lost. This can lead to unreliable behavior and to problems that are difficult to track down.
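
To make the expected behavior concrete, here is a minimal sketch of a timed receive that cannot drop a message arriving right at the timeout. This is not the EPICS base implementation and not a proposed patch; SketchQueue and its members are hypothetical names, and it uses only the C++ standard library. The point it illustrates is that the "is there a message?" check is repeated with the mutex held after the wait expires, so a message is either returned to this caller or left in the queue for the next call.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a message queue; NOT the EPICS base code.
class SketchQueue {
public:
    // Enqueue a copy of the message and wake one waiting receiver.
    void send(const void *msg, std::size_t size) {
        assert(msg != nullptr);
        const char *p = static_cast<const char *>(msg);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            messages_.emplace_back(p, p + size);
        }
        ready_.notify_one();
    }

    // Returns the number of bytes received, or -1 on timeout.
    // Also returns -1 if the buffer is too small; in that case the
    // message is left in the queue (a choice made for this sketch).
    int receive(void *buf, std::size_t maxSize, double timeoutSeconds) {
        assert(buf != nullptr);
        std::unique_lock<std::mutex> lock(mutex_);
        // wait_for with a predicate re-evaluates the predicate with the
        // mutex held when the timeout expires, so a message that arrived
        // at the same instant as the timeout is still seen here.
        const bool haveMessage = ready_.wait_for(
            lock, std::chrono::duration<double>(timeoutSeconds),
            [this] { return !messages_.empty(); });
        if (!haveMessage) {
            return -1;  // genuine timeout: nothing was consumed
        }
        const std::vector<char> &front = messages_.front();
        if (front.size() > maxSize) {
            return -1;  // buffer too small: message stays queued
        }
        const int size = static_cast<int>(front.size());
        std::copy(front.begin(), front.end(), static_cast<char *>(buf));
        messages_.pop_front();
        return size;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::vector<char>> messages_;
};
```

With a queue like this, a caller that loops on receive() with a timeout equal to the sender's period sees some calls return -1 and others return a message, but never loses one.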

I spent 2 days isolating this problem to EPICS base. Since the symptom was that frames were being dropped on a FLIR camera, my first suspects were the hardware, the areaDetector driver, and the vendor library, in that order. It took a long time to realize that the problem was in EPICS base.