CA monitoring of waveforms is unreliable because values are not buffered

Bug #1528812 reported by Ambroz Bizjak
Affects: EPICS Base
Status: Won't Fix
Importance: Wishlist
Assigned to: Ralph Lange
Milestone: (none)

Bug Description

When monitoring a waveform of more than one element, the IOC does not queue values in the event queue. Instead, when an event is generated (db_post_single_event_private), only the event thread is woken up; it later sends whatever value is in the record at that moment to the client (read_reply() in camessage.c is called with a null pfl). By that time, the value in the waveform record might have changed.
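
The effect can be reproduced in a toy model. The sketch below is plain Python, not EPICS code; `WaveformRecord` and `run` are hypothetical names. It assumes only what is described above: the event queue holds a marker without a payload, and the payload is read from the record when the event task eventually runs.

```python
from collections import deque


class WaveformRecord:
    """Toy stand-in for a waveform record: the array lives in the record itself."""

    def __init__(self):
        self.value = []


def run(writes, drain_after):
    """Apply `writes` to one record; the event task only runs after every
    `drain_after` writes. Returns what a monitoring client would receive."""
    record = WaveformRecord()
    event_queue = deque()  # holds "something changed" markers, no payload
    delivered = []

    def drain():
        while event_queue:
            event_queue.popleft()
            # The payload is read at send time, not at post time.
            delivered.append(list(record.value))

    for count, data in enumerate(writes, start=1):
        record.value = list(data)
        event_queue.append(None)
        if count % drain_after == 0:
            drain()
    drain()
    return delivered


prompt_event_task = run([[1, 1], [2, 2]], drain_after=1)
late_event_task = run([[1, 1], [2, 2]], drain_after=2)
```

When the event task keeps up (`drain_after=1`), the client receives both arrays. When it runs only after the second write (`drain_after=2`), the first array is never delivered and the second one appears in its place.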

This limitation is mentioned in the code, see http://bazaar.launchpad.net/~epics-core/epics-base/3.14/view/head:/src/db/dbEvent.c#L440 . However, I have not actually seen it documented anywhere outside the code (especially not in the CA reference manual), so I imagine many people would have an issue with this.

In our case, we are using a waveform record with asynchronous processing to send commands to a PLC and receive the responses. A client issues a CA-Put-Notify to the waveform with the request. The device support sends the request to the PLC and sets PACT, and when a response is later received, it completes asynchronous processing, writing the response data to that same record. The client which has sent a request monitors this record and recognizes the response to its own command (and ignores responses to commands issued by other clients).

Because we are using Put-Notify, we expected that this would work reliably even when multiple clients are sending commands at around the same time, since the requests would be serialized (as implemented in dbNotify.c). The serialization does indeed happen, but we still see problems when two requests are issued at the same time: when monitoring this PV, in the place where we should see the data for the first response, we instead see the data for the second request (followed by the second response). Apparently the queued second request managed to start (and be written into the record) before the monitor event for the first response was picked up by the event thread.

Changed in epics-base:
status: New → Confirmed
assignee: nobody → Ralph Lange (ralph-lange)
importance: Undecided → Wishlist
mdavidsaver (mdavidsaver) wrote :

The behavior you describe is a result of the fact that, in the process database, array data is stored in the record struct directly, so buffering it entails making copies. The dbfl_type_rec references are an attempt to avoid making some of those copies. While arguably not documented clearly enough (what is?), this is a widely known, and much lamented, behavior.

For Base <=3.14 series that's the end of the story.

With >=3.15 the server side filters feature offers a way to force the extra copy being avoided, and get buffering as with scalar values. I'll see about putting together an example.

Base series 3.15 also adds the put-process-get feature, which I think may be a better fit, but which I don't know as much about.

Ralph Lange (ralph-lange) wrote :

The event mechanism is part of the EPICS Database, its sources are located in .../src/db.

While being obviously closely related to Channel Access, it is clearly outside of CA's scope. There are other implementations of Channel Access servers (e.g. the Gateway, pcaspy, ...) where this does not apply.
Thus I don't see an imminent need for this behavior to be documented in the CA Reference Manual.

The Application Developers Guide (being the main documentation for the IOC and the proper place) mentions it in the last paragraph of its chapter "Channel Access Monitors" (15.6).

This behavior is widely known, and indeed: I don't know any user that particularly likes it.

However, "fixing" it is a strictly non-trivial task. Buffering arrays (think GB-size images) can introduce serious resource issues on legacy, small or embedded systems. Even for an IOC on a recent system, a network hiccup on a fast connection could blow up the server in seconds if it buffered the image streams.
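
To put a rough number on that concern, here is a back-of-the-envelope calculation. The frame size, rate and stall time are made-up values chosen for illustration, and `backlog_bytes` is a hypothetical helper, not anything from Base.

```python
def backlog_bytes(frame_bytes, frames_per_second, stall_seconds):
    """Bytes a server would hold if it queued every frame, without any limit,
    while one client is stalled."""
    return frame_bytes * frames_per_second * stall_seconds


# Hypothetical camera: 4 MiB frames at 100 Hz, client stalled for 5 seconds.
backlog = backlog_bytes(4 * 1024**2, 100, 5)
print(backlog / 1024**3, "GiB")
```

With these assumed numbers the backlog for a single stalled client reaches roughly 1.95 GiB after five seconds, which is why an unbounded queue is not a safe default.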

Such drastic changes in a very central part of the code (used many times a second on every existing EPICS IOC) will definitely not happen on the "stable" 3.14 release series.

EPICS 3.15 adds the "server side plugin" framework that allows user code plugins to be pulled into the event stream between the DB and Channel Access. This framework is especially designed to allow adding such features, without changing the behavior for all existing uses.

I'd be happy to review a 3.15 server-side plugin for inclusion in Base, once you have it working.

Ambroz Bizjak (ambrop7) wrote :

Hi Michael, Ralph,

Thank you for your comments. We'll try to figure out how to solve the issue for our project, and if that involves changes in Base, we'll be sure to send them to you.

Best regards,
Ambroz

Ralph Lange (ralph-lange) wrote :

Hi Ambroz,

Based on similar situations...
The easiest workaround might be adding a fixed delay after an answer from the PLC is received. This is a crappy fix, I know, as it fights a dynamic issue with a fixed time, and there are always worse situations (e.g. network hiccups) where it still fails. But if you can afford the additional delay in serializing a parallel access situation, it may help in >99% of the situations without much effort. (Make sure to add the delay *after* the transaction to avoid adding latency to every access.)
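
Why a fixed delay helps, and why it can still fail, can be seen in a small timing model. This is an illustrative sketch only: `delivered_for_first_response` is a hypothetical function, times are in seconds, and it assumes the next queued request is written into the record as soon as the hold delay after the first response has elapsed.

```python
def delivered_for_first_response(event_latency, hold_delay):
    """Which record content the monitor delivers for the first response.

    t = 0 is the moment the first response is written into the record.
    The hold delay is applied after the transaction, so a lone request
    gets its response without added latency.
    """
    next_write_time = hold_delay  # queued request starts once the hold ends
    read_time = event_latency     # event task reads the record this late
    if read_time < next_write_time:
        return "response 1"
    return "request 2"


no_delay = delivered_for_first_response(event_latency=0.002, hold_delay=0.0)
with_delay = delivered_for_first_response(event_latency=0.002, hold_delay=0.010)
bad_day = delivered_for_first_response(event_latency=0.050, hold_delay=0.010)
```

In this model, `no_delay` and `bad_day` both yield the second request in place of the first response, while `with_delay` yields the first response: the delay only helps as long as the event task is faster than the chosen hold time.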

But I would really encourage you to write a 3.15 server-side plugin that adds a queue. The existing plugins in base (notably deadband and sub-array) should give you a good idea of what you will have to do. Should be doable in a few hundred lines of code.
Once it works for array and scalar data, and the client can configure the queue depth it wants to use, it's definitely interesting for inclusion in base.
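
As a conceptual model of such a queue, the following Python sketch copies each posted value into a bounded buffer with a client-chosen depth. It does not use the 3.15 plugin API: `QueueingSubscription`, `post`, `drain` and the drop-oldest policy on overflow are all assumptions made for illustration.

```python
from collections import deque


class QueueingSubscription:
    """Per-client queue that snapshots values at post time."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self._pending = deque(maxlen=depth)  # bounded, so memory stays capped
        self.dropped = 0

    def post(self, value):
        if len(self._pending) == self._pending.maxlen:
            self.dropped += 1  # the oldest snapshot is about to be discarded
        # Copy arrays so later writes to the record cannot alter queued data;
        # scalars are stored as they are.
        snapshot = list(value) if isinstance(value, (list, tuple)) else value
        self._pending.append(snapshot)

    def drain(self):
        """Return the queued snapshots in order and empty the queue."""
        out = list(self._pending)
        self._pending.clear()
        return out


record_value = [1, 1]
sub = QueueingSubscription(depth=2)
sub.post(record_value)
record_value[:] = [2, 2]  # record is overwritten before the event task runs
sub.post(record_value)
received = sub.drain()
```

Here `received` contains both arrays in order, because each one was copied when it was posted. The fixed depth bounds the memory used per subscription, at the price of dropping the oldest values when a client falls too far behind.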

Andrew Johnson (anj)
Changed in epics-base:
status: Confirmed → Won't Fix