EPICS 3.13 clients cannot connect to EPICS 7 or 3.16

Bug #1971737 reported by Dirk Zimoch
Affects: EPICS Base
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Since commit 95fd255d "rsrv: ignore CA client version older than v4.4" from Jul 30 2016, EPICS 3.13 clients cannot connect any more, even though their CA version is V4.8.

The reason is that camessage drops the incoming CA_PROTO_SEARCH messages because the 3.13 client has not yet sent a CA_PROTO_VERSION message and thus the code reads client->minor_version_number as 0.
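The failure mode described above can be modelled in a few lines. This is a minimal Python sketch, not the rsrv C code: the `Client` class, the `handle_datagram` function and the `(command, count)` tuples are made up for illustration, and the threshold of 4 is an assumption taken from the commit title. The command numbers 0 and 6 are the cmmd values that show up in the CASDEBUG logs later in this report.

```python
from dataclasses import dataclass

CA_PROTO_VERSION = 0     # cmmd=0 in the CASDEBUG logs
CA_PROTO_SEARCH = 6      # cmmd=6 in the CASDEBUG logs
MIN_SUPPORTED_MINOR = 4  # assumption: "older than v4.4" from the commit title


@dataclass
class Client:
    # Hypothetical stand-in for the per-client state; starts at 0
    # until a CA_PROTO_VERSION message has been seen.
    minor_version_number: int = 0


def handle_datagram(messages):
    """Return the commands that survive the version check.

    `messages` is a list of (command, count) tuples for one datagram.
    """
    client = Client()
    accepted = []
    for cmmd, count in messages:
        if cmmd == CA_PROTO_VERSION:
            client.minor_version_number = count
            accepted.append(cmmd)
            continue
        if client.minor_version_number < MIN_SUPPORTED_MINOR:
            continue  # dropped without any log output
        accepted.append(cmmd)
    return accepted


# 3.14+ style: version message first, then the search.
new_client = [(CA_PROTO_VERSION, 13), (CA_PROTO_SEARCH, 13)]
# 3.13 style: search only, the minor version 8 is never looked at.
old_client = [(CA_PROTO_SEARCH, 8)]

print(handle_datagram(new_client))
print(handle_datagram(old_client))
```

In this model the 3.13 search is dropped although it announces minor version 8, because the check only consults the per-client state.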

This is a huge hindrance for upgrading 3.14 IOCs to EPICS 7 while there are still 3.13 IOCs (which cannot be upgraded yet) in the network that need to connect.

mdavidsaver (mdavidsaver) wrote :

This issue was discussed on tech-talk in March 2022.

https://epics.anl.gov/tech-talk/2022/msg00297.php

Wherein I reference a call for test results I made in March 2021 (PSI wasn't among the responders).

https://epics.anl.gov/tech-talk/2021/msg00484.php

And yes, going back to one of my changes merged in 2017

https://code.launchpad.net/~epics-core/epics-base/camodern/+merge/306371

mdavidsaver (mdavidsaver) wrote :

> ... which cannot be upgraded yet ...

@Dirk, are you able to talk about PSI's plan and timetable to upgrade or decommission these <= 3.13 IOCs?

Speaking only for myself, I would be more willing to spend time on these sorts of issues if I knew there was an end date.

Dirk Zimoch (dirk.zimoch) wrote :

The 3.13 IOCs are very old MV2300s with only 32 MB RAM, of which 28 MB are in use. The larger record size in 3.14+ is overloading them. There are plans to remove them in the SLS-2 upgrade in a few years, so nobody is interested in upgrading their hardware right now. But I would like to migrate as many other IOCs as possible from 3.14 to 7 before the SLS-2 upgrade.
I was thinking of reverting those parts of your modifications that prevent the 3.13 IOCs from connecting. The commit message suggests that it was not your intention to remove support for 3.13 clients but for much older versions (3.12? 2.x?).
I am considering doing that during the codeathon.

Dirk Zimoch (dirk.zimoch) wrote :

I just read the tech-talk post. I have no packet capture, but I have added printf debugging to camessage. I can see that the 3.14+ search requests contain a CA_PROTO_VERSION message first, before the CA_PROTO_SEARCH. 3.13 host tools (ca_test) start straight with CA_PROTO_SEARCH. Their requests get "silently ignored". I plan to check whether a 3.13 IOC does the same. I guess so, as the links never connect and the EPICS 7 IOC prints no "too old" message but stays silent.

mdavidsaver (mdavidsaver) wrote :

> ... I have no packet capture but I have added printf debugging to camessage. ...

Could you record some? It would help me (and others) to avoid further unintended breakage if there were some "as built" records of CA from the era before IOCs could run on Linux. Otherwise, I'm left using the conditionals in the source code to guess how packets were structured.

> ... There are plans to remove them in the SLS-2 upgrade in a few years ...

Were I in your place, I would move these ancient IOCs behind a cagateway (built against 3.14) and then try my best to forget about them.

Alternately, maybe we can collect donations next week to start a 3.13 retirement fund?

Dirk Zimoch (dirk.zimoch) wrote :

If you need a 3.13 client, you can compile a fresh 3.13.10 host installation. Is fun. I did it recently. (Together with a target compilation for VxWorks 6.9. Even more fun.)

mdavidsaver (mdavidsaver) wrote :

> ... you can compile a fresh 3.13.10 host installation. Is fun. ...

I'll leave this archeology to those who already have a pick in hand. I have enough fun with < 3.14.12. Last I checked, this was the oldest release which I could build without manual patching.

Anyway, the traffic I'm most interested in recording is between an old client and an old IOC.

Dirk Zimoch (dirk.zimoch) wrote :

This is the log of an EPICS 7.0.6 IOC with 'var CASDEBUG 3' when contacted by the ca_test utility:
7.0.6 client:
CAS: cast server msg of 48 bytes from addr 127.0.0.1:33835
CAS: Request from 127.0.0.1:33835 => cmmd=0 cid=0x1 type=1 count=13 postsize=0
CAS: Request from 127.0.0.1:33835 => available=0x0 N=1 paddr=(nil)
CAS: Request from 127.0.0.1:33835 => cmmd=6 cid=0x1 type=5 count=13 postsize=16
CAS: Request from 127.0.0.1:33835 => available=0x1 N=2 paddr=(nil)
CAS: Sending a udp message of 40 bytes
CAS: conn req from 127.0.0.1:58648
CAS: Request from 127.0.0.1:58648 => cmmd=0 cid=0x0 type=0 count=13 postsize=0
CAS: Request from 127.0.0.1:58648 => available=0x0 N=1 paddr=(nil)
CAS: Request from 127.0.0.1:58648 => cmmd=20 cid=0x0 type=0 count=0 postsize=8
CAS: Request from 127.0.0.1:58648 => available=0x0 N=2 paddr=(nil)
CAS: Request from 127.0.0.1:58648 => cmmd=21 cid=0x0 type=0 count=0 postsize=8
CAS: Request from 127.0.0.1:58648 => available=0x0 N=3 paddr=(nil)
CAS: Request from 127.0.0.1:58648 => cmmd=18 cid=0x1 type=0 count=0 postsize=16
CAS: Request from 127.0.0.1:58648 => available=0xd N=4 paddr=(nil)
CAS: Sending a message of 48 bytes
CAS: Request from 127.0.0.1:58648 => cmmd=15 cid=0x7 type=0 count=2 postsize=0
CAS: Request from 127.0.0.1:58648 => available=0x1 N=1 paddr=0x7fca5c012178
CAS: Request from 127.0.0.1:58648 => cmmd=15 cid=0x7 type=7 count=2 postsize=0
CAS: Request from 127.0.0.1:58648 => available=0x2 N=2 paddr=0x7fca5c012178
CAS: Request from 127.0.0.1:58648 => cmmd=15 cid=0x7 type=14 count=2 postsize=0
CAS: Request from 127.0.0.1:58648 => available=0x3 N=3 paddr=0x7fca5c012178
CAS: Request from 127.0.0.1:58648 => cmmd=15 cid=0x7 type=21 count=2 postsize=0
CAS: Request from 127.0.0.1:58648 => available=0x4 N=4 paddr=0x7fca5c012178
CAS: Request from 127.0.0.1:58648 => cmmd=15 cid=0x7 type=28 count=2 postsize=0
CAS: Request from 127.0.0.1:58648 => available=0x5 N=5 paddr=0x7fca5c012178
CAS: Sending a message of 520 bytes
CAS: Request from 127.0.0.1:58648 => cmmd=12 cid=0x7 type=0 count=0 postsize=0
CAS: Request from 127.0.0.1:58648 => available=0x1 N=1 paddr=0x7fca5c012178
CAS: Sending a message of 16 bytes
CAS: nill message disconnect ( 8 bytes request )
CAS: Connection 22 Terminated

3.14.12 client:
CAS: Request from 127.0.0.1:52427 => cmmd=0 cid=0x1 type=1 count=13 postsize=0
CAS: Request from 127.0.0.1:52427 => available=0x0 N=1 paddr=(nil)
CAS: Request from 127.0.0.1:52427 => cmmd=6 cid=0x1 type=5 count=13 postsize=16
CAS: Request from 127.0.0.1:52427 => available=0x1 N=2 paddr=(nil)
CAS: Sending a udp message of 40 bytes
CAS: conn req from 127.0.0.1:58650
CAS: Request from 127.0.0.1:58650 => cmmd=0 cid=0x0 type=0 count=13 postsize=0
CAS: Request from 127.0.0.1:58650 => available=0x0 N=1 paddr=(nil)
CAS: Request from 127.0.0.1:58650 => cmmd=20 cid=0x0 type=0 count=0 postsize=8
CAS: Request from 127.0.0.1:58650 => available=0x0 N=2 paddr=(nil)
CAS: Request from 127.0.0.1:58650 => cmmd=21 cid=0x0 type=0 count=0 postsize=8
CAS: Request from 127.0.0.1:58650 => available=0x0 N=3 paddr=(nil)
CAS: Request from 127.0.0.1:58650 => cmmd=18 cid=0x...

Dirk Zimoch (dirk.zimoch) wrote :

Here is the log of a 3.14.12 IOC connecting to another 3.14.12 IOC:
CAS: cast server msg of 48 bytes from addr 172.20.3.200:59354
CAS: Request from 172.20.3.200:59354 => cmmd=0 cid=0x1 type=1 count=13 postsize=0
CAS: Request from 172.20.3.200:59354 => available=0x0 N=1 paddr=(nil)
CAS: Request from 172.20.3.200:59354 => cmmd=6 cid=0x1 type=5 count=13 postsize=16
CAS: Request from 172.20.3.200:59354 => available=0x1 N=2 paddr=(nil)
CAS: Sending a udp message of 40 bytes
CAS: conn req from 172.20.3.200:58144
CAS: Request from 172.20.3.200:58144 => cmmd=0 cid=0x0 type=80 count=13 postsize=0
CAS: Request from 172.20.3.200:58144 => available=0x0 N=1 paddr=(nil)
CAS: Request from 172.20.3.200:58144 => cmmd=20 cid=0x0 type=0 count=0 postsize=8
CAS: Request from 172.20.3.200:58144 => available=0x0 N=2 paddr=(nil)
CAS: Request from 172.20.3.200:58144 => cmmd=21 cid=0x0 type=0 count=0 postsize=16
CAS: Request from 172.20.3.200:58144 => available=0x0 N=3 paddr=(nil)
CAS: Request from 172.20.3.200:58144 => cmmd=18 cid=0x1 type=0 count=0 postsize=16
CAS: Request from 172.20.3.200:58144 => available=0xd N=4 paddr=(nil)
CAS: Sending a message of 48 bytes
CAS: Request from 172.20.3.200:58144 => cmmd=15 cid=0x0 type=34 count=1 postsize=0
CAS: Request from 172.20.3.200:58144 => available=0x1 N=1 paddr=0x7fd0bc010908
CAS: Request from 172.20.3.200:58144 => cmmd=1 cid=0x0 type=20 count=1 postsize=16
CAS: Request from 172.20.3.200:58144 => available=0x2 N=2 paddr=0x7fd0bc010908
CAS: Sending a message of 104 bytes
CAS: Sending a message of 40 bytes
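
For readers comparing these logs, a small helper can translate the cmmd numbers into message names. This is a sketch under assumptions: the `annotate` function and the regular expression are made up for this thread, the line format is taken from the logs pasted here, and the table lists only the command numbers that actually appear in them, with names as I understand them from the CA protocol definitions.

```python
import re

# Only the command numbers that occur in the logs of this report.
CA_COMMANDS = {
    0: "CA_PROTO_VERSION",
    1: "CA_PROTO_EVENT_ADD",
    6: "CA_PROTO_SEARCH",
    12: "CA_PROTO_CLEAR_CHANNEL",
    13: "CA_PROTO_RSRV_IS_UP",
    15: "CA_PROTO_READ_NOTIFY",
    18: "CA_PROTO_CREATE_CHAN",
    20: "CA_PROTO_CLIENT_NAME",
    21: "CA_PROTO_HOST_NAME",
}

# Matches the first of the two "Request from" lines printed per message.
REQUEST = re.compile(
    r"cmmd=(\d+) cid=(0x[0-9a-f]+) type=(\d+) count=(\d+) postsize=(\d+)"
)


def annotate(line):
    """Return (command name, count) for a request line, else None."""
    match = REQUEST.search(line)
    if match is None:
        return None
    cmmd = int(match.group(1))
    count = int(match.group(4))
    return CA_COMMANDS.get(cmmd, f"unknown({cmmd})"), count


sample = (
    "CAS: Request from 172.20.3.200:59354 => "
    "cmmd=6 cid=0x1 type=5 count=13 postsize=16"
)
print(annotate(sample))
```

Lines that do not carry a cmmd field, such as "CAS: Sending a udp message of 40 bytes", yield None and can be skipped.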

Dirk Zimoch (dirk.zimoch) wrote :

And now a 3.13.10 IOC connecting to a 3.14.12 IOC:

CAS: cast server msg of 16 bytes from addr 172.20.3.200:54015
CAS: Request from 172.20.3.200:54015 => cmmd=13 cid=0x0 type=0 count=5064 postsize=0
CAS: Request from 172.20.3.200:54015 => available=0x81810b32 N=1 paddr=(nil)
CAS: request from 172.20.3.200:54015 => invalid (damaged?) request code from UDP
CAS: Request from 172.20.3.200:54015 => cmmd=13 cid=0x0 type=0 count=5064 postsize=0
CAS: Request from 172.20.3.200:54015 => available=0x81810b32 N=0 paddr=(nil)
CAS: invalid (damaged?) UDP request from 172.20.3.200:54015 ?
CAS: message received at 2022-05-13 15:48:42
CAS: Sending a udp message of 16 bytes
CAS: cast server msg of 16 bytes from addr 172.20.3.200:54015
CAS: Request from 172.20.3.200:54015 => cmmd=13 cid=0x0 type=0 count=5064 postsize=0
CAS: Request from 172.20.3.200:54015 => available=0x81810b32 N=1 paddr=(nil)
CAS: request from 172.20.3.200:54015 => invalid (damaged?) request code from UDP
CAS: Request from 172.20.3.200:54015 => cmmd=13 cid=0x0 type=0 count=5064 postsize=0
CAS: Request from 172.20.3.200:54015 => available=0x81810b32 N=0 paddr=(nil)
CAS: invalid (damaged?) UDP request from 172.20.3.200:54015 ?
CAS: message received at 2022-05-13 15:48:42
CAS: Sending a udp message of 16 bytes
CAS: cast server msg of 32 bytes from addr 172.20.3.200:57086
CAS: Request from 172.20.3.200:57086 => cmmd=6 cid=0x0 type=5 count=8 postsize=16
CAS: Request from 172.20.3.200:57086 => available=0x0 N=1 paddr=(nil)
CAS: Sending a udp message of 40 bytes
CAS: cast server msg of 32 bytes from addr 172.20.3.200:57086
CAS: Request from 172.20.3.200:57086 => cmmd=6 cid=0x0 type=5 count=8 postsize=16
CAS: Request from 172.20.3.200:57086 => available=0x0 N=1 paddr=(nil)
CAS: Sending a udp message of 40 bytes
CAS: cast server msg of 16 bytes from addr 172.20.3.200:54015
CAS: Request from 172.20.3.200:54015 => cmmd=13 cid=0x0 type=0 count=5064 postsize=0
CAS: Request from 172.20.3.200:54015 => available=0x81810b32 N=1 paddr=(nil)
CAS: request from 172.20.3.200:54015 => invalid (damaged?) request code from UDP
CAS: Request from 172.20.3.200:54015 => cmmd=13 cid=0x0 type=0 count=5064 postsize=0
CAS: Request from 172.20.3.200:54015 => available=0x81810b32 N=0 paddr=(nil)
CAS: invalid (damaged?) UDP request from 172.20.3.200:54015 ?
CAS: message received at 2022-05-13 15:48:42
CAS: Sending a udp message of 16 bytes
CAS: conn req from 172.20.3.200:51232
CAS: Request from 172.20.3.200:51232 => cmmd=20 cid=0x0 type=0 count=0 postsize=8
CAS: Request from 172.20.3.200:51232 => available=0x0 N=1 paddr=(nil)
CAS: Request from 172.20.3.200:51232 => cmmd=21 cid=0x0 type=0 count=0 postsize=16
CAS: Request from 172.20.3.200:51232 => available=0x0 N=2 paddr=(nil)
CAS: Request from 172.20.3.200:51232 => cmmd=18 cid=0x0 type=0 count=0 postsize=16
CAS: Request from 172.20.3.200:51232 => available=0x8 N=3 paddr=(nil)
CAS: Sending a message of 48 bytes
CAS: Request from 172.20.3.200:51232 => cmmd=1 cid=0x0 type=20 count=1 postsize=16
CAS: Request from 172.20.3.200:51232 => available=0x0 N=1 paddr=0x7f3dd4010908
CAS: Sending a message of 40 byt...

Dirk Zimoch (dirk.zimoch) wrote :

The log on a 7.0.6 IOC with CA version check removed shows pretty much the same. Connection is successful and data is correct.

Dirk Zimoch (dirk.zimoch) wrote :

I now understand the problem:

EPICS 3.13 clients do not send a CA_PROTO_VERSION message in their search requests but instead have the CA minor version (which is 8) in the m_count field in the CA_PROTO_SEARCH message (the same location as the CA_PROTO_VERSION would have it).
EPICS 3.14+ clients still have that (in addition to the CA_PROTO_VERSION message). The code even checks the m_count field (and not client->minor_version) in search_reply_udp and search_reply_tcp. Also see the comment in search_reply_udp: "starting with V4.4 the count field is used (abused) to store the minor version number of the client."
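
To illustrate where that number sits on the wire, here is a minimal Python sketch that unpacks a 16-byte CA message header and reads the minor version from the count field of a search request. The layout (four 16-bit fields followed by two 32-bit fields, big-endian) and the field names are my understanding of the CA header and should be checked against caProto.h. The helper names `parse_header` and `search_minor_version` are hypothetical. The sample values are those of the 3.13 search in the CASDEBUG log (cmmd=6, type=5, count=8, postsize=16).

```python
import struct
from collections import namedtuple

# Assumed CA header layout: network byte order, 16 bytes in total.
CaHdr = namedtuple(
    "CaHdr", "m_cmmd m_postsize m_dataType m_count m_cid m_available"
)
CA_HDR = struct.Struct(">HHHHII")

CA_PROTO_SEARCH = 6


def parse_header(buf):
    """Unpack the fixed header at the start of a CA message."""
    return CaHdr(*CA_HDR.unpack_from(buf))


def search_minor_version(buf):
    """Return the client minor version carried by a search request.

    Returns None if the message is not a CA_PROTO_SEARCH.
    """
    hdr = parse_header(buf)
    if hdr.m_cmmd != CA_PROTO_SEARCH:
        return None
    return hdr.m_count


# Header of the 3.13 search request as seen in the log.
old_search = CA_HDR.pack(CA_PROTO_SEARCH, 16, 5, 8, 0, 0)
print(search_minor_version(old_search))
```

The point of the sketch is that the version is available from the search message alone, without any per-client state from an earlier CA_PROTO_VERSION message.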

But the connection consists of the TCP messages CA_PROTO_CLIENT_NAME, CA_PROTO_HOST_NAME and CA_PROTO_CREATE_CHAN. Again there is no initial CA_PROTO_VERSION message. Instead the version is embedded in the m_available field of the CA_PROTO_CREATE_CHAN message. Thus we cannot test for the version until that message has been received.

Thus the "ignore deprecated clients" check in camessage would need to let the following messages pass without version test: CA_PROTO_VERSION, CA_PROTO_SEARCH, CA_PROTO_CLIENT_NAME, CA_PROTO_HOST_NAME, CA_PROTO_CREATE_CHAN. (The last 3 are TCP only).
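
The proposed pass list can be sketched as follows. This is a Python model of the idea, not a patch for camessage: `filter_tcp`, the tuple format and the threshold of 4 are assumptions for illustration. The replayed sequence is the TCP part of the 3.13.10 IOC log (cmmd 20, 21, 18 with available=0x8, then cmmd 1).

```python
CA_PROTO_VERSION = 0
CA_PROTO_EVENT_ADD = 1
CA_PROTO_SEARCH = 6
CA_PROTO_CREATE_CHAN = 18
CA_PROTO_CLIENT_NAME = 20
CA_PROTO_HOST_NAME = 21

MIN_SUPPORTED_MINOR = 4  # assumption: "older than v4.4"

# Messages that may arrive before the client version is known.
PASS_WITHOUT_VERSION = frozenset({
    CA_PROTO_VERSION,
    CA_PROTO_SEARCH,
    CA_PROTO_CLIENT_NAME,
    CA_PROTO_HOST_NAME,
    CA_PROTO_CREATE_CHAN,
})


def filter_tcp(messages):
    """Return the commands accepted from one TCP connection.

    `messages` is a list of (command, count, available) tuples.
    """
    minor = 0
    accepted = []
    for cmmd, count, available in messages:
        if cmmd == CA_PROTO_VERSION:
            minor = count
        elif cmmd == CA_PROTO_CREATE_CHAN:
            minor = available  # 3.13 clients announce the version here
        if cmmd not in PASS_WITHOUT_VERSION and minor < MIN_SUPPORTED_MINOR:
            continue  # version known (or still 0) and too old
        accepted.append(cmmd)
    return accepted


# TCP sequence of the 3.13.10 IOC from the log.
old_ioc = [(20, 0, 0x0), (21, 0, 0x0), (18, 0, 0x8), (1, 1, 0x0)]
# Hypothetical client announcing minor version 3.
ancient = [(20, 0, 0x0), (21, 0, 0x0), (18, 0, 0x3), (1, 1, 0x0)]

print(filter_tcp(old_ioc))
print(filter_tcp(ancient))
```

With the pass list the 3.13 sequence goes through completely, while a client announcing a minor version below 4 is still stopped after CA_PROTO_CREATE_CHAN. A real server would presumably refuse the channel itself at that point; the sketch only models the filter.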

What I do not understand is what that code is actually supposed to protect against. What are those "deprecated clients" and what would break if they were not ignored? The log message just says "ignore CA client version older than v4.4" but not why.

Revision history for this message
Dirk Zimoch (dirk.zimoch) wrote :

Another small problem is that EPICS 3.13 IOCs may send their beacons to the "wrong" port. This happens when EPICS_CA_ADDR_LIST contains host:port entries. In this case, the IOCs send the beacons to the given search port too. These are the reason for the "invalid (damaged?) request code from UDP" messages. Some code is needed to ignore CA_PROTO_RSRV_IS_UP messages on the wrong port.
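
One possible shape for such a check, as a Python sketch rather than rsrv code: the `classify_udp_request` function and its return strings are hypothetical. The command number 13 is the cmmd value of the rejected messages in the 3.13.10 IOC log, where count=5064 looks like the CA server port announced by the beacon.

```python
CA_PROTO_VERSION = 0
CA_PROTO_SEARCH = 6
CA_PROTO_RSRV_IS_UP = 13  # beacon, cmmd=13 in the log


def classify_udp_request(cmmd):
    """Decide what the UDP search handler should do with a message."""
    if cmmd == CA_PROTO_VERSION:
        return "version"
    if cmmd == CA_PROTO_SEARCH:
        return "search"
    if cmmd == CA_PROTO_RSRV_IS_UP:
        # Beacon sent to the search port by a 3.13 IOC: skip it quietly
        # instead of logging "invalid (damaged?) request code from UDP".
        return "ignore"
    return "invalid"


for cmmd in (0, 6, 13, 99):
    print(cmmd, classify_udp_request(cmmd))
```

Treating the beacon as a known but ignored message keeps the "invalid" path for traffic that really is damaged.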
