"unterminated PV name in UDP search request" kills all our gateways

Bug #1804662 reported by Ben Franksen
Affects: EPICS Base | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

On Nov 21 16:55 all our CA gateways suddenly stopped working. They didn't crash, just no longer served any requests. This is from the gateway logs:

ctl-192-168-21-0-24/gateway.log-CAS Request: ? on usd-010-102.usd.bessy.de:37925: cmd=6 cid=1166 typ=5 cnt=11 psz=40 avail=48e
ctl-192-168-21-0-24/gateway.log-CAS:
ctl-192-168-21-0-24/gateway.log-Nov 21 16:55:00 !!! Errlog message received (message is above)
ctl-192-168-21-0-24/gateway.log:unterminated PV name in UDP search request?
ctl-192-168-21-0-24/gateway.log-
ctl-192-168-21-0-24/gateway.log-Nov 21 16:55:00 !!! Errlog message received (message is above)
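
As a reading aid, the header fields in the "CAS Request" line can be pulled apart as sketched below. The interpretation of the fields (typ as the search reply flag, cnt as the minor protocol version, cid and avail as the two 32-bit header parameters) is an assumption based on the CA protocol specification for CA_PROTO_SEARCH, not on PCAS documentation. The function name decode_request_line is hypothetical. That avail is printed in hexadecimal is inferred from the trailing "e" in the logged value.

```python
import re

# Assumed meaning of the "typ" field in a search request (from the CA
# protocol specification; not confirmed against the PCAS sources).
REPLY_FLAGS = {5: "DONT_REPLY", 10: "DO_REPLY"}


def decode_request_line(line):
    """Extract the header fields from a 'CAS Request' log line."""
    fields = dict(re.findall(r"(\w+)=([0-9a-fA-F]+)", line))
    hdr = {key: int(fields[key]) for key in ("cmd", "cid", "typ", "cnt", "psz")}
    hdr["avail"] = int(fields["avail"], 16)  # assumed to be hexadecimal
    return hdr


line = ("CAS Request: ? on usd-010-102.usd.bessy.de:37925: "
        "cmd=6 cid=1166 typ=5 cnt=11 psz=40 avail=48e")
hdr = decode_request_line(line)
print(hdr)
print("reply flag:", REPLY_FLAGS.get(hdr["typ"], "unknown"))
print("cid == avail:", hdr["cid"] == hdr["avail"])
```

Read this way, avail (0x48e) has the same value as cid (1166), so the two 32-bit parameters agree with each other, and the 40-byte payload is the part of the request that remains to be explained.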

Grepping for the error message reveals two locations in the PCAS:

franksen@tiber: ...src/epics-base/3.15 > grep -r 'unterminated PV name in UDP search request'
src/ca/legacy/pcas/generic/casDGClient.cc: "unterminated PV name in UDP search request?\n" );
src/ca/legacy/pcas/generic/casStrmClient.cc: "unterminated PV name in UDP search request?\n" );

So, while PCAS detects that there is something wrong with the request, it (or the CA gateway) cannot quite recover from this situation.

Unfortunately we weren't able to secure a core dump because there was pressure to get things going again, which meant restarting all gateways. If it happens again I think we will be better prepared, so we may be able to give more information to diagnose the problem.

Ben Franksen (bfrk) wrote :

This happened with base-3.15.5 and cagateway-2.1.0.0, the OS is Debian-8.9.

Ralph Lange (ralph-lange) wrote :

Unless you're using name resolution on TCP (EPICS_CA_NAME_SERVERS), the code in casDGClient.cc applies (DG = datagram).

What is the CA client that sent the name request that PCAS found illegal? (usd-010-102.usd.bessy.de:37925)

Ben Franksen (bfrk) wrote :

I should add that all gateways (more than 30 IIRC) issued the same message to the log at exactly the same time.

We will try to find out who owns usd-010-102.usd.bessy.de and what clients run on it.

At first glance, the code in casDGClient.cc looks okay to me.

I find it curious that psz=40. I think 'psz' here is the size of the CA payload in bytes, and the 40 suggests a client that uses MAX_STRING_SIZE or something similar. It doesn't look as if the client is based on a recent version of the native CA client library. That suggests to me that what happened may be an overrun in the client: the rest of the PV name then gets sent as the next CA packet header, and /that/ causes the gateways to stop working.
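
The overrun hypothesis above can be illustrated with a small sketch. It assumes the standard 16-byte CA header (command, payload size, data type, data count, and two 32-bit parameters, all big-endian) and a hypothetical client that always claims a 40-byte payload but writes the complete PV name after the header. The functions overrun_datagram and server_view and the 60-character PV name are made up for illustration. The terminator test is a stand-in, not the actual PCAS check.

```python
import struct

HEADER = struct.Struct(">HHHHII")  # cmd, payload size, type, count, p1, p2
CA_PROTO_SEARCH = 6
NAME_BUF = 40  # payload size claimed by the hypothetical client


def overrun_datagram(pv_name, cid=1166):
    """Hypothetical buggy client: claims 40 bytes, writes the whole name."""
    name = pv_name.encode("ascii") + b"\0"
    header = HEADER.pack(CA_PROTO_SEARCH, NAME_BUF, 5, 11, cid, cid)
    return header + name.ljust(NAME_BUF, b"\0")


def server_view(datagram):
    """Split the datagram the way the header says it should be split."""
    cmd, psz, typ, cnt, p1, p2 = HEADER.unpack_from(datagram, 0)
    payload = datagram[HEADER.size:HEADER.size + psz]
    rest = datagram[HEADER.size + psz:]
    terminated = b"\0" in payload  # stand-in for the server's check
    return terminated, rest


long_name = "A" * 56 + ".VAL"  # 60 characters, made up for this example
terminated, rest = server_view(overrun_datagram(long_name))
next_cmd, next_psz = struct.unpack_from(">HH", rest)
print("terminated:", terminated)
print("bytes spilling past the payload:", len(rest))
print("next 'header' would claim cmd=%d psz=%d" % (next_cmd, next_psz))
```

With a name shorter than 40 characters the payload contains its terminator and nothing is left over. With the 60-character name, the first 40 bytes carry no terminator and the remaining 21 bytes would be parsed as another message whose command and payload size are just characters from the name.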

Do we have tests in EPICS base to see how a server handles randomly generated CA packets? I guess not. I wonder what happens, for instance, if the actual size of the UDP packet is smaller or larger than what the CA server calculates based on the size of the payload in the header.
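
A minimal sketch of the kind of test meant here, under the same assumption of a 16-byte big-endian CA header: a reference walker that splits a datagram into messages and reports a mismatch between the claimed payload size and the bytes actually present, plus a seeded generator of random datagrams. This is not code from EPICS base; walk_datagram and random_datagrams are hypothetical names, and the extended header form used for large payloads is ignored for simplicity.

```python
import random
import struct

HEADER = struct.Struct(">HHHHII")  # cmd, payload size, type, count, p1, p2


def walk_datagram(data):
    """Return (messages, error); error is None if all sizes are consistent."""
    messages = []
    offset = 0
    while offset < len(data):
        if len(data) - offset < HEADER.size:
            return messages, "truncated header"
        cmd, psz, typ, cnt, p1, p2 = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        if psz > len(data) - offset:
            return messages, "payload size exceeds datagram"
        messages.append((cmd, data[offset:offset + psz]))
        offset += psz
    return messages, None


def random_datagrams(count, seed=0, max_len=64):
    """Yield reproducible random byte strings of length 0..max_len."""
    rng = random.Random(seed)
    for _ in range(count):
        length = rng.randrange(max_len + 1)
        yield bytes(rng.randrange(256) for _ in range(length))


# The walker must classify every input without raising an exception.
rejected = 0
for datagram in random_datagrams(1000):
    messages, error = walk_datagram(datagram)
    if error is not None:
        rejected += 1
print("rejected", rejected, "of 1000 random datagrams")
```

The same random datagrams could be sent to a test server while checking that it keeps answering valid searches afterwards; the walker only tells you which of them a well-behaved parser is expected to reject.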

Andrew Johnson (anj) wrote :

@Ralph if one of those errors actually comes from TCP name resolution, it might be a good idea to change the message to be "unterminated PV name in TCP search request".

@Ben I don't think these kinds of tests belong in Base as such, but there is now a pure Python implementation of the CA protocol from NSLS-2 called "caproto" which might be a good basis for investigating protocol issues. I have not played with it but it was discussed at the meeting in Melbourne. https://nsls-ii.github.io/caproto/

Michael has also developed some Python tools for CA protocol development, https://github.com/mdavidsaver/twistedca

Ben Franksen (bfrk) wrote :

Thanks for the pointers. Interesting projects but not what I need to reproduce this problem.

I have written a small (~50 LOC) "rogue" client program that sends out broken CA_PROTO_SEARCH packets. So far I have not been able to get my test gateway to stop working. I have not yet tried to randomize packets and send them in a loop. Perhaps I will try that as the next step.
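
The program itself is not part of this report. The following is only a guess at what such a client could look like, written in Python for brevity. It assumes the 16-byte big-endian CA header, the header values seen in the logged request (typ=5, cnt=11), and the default CA server port 5064. All function names and the choice of broken variants are hypothetical. A real CA client normally puts a CA_PROTO_VERSION message in front of its search messages; that is left out here.

```python
import socket
import struct

HEADER = struct.Struct(">HHHHII")  # cmd, payload size, type, count, p1, p2
CA_PROTO_SEARCH = 6
REPLY_FLAG = 5       # value seen in the logged request (typ=5)
MINOR_VERSION = 11   # value seen in the logged request (cnt=11)


def search_message(payload, cid=1, claimed_size=None):
    """Build a search message; claimed_size may disagree with the payload."""
    size = len(payload) if claimed_size is None else claimed_size
    header = HEADER.pack(CA_PROTO_SEARCH, size, REPLY_FLAG,
                         MINOR_VERSION, cid, cid)
    return header + payload


def broken_variants():
    """A few deliberately malformed search messages (illustrative only)."""
    return {
        "unterminated": search_message(b"X" * 40),
        "empty": search_message(b""),
        "size_too_large": search_message(b"short\0\0\0", claimed_size=40),
        "size_too_small": search_message(b"A" * 39 + b"\0", claimed_size=8),
    }


def send_variants(host, port=5064):
    """Send each variant as its own UDP datagram to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for packet in broken_variants().values():
            sock.sendto(packet, (host, port))
```

Calling send_variants("127.0.0.1") would send the four datagrams to a test gateway on the local machine. This should only be pointed at a server you own and are prepared to restart.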

Ralph Lange (ralph-lange) wrote :

@Andrew: misleading error message fixed in 3.15 and pcas module.
