DB CA links are slow to reconnect when IOC is rebooted (see mantis-126)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
EPICS Base | Won't Fix | Medium | Andrew Johnson | —
Bug Description
Both Bob and Dirk have complained about this issue.
The root problem appears to be that the database needs to implement an orderly shutdown.
Additional information:
Here is a message I sent to Dirk:
> we have a problem with CA since we upgraded our MV2300 IOCs to Tornado2.
>
> After a reboot, often channel access links don't connect immediately to
> the server. They connect a few minutes later when this message is printed:
>
> CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> 22="S_errno_EINVAL"
>
> My guess is the following:
>
> The server does not know that the client rebooted.
> The client uses the same settings (e.g. dynamically assigned port
> number, etc) to connect to the server as last time. The packets now
> look exactly the same as during the last reboot.
>
> Since TCP involves resending lost packets, the server thinks the
> packets are duplicates and drops them.
>
> The server never replies and connect() fails.
Yes, this is one of the subtle compatibility issues with TCP/IP. For whoever implements the server side of an IP kernel it's a tradeoff: you have to be robust when hackers attempt a denial-of-service attack against the server, but you would also like to detect stale circuits and disconnect them when a client reusing the same ephemeral port keeps transmitting with the wrong TCP state or the wrong TCP sequence number. It's hard to have both.
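To make the failure mode concrete, here is a minimal sketch (not from the original exchange) that reproduces the condition Dirk describes by binding the client socket to a fixed source port before connecting, so that a run after an unclean reboot presents the server with the same 4-tuple as the stale circuit. The address 172.19.157.20 and both port numbers are placeholders.

```c
#include <vxWorks.h>
#include <sockLib.h>
#include <inetLib.h>
#include <ioLib.h>
#include <stdio.h>
#include <string.h>

/* Force reuse of one source port so the server sees the same 4-tuple
 * it remembers from before the client's unclean reboot. */
void staleCircuitTest (void)
{
    struct sockaddr_in local, server;
    int s = socket (AF_INET, SOCK_STREAM, 0);

    memset (&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_port = htons (40001);             /* fixed "ephemeral" port */
    local.sin_addr.s_addr = htonl (INADDR_ANY);
    bind (s, (struct sockaddr *) &local, sizeof local);

    memset (&server, 0, sizeof server);
    server.sin_family = AF_INET;
    server.sin_port = htons (5064);             /* CA server port */
    server.sin_addr.s_addr = inet_addr ("172.19.157.20");

    /* If the server still holds a TCB for this 4-tuple it answers the
     * SYN with a stale-state segment, and the connect times out. */
    if (connect (s, (struct sockaddr *) &server, sizeof server) == ERROR)
        perror ("connect");

    close (s);
}
```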
> I had the idea to install a rebootHook that closes all CAC sockets, just
> to see if that helps.
>
> Unfortunately, I don't know how to get a list of all CAC sockets. I
> tried to iterate ca_static->ca_iiuList, but ca_static
> was either NULL (when I don't call ca_task_initialize) or the ca_iiuList
> is empty (when I do call ca_task_initialize).
When vxWorks does a soft reboot it does not initiate a TCP shutdown procedure for any sockets that might be open.
Last time I investigated this idea of implementing a rebootHook in CA, I concluded that vxWorks runs the reboot hooks in the wrong order (FIFO rather than LIFO). This means that vxWorks has already placed the network stack in an inaccessible state by the time it runs the CA-installed reboot hook, so there is nothing CA can do about closing sockets in a rebootHook. I asked WRS about this and they admitted it was a mistake, but said that they would not fix it. That interaction with WRS occurred about 15 years ago.
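For reference, this is roughly what the attempted hook would look like. rebootHookAdd() is the real vxWorks interface; caCloseAllSockets() is a hypothetical helper that the CA client library does not actually export, and as explained above the hook runs too late to do any good.

```c
#include <vxWorks.h>
#include <rebootLib.h>

/* Hypothetical helper; the CA client library exports no such routine. */
extern void caCloseAllSockets (void);

/* Would close all CA client sockets on a soft reboot, but vxWorks runs
 * reboot hooks FIFO rather than LIFO, so the network stack is already
 * inaccessible by the time this is called. */
static void caRebootHook (int startType)
{
    (void) startType;       /* unused: same action for any reboot type */
    caCloseAllSockets ();
}

void caInstallRebootHook (void)
{
    rebootHookAdd ((FUNCPTR) caRebootHook);
}
```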
Ok, it gets worse. In principle we don't care about rebootHook architectural problems: we can have our own shell command that shuts down an EPICS IOC and then initiates a soft reboot.
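A sketch of such a command, assuming the application supplies an orderly shutdown routine (the iocShutdown() below is hypothetical); reboot() and BOOT_NORMAL are the standard vxWorks soft-reboot interface:

```c
#include <vxWorks.h>
#include <rebootLib.h>
#include <sysLib.h>
#include <cadef.h>

/* Hypothetical application routine that deletes every CA channel the
 * IOC created (DB CA links, sequence programs, and so on). */
extern void iocShutdown (void);

/* Shell command: orderly EPICS shutdown, then a soft reboot, so the
 * server never ends up holding half-open circuits from this client. */
void iocRestart (void)
{
    iocShutdown ();           /* the application deletes its channels */
    ca_context_destroy ();    /* then CA tears down its own context   */
    reboot (BOOT_NORMAL);     /* vxWorks soft reboot                  */
}
```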
A fundamental unresolved problem is that if CA tries to automatically clean up at exit it can yank the carpet out from under other EPICS threads that do not have an orderly shutdown and continue to use CA after the exit handlers run. For example, with DB CA links that attach directly to in-memory DB records we saw problems where a subscription update from the database would continue to use a CA entity that had been cleaned up by CA's exit handler.
Fundamentally, you have to decide who is responsible for initiating cleanup. CA keeps a list of channels and could clean them up itself, but that causes trouble if someone else also tries to clean them up. Everyone eventually arrives at the same conclusion, and it is best to stick strictly to this rule: whoever created it is responsible for deleting it.
Following that rule, one is led to the conclusion that the application must first delete its CA channels and then delete the CA client context that it created.
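Applied to an ordinary CA client, that ordering looks something like this sketch (channel creation is not shown; the chans array is assumed to hold the channels this code created):

```c
#include <cadef.h>

/* "Whoever created it is responsible for deleting it": clear the
 * channels we created first, then destroy the context we created. */
void appShutdown (chid *chans, unsigned nChans)
{
    unsigned i;
    for (i = 0; i < nChans; i++) {
        ca_clear_channel (chans[i]);  /* delete what we created */
    }
    ca_flush_io ();                   /* push the clear requests out */
    ca_context_destroy ();            /* then destroy our own context */
}
```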
With the DB CA links that responsibility lies with the database. The database does not currently have orderly shutdown procedures, but presumably this is high on the list for future versions of EPICS.
Mantis 126 was originally tracking this issue against R3.14. It has been marked as "won't fix". This was the last comment on mantis 126: "When ca_context_destroy was called it caused a crash when db_post_event was called for a channel access link to a local channel. For 3.14 there are currently no plans to fix this since it probably means a complete cleanup of all resources used by database records."
Original Mantis Bug: mantis-234
http://
tags: added: vxworks
Changed in epics-base:
assignee: nobody → Andrew Johnson (anj)
Changed in epics-base:
status: Incomplete → Won't Fix
From Mark Rivers:
> > we have a problem with CA since we upgraded our MV2300 IOCs
> > to Tornado2.
> >
> > After a reboot, often channel access links don't connect
> > immediately to the server. They connect a few minutes later
> > when this message is printed:
> >
> > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > 22="S_errno_EINVAL"
This is not just a problem with IOC to IOC sockets, but with any vxWorks to vxWorks sockets.
We recently purchased a Newport XPS motor controller. It communicates over Ethernet and uses vxWorks as its operating system. We control the XPS from a vxWorks IOC. When we reboot our vxWorks IOC, the XPS will not communicate again afterwards, because it does not know the IOC rebooted and the same ports are being used. It is thus necessary to also reboot the XPS when rebooting the IOC. But rebooting the XPS requires re-homing all of the motors, which is sometimes almost impossible because of installed equipment! This is a real pain.
This problem goes away if we control the XPS with a non-vxWorks IOC, such as Linux, probably because Linux closes the sockets when the IOC process is killed.
On a related topic, I am appending an exchange I had with Jeff Hill and others on this topic in October 2003, that was not posted to tech-talk.
Cheers,
Mark Rivers
Folks,
I'd like to revisit the problem of CA disconnects when rebooting a vxWorks client IOC that has CA links to a vxWorks server IOC (that does not reboot).
The EPICS 3.14.3 Release Notes say:
"Recent versions of vxWorks appear to experience a connect failure if the vxWorks IP kernel reassigns the same ephemeral TCP port number as was assigned during a previous lifetime. The IP kernel on the vxWorks system hosting the CA server might have a stale entry for this ephemeral port that has not yet timed out which prevents the client from connecting with the ephemeral port assigned by the IP kernel. Eventually, after EPICS_CA_CONN_TMO seconds, the TCP connect sequence is aborted and the client library closes the socket, opens a new socket, receives a new ephemeral port assignment, and successfully connects."
The last sentence is only partially correct. The problem is that:
- vxWorks assigns these ephemeral port numbers in ascending numerical order
- It takes a very long time for the server IOC to kill the stale entries
Thus, if I reboot the client many times in a row, it does not just result in one disconnect before the successful connection, but many. I just did a test where I rebooted a vxWorks client IOC 11 times, as one might do when debugging IOC software. This IOC is running Marty's example sequence program, with 2 PVs connecting to a remote vxWorks server IOC.
Here is the amount of time elapsed before the sequence program PVs connected:

Reboot # | Time (sec)
---|---
1 | 0.1
2 | 5.7
3 | 30
4 | 60
5 | 90
6 | 120
7 | 30
8 | 150
9 | 150
10 | 180
11 | 210
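As an aside, the ascending assignment can be watched directly with a small sketch like the following (not part of the test above; the server address is a placeholder): connect, then ask the stack which local port it handed out.

```c
#include <vxWorks.h>
#include <sockLib.h>
#include <inetLib.h>
#include <ioLib.h>
#include <stdio.h>
#include <string.h>

/* Print the ephemeral port the vxWorks IP kernel assigns to each new
 * outgoing connection; called repeatedly, the numbers simply ascend. */
void showEphemeralPort (void)
{
    struct sockaddr_in server, local;
    int len = sizeof (local);
    int s = socket (AF_INET, SOCK_STREAM, 0);

    memset (&server, 0, sizeof server);
    server.sin_family = AF_INET;
    server.sin_port = htons (5064);                   /* CA server */
    server.sin_addr.s_addr = inet_addr ("164.54.160.74");

    if (connect (s, (struct sockaddr *) &server, sizeof server) == OK) {
        getsockname (s, (struct sockaddr *) &local, &len);
        printf ("assigned ephemeral port %d\n",
                (int) ntohs (local.sin_port));
    }
    close (s);
}
```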
Here is the output of "casr" on the vxWorks server IOC (which never rebooted) after client reboot #11:

Channel Access Server V4.11
164.54.160.74:1067 (ioc13bma): User="iocboot", V4.11, Channel Count=1
...