DB CA links are slow to reconnect when IOC is rebooted (see mantis-126)

Bug #541269 reported by Jeff Hill
Affects: EPICS Base
Status: Won't Fix
Importance: Medium
Assigned to: Andrew Johnson
Milestone: (none)

Bug Description

Both Bob and Dirk have complained about this issue.

The root problem appears to be that the database needs to implement an orderly shutdown.

Additional information:
Here is a message I sent to Dirk:

> we have a problem with CA since we upgraded our MV2300 IOCs to Tornado2.
>
> After a reboot, often channel access links don't connect immediately to
> the server. They connect a few minutes later when this message is printed:
>
> CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> 22="S_errno_EINVAL"
>
> My guess is the following:
>
> The server does not know that the client rebooted.
> The client uses the same settings (e.g. dynamically assigned port
> number, etc) to connect to the server as last time. The packets now
> look exactly the same as during the last reboot.
>
> Since TCP involves resending lost packets, the server thinks the
> packets are duplicates and drops them.
>
> The server never replies and connect() fails.

Yes, this is one of the subtle compatibility issues with TCP/IP. For whoever implements the server side of an IP kernel it's a tradeoff. You have to be robust when hackers attempt a denial-of-service attack against the server, but you would also like to detect stale circuits and disconnect them when a client reusing the same ephemeral port keeps transmitting with the wrong TCP state or the wrong TCP sequence number. It's hard to have both.

> I had the idea to install a rebootHook that closes all CAC sockets, just
> to see if that helps.
>
> Unfortunately, I don't know how to get a list of all CAC sockets. I
> tried to iterate ca_static->ca_iiuList, but that didn't work. ca_static
> was either NULL (when I don't call ca_task_initialize) or the ca_iiuList
> is empty (when I do call ca_task_initialize).

When vxWorks does a soft reboot it does not initiate a TCP shutdown sequence for any sockets that might be open.

The last time I investigated the idea of implementing a rebootHook in CA, I concluded that vxWorks runs the reboot hooks in the wrong order (FIFO rather than LIFO). This means that vxWorks has already placed the network stack in an inaccessible state by the time it runs the CA-installed reboot hook, so there is nothing CA can do about closing sockets in a rebootHook. I asked WRS about this; they admitted it was a mistake but said they would not fix it. That interaction with WRS occurred about 15 years ago.

OK, it gets worse. In principle we don't care about rebootHook architectural problems: we can have our own shell command that shuts down an EPICS IOC and then initiates a soft reboot.

A fundamental unresolved problem is that if CA tries to clean up automatically at exit, it can yank the carpet out from under other EPICS threads that have no orderly shutdown of their own and continue to use CA after the exit handlers run. For example, with DB CA links that attach directly to in-memory DB records, we saw problems where a subscription update from the database would continue to use a CA entity that had already been cleaned up by CA's exit handler.

Fundamentally, you have to decide who is responsible for initiating cleanup. CA keeps a list of channels and could clean them up itself, but that causes problems if someone else tries to clean them up as well. Everyone eventually arrives at the same conclusion: it is best to stick strictly to the rule that whoever created a resource is responsible for deleting it.

Following that rule, one is led to the conclusion that the application must first delete its CA channels and then delete the CA client context that it created.

With the DB CA links that responsibility lies with the database. The database does not currently have orderly shutdown procedures, but presumably this is high on the list for future versions of EPICS.

Mantis 126 originally tracked this issue against R3.14. It has been marked as "Won't Fix". This was the last comment on mantis 126: "When ca_context_destroy was called it caused a crash when db_post_event was called for a channel access link to a local channel. For 3.14 there are currently no plans to fix this since it probably means a complete cleanup of all resources used by database records."

Original Mantis Bug: mantis-234
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=234

Tags: db vxworks
Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Mark Rivers:

> > we have a problem with CA since we upgraded our MV2300 IOCs to
> > Tornado2.
> >
> > After a reboot, often channel access links don't connect immediately
> > to the server. They connect a few minutes later when this message is
> > printed:
> >
> > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > 22="S_errno_EINVAL"

This is not just a problem with IOC to IOC sockets, but with any vxWorks to vxWorks sockets.

We recently purchased a Newport XPS motor controller. It communicates over Ethernet and uses vxWorks as its operating system. We control the XPS from a vxWorks IOC. When we reboot our vxWorks IOC, the XPS will not communicate again afterwards, because it does not know the IOC rebooted and the same ports are being used. It is thus necessary to also reboot the XPS when rebooting the IOC. But rebooting the XPS requires re-homing all of the motors, which is sometimes almost impossible because of installed equipment! This is a real pain.

This problem goes away if we control the XPS with a non-vxWorks IOC, such as Linux, probably because Linux closes the sockets when killing the IOC.

On a related topic, I am appending an exchange I had with Jeff Hill and others on this topic in October 2003, that was not posted to tech-talk.

Cheers,
Mark Rivers

Folks,

I'd like to revisit the problem of CA disconnects when rebooting a vxWorks client IOC that has CA links to a vxWorks server IOC (that does not reboot).

The EPICS 3.14.3 Release Notes say:

"Recent versions of vxWorks appear to experience a connect failure if the vxWorks IP kernel reassigns the same ephemeral TCP port number as was assigned during a previous lifetime. The IP kernel on the vxWorks system hosting the CA server might have a stale entry for this ephemeral port that has not yet timed out which prevents the client from connecting with the ephemeral port assigned by the IP kernel. Eventually, after EPICS_CA_CONN_TMO seconds, the TCP connect sequence is aborted and the client library closes the socket, opens a new socket, receives a new ephemeral port assignment, and successfully connects."

The last sentence is only partially correct. The problem is that:
- vxWorks assigns these ephemeral port numbers in ascending numerical order
- It takes a very long time for the server IOC to kill the stale entries

Thus, if I reboot the client many times in a row, it does not result in just one disconnect before the successful connection, but many. I just did a test where I rebooted a vxWorks client IOC 11 times, as one might do when debugging IOC software. This IOC is running Marty's example sequence program, with 2 PVs connecting to a remote vxWorks server IOC.

Here is the amount of time elapsed before the sequence program PVs connected:

Reboot #   Time (sec)
       1          0.1
       2          5.7
       3         30
       4         60
       5         90
       6        120
       7         30
       8        150
       9        150
      10        180
      11        210

Here is the output of "casr" on the vxWorks server IOC (which never rebooted) after client reboot #11:

Channel Access Server V4.11
164.54.160.74:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1 ...


Jeff Hill (johill-lanl) wrote :

> > Here is a proposal for Jeff:
> >
> > Would it be possible to create a new function named something like
> > vxCAClientStopAll. This command would call close() on the CA
> > connections for all vxWorks CA clients, and hence would gracefully
> > close all of the socket connections on the server IOC.
> >
> > We could then make another new vxWorks command, "restart" which does
> > vxCAClientStopAll(); reboot();
>
> This is very awesome!!!
>
> Jeff can you implement this for the next EPICS RELEASE???
>
>
> Ernest
>

What Mark suggests is certainly a possible fix. If such a function were written, its name might be not vxCAClientStopAll() but rather ca_close_circuits_but_dont_shut_anything_else_down(), because if the rest of the CA infrastructure is not left in place, the db threads that are still using it will crash and potentially disrupt the orderly shutdown.

There are different perspectives on this. One perspective is that CA already has such functions, ca_clear_channel and ca_context_destroy, and that all that is needed is a function called dbStopAll that calls them ;-). There would be many advantages to such an approach. One of them is that devices could be shut down as well; for example, the Allen Bradley TCP/IP circuits might also need to be gracefully shut down.

Jeff Hill (johill-lanl) wrote :

From Mark Rivers,

> There are different perspectives on this. One perspective is that CA
> already has such functions, ca_clear_channel and ca_context_destroy,
> and that all that is needed is a function called dbStopAll that calls
> them ;-). There would be many advantages to such an approach. One of
> them would be that devices could be shutdown also. For example the
> Allen Bradley TCP/IP circuits might also need to be gracefully
> shutdown.

I like this suggestion, since in the case of the XPS controller I mentioned earlier, it is not a CA link to the XPS, but a socket opened with asyn. asyn or the driver needs to close that socket on shutdown in order to avoid the serious problems we are having on reboot.

Jeff Hill (johill-lanl) wrote :

TBOMK, you should not experience trouble if you call ca_task_exit() w/o deleting the channels in a non-IOC environment. I don't delete the channels (the application created them, so it has to delete them), but I do unlink them from the context (rendering them inactive).

On the IOC the same is true. The problem is that if you call ca_context_destroy w/o destroying the subscriptions then db_post_event might post subscription updates into event queues that have been destroyed.

There is code to unlink the channel from its service entity. I suppose that this problem might be fixed by adding code that unlinks the subscription from its service entity as well.

Andrew Johnson (anj) wrote :

Since R3.14.10 the IOC has included shutdown code that disconnects all dbCa channels and then calls ca_context_destroy(). This code requires a vxWorks BSP modification (search tech-talk for "sysAtReboot") to ensure that the epicsAtExit() routines get run before the vxWorks network stack closes down, and it obviously can't run if you hit the VME reset button instead of doing an orderly shutdown. However with this in place the description above is no longer accurate.

We need to know whether that change has solved the problems reported by this bug. I'm marking this bug as Incomplete until we get some feedback.

Changed in epics-base:
status: New → Incomplete
Jeff Hill (johill-lanl) wrote :

> Embedded systems do not usually "shutdown". They are
> simply rebooted.

It's true that any well-implemented system must recover from an ungraceful restart. Nevertheless, a well-implemented embedded system should also orchestrate a graceful shutdown so that responsible users can choose that option. Furthermore, if the user does not ask for a graceful restart, he can expect some additional delays before obtaining a fully functional system again.

> Furthermore it is likely that at each reboot the IOC
> behaves the same, in particular that it uses the same
> ports for the same connections

I only mention this because it seems plausible to me that the embedded IP kernel could be better implemented. The last ephemeral port assignment could be cached in flash so that we don't start over from the same base assignment, incrementing up from there, when the system is rebooted. Note that Eric Norum is on the distribution, and I understand that he was actually a force behind porting the IP kernel for RTEMS. So I was curious what opinions he might have on that issue.

> In my opinion the problem must be solved at connection,
> not at shutdown.

The error messages that you observed appear to indicate that the client library was waiting (a long time) for the IP kernel to connect the TCP circuit. Therefore, since CA is based on TCP, as the code is implemented this leaves the solution to the implementers of the IP kernel (but see my comments below).

> This is normal and worked fine with 3.13.
>

However, it does occur to me now that this observation could result from fundamental CA client library design differences between R3.13 and R3.14.

o In R3.13, all TCP circuits were abruptly disconnected and restarted if they remained unresponsive for more than EPICS_CA_CONN_TMO seconds. This is the "CA knows better" approach.

o In R3.14, the CA client library simply waits for inactive TCP circuits to return to an active state, or else to be disconnected by the IP kernel. This is the "TCP knows better" approach.

R3.14 is actually much more reliable (i.e. much less inclined towards positive congestion feedback) because it does not disconnect unresponsive circuits before TCP does.

Nevertheless, I recognize that R3.13 would possibly abandon an unresponsive, still-connecting circuit more rapidly, and that might very well be perceived by users as better behavior.

Presumably, vxWorks provides tuning parameters for its IP kernel that might allow users to choose the length of this type of delay. Admittedly, this would be global for all TCP circuits, not just CA circuits.

I updated the bug entry with this message

Jeff

Jeff Hill (johill-lanl) wrote :

Hello again Dirk,

I should add that this has been a good discussion. It's certainly possible that changes could be made to address this type of issue. However, I need to be conservative about such changes because I believe that the R3.14 aggregate EPICS system is, by design, much more reliable than the R3.13 aggregate system.

How often do these types of reboot reconnection issues occur? What is the typical delay before the system reconnects in such situations? And what is the maximum delay before a system has reconnected in such situations?

I will update the bug report with your answers.

tags: added: vxworks
Changed in epics-base:
assignee: nobody → Andrew Johnson (anj)
Andrew Johnson (anj)
Changed in epics-base:
status: Incomplete → Won't Fix