DB CA links are slow to reconnect when IOC is rebooted (see mantis-126)

Bug #541269 reported by Jeff Hill
Affects: EPICS Base
Status: Won't Fix
Importance: Medium
Assigned to: Andrew Johnson
Milestone: (none)

Bug Description

Both Bob and Dirk have complained about this issue.

The root problem appears to be that the database needs to implement an orderly shutdown.

Additional information:
Here is a message I sent to Dirk:

> we have a problem with CA since we upgraded our MV2300 IOCs to Tornado2.
>
> After a reboot, often channel access links don't connect immediately to
> the server. They connect a few minutes later when this message is printed:
>
> CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> 22="S_errno_EINVAL"
>
> My guess is the following:
>
> The server does not know that the client rebooted.
> The client uses the same settings (e.g. dynamically assigned port
> number, etc) to connect to the server as last time. The packets now
> look exactly the same as during the last reboot.
>
> Since TCP involves resending lost packets, the server thinks the
> packets are duplicates and drops them.
>
> The server never replies and connect() fails.

Yes, this is one of the subtle compatibility issues with TCP/IP. For whoever implements the server side of an IP kernel it's a tradeoff. You have to be robust when hackers attempt a denial-of-service attack against the server, but you would also like to detect stale circuits and disconnect them when a client reusing the same ephemeral port keeps transmitting with the wrong TCP state or the wrong TCP sequence number. It's hard to have both.

> I had the idea to install a rebootHook that closes all CAC sockets, just
> to see if that helps.
>
> Unfortunately, I don't know how to get a list of all CAC sockets. I
> tried to iterate ca_static->ca_iiuList, but that didn't work. ca_static
> was either NULL (when I don't call ca_task_initialize) or the ca_iiuList
> is empty (when I do call ca_task_initialize).

When vxWorks does a soft reboot it does not initiate a TCP shutdown sequence for any sockets that might be open.

The last time I investigated the idea of implementing a rebootHook in CA, I concluded that vxWorks runs the reboot hooks in the wrong order (FIFO rather than LIFO). This means that vxWorks has already placed the network stack in an inaccessible state by the time it runs the CA-installed reboot hook, so there is nothing CA can do about closing sockets in a rebootHook. I asked WRS about this; they admitted it was a mistake but said they would not fix it. That interaction with WRS occurred about 15 years ago.

OK, it gets worse. In principle we don't care about rebootHook architectural problems: we can have our own shell command that shuts down an EPICS IOC and then initiates a soft reboot.

A fundamental unresolved problem is that if CA tries to clean up automatically at exit, it can yank the carpet out from under other EPICS threads that have no orderly shutdown of their own and continue to use CA after the exit handlers run. For example, with DB CA links that attach directly to in-memory DB records, we saw problems where a subscription update from the database would continue to use a CA entity that had already been cleaned up by CA's exit handler.

Fundamentally, you have to decide who is responsible for initiating cleanup. CA keeps a list of channels and could clean them up itself, but that causes problems if someone else tries to clean them up as well. Everyone eventually arrives at the same conclusion: it is best to stick strictly to the rule that whoever created a resource is responsible for deleting it.

Following that rule, one is led to the conclusion that the application must first delete its CA channels and then delete the CA client context that it created.

With the DB CA links that responsibility lies with the database. The database does not currently have orderly shutdown procedures, but presumably this is high on the list for future versions of EPICS.

Mantis 126 originally tracked this issue against R3.14. It has been marked as "Won't Fix". This was the last comment on mantis 126: "When ca_context_destroy was called it caused a crash when db_post_event was called for a channel access link to a local channel. For 3.14 there are currently no plans to fix this since it probably means a complete cleanup of all resources used by database records."

Original Mantis Bug: mantis-234
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=234

Tags: db vxworks
Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Mark Rivers:

> > we have a problem with CA since we upgraded our MV2300 IOCs to
> > Tornado2.
> >
> > After a reboot, often channel access links don't connect immediately
> > to the server. They connect a few minutes later when this message is
> > printed:
> >
> > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > 22="S_errno_EINVAL"

This is not just a problem with IOC to IOC sockets, but with any vxWorks to vxWorks sockets.

We recently purchased a Newport XPS motor controller. It communicates over Ethernet and uses vxWorks as its operating system. We control the XPS from a vxWorks IOC. When we reboot our vxWorks IOC, the XPS will not communicate again afterwards, because it does not know the IOC rebooted and the same ports are being used. It is thus necessary to also reboot the XPS when rebooting the IOC. But rebooting the XPS requires re-homing all of the motors, which is sometimes almost impossible because of installed equipment! This is a real pain.

This problem goes away if we control the XPS with a non-vxWorks IOC, such as Linux, probably because Linux closes the sockets when killing the IOC.

On a related topic, I am appending an exchange I had with Jeff Hill and others on this topic in October 2003, that was not posted to tech-talk.

Cheers,
Mark Rivers

Folks,

I'd like to revisit the problem of CA disconnects when rebooting a vxWorks client IOC that has CA links to a vxWorks server IOC (that does not reboot).

The EPICS 3.14.3 Release Notes say:

"Recent versions of vxWorks appear to experience a connect failure if the vxWorks IP kernel reassigns the same ephemeral TCP port number as was assigned during a previous lifetime. The IP kernel on the vxWorks system hosting the CA server might have a stale entry for this ephemeral port that has not yet timed out which prevents the client from connecting with the ephemeral port assigned by the IP kernel. Eventually, after EPICS_CA_CONN_TMO seconds, the TCP connect sequence is aborted and the client library closes the socket, opens a new socket, receives a new ephemeral port assignment, and successfully connects."

The last sentence is only partially correct. The problem is that:
- vxWorks assigns these ephemeral port numbers in ascending numerical order
- It takes a very long time for the server IOC to kill the stale entries

Thus, if I reboot the client many times in a row, it does not result in just one disconnect before the successful connection, but many. I just did a test where I rebooted a vxWorks client IOC 11 times, as one might do when debugging IOC software. This IOC is running Marty's example sequence program, with 2 PVs connecting to a remote vxWorks server IOC.

Here is the amount of time elapsed before the sequence program PVs connected:

Reboot #   Time (sec)
       1          0.1
       2          5.7
       3         30
       4         60
       5         90
       6        120
       7         30
       8        150
       9        150
      10        180
      11        210

Here is the output of "casr" on the vxWorks server IOC (which never rebooted) after client reboot #11:

Channel Access Server V4.11
164.54.160.74:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1 ...


Jeff Hill (johill-lanl) wrote :

> > Here is a proposal for Jeff:
> >
> > Would it be possible to create a new function named something like
> > vxCAClientStopAll. This command would call close() on the CA
> > connections for all vxWorks CA clients, and hence would gracefully
> > close all of the socket connections on the server IOC.
> >
> > We could then make another new vxWorks command, "restart" which does
> > vxCAClientStopAll(); reboot();
>
> This is very awesome!!!
>
> Jeff can you implement this for the next EPICS RELEASE???
>
>
> Ernest
>

What Mark suggests is certainly a possible fix. If such a function were written, its name might be not vxCAClientStopAll() but rather ca_close_circuits_but_dont_shut_anything_else_down(), because if the rest of the CA infrastructure is not left in place, the db threads that are still using it will crash and potentially disrupt the orderly shutdown.

There are different perspectives on this. One perspective is that CA already has such functions, ca_clear_channel and ca_context_destroy, and that all that is needed is a function called dbStopAll that calls them ;-). There would be many advantages to such an approach. One of them is that devices could be shut down as well; for example, the Allen Bradley TCP/IP circuits might also need to be gracefully shut down.

Jeff Hill (johill-lanl) wrote :

From Mark Rivers,

> There are different perspectives on this. One perspective is that CA
> already has such functions, ca_clear_channel and ca_context_destroy,
> and that all that is needed is a function called dbStopAll that calls
> them ;-). There would be many advantages to such an approach. One of
> them would be that devices could be shutdown also. For example the
> Allen Bradley TCP/IP circuits might also need to be gracefully
> shutdown.

I like this suggestion, since in the case of the XPS controller I mentioned earlier, it is not a CA link to the XPS, but a socket opened with asyn. asyn or the driver needs to close that socket on shutdown in order to avoid the serious problems we are having on reboot.

Jeff Hill (johill-lanl) wrote :

TBOMK, you should not experience trouble if you call ca_task_exit() w/o deleting the channels in a non-IOC environment. I don't delete the channels (the application created them, so it has to delete them), but I do unlink them from the context (rendering them inactive).

On the IOC the same is true. The problem is that if you call ca_context_destroy w/o destroying the subscriptions then db_post_event might post subscription updates into event queues that have been destroyed.

There is code to unlink the channel from its service entity. I suppose that this problem might be fixed by adding code that unlinks the subscription from its service entity as well.

Andrew Johnson (anj) wrote :

Since R3.14.10 the IOC has included shutdown code that disconnects all dbCa channels and then calls ca_context_destroy(). This code requires a vxWorks BSP modification (search tech-talk for "sysAtReboot") to ensure that the epicsAtExit() routines get run before the vxWorks network stack closes down, and it obviously can't run if you hit the VME reset button instead of doing an orderly shutdown. However with this in place the description above is no longer accurate.

We need to know whether that change has solved the problems reported by this bug. I'm marking this bug as Incomplete until we get some feedback.

Changed in epics-base:
status: New → Incomplete
Jeff Hill (johill-lanl) wrote :

> Embedded systems do not usually "shutdown". They are
> simply rebooted.

It's true that any well-implemented system must recover from an ungraceful restart. Nevertheless, a well-implemented embedded system should also orchestrate a graceful shutdown so that responsible users can choose that option. Furthermore, if the user does not ask for a graceful restart, he can expect some additional delays before obtaining a fully functional system again.

> Furthermore it is likely that at each reboot the IOC
> behaves the same, in particular that it uses the same
> ports for the same connections

I only mention this because it seems plausible to me that the embedded IP kernel could be better implemented. The last ephemeral port assignment could be cached in flash so that we don't start over from the same base assignment, incrementing up from there, when the system is rebooted. Note that Eric Norum is on the distribution, and I understand that he was actually a force behind porting the IP kernel for RTEMS. So I was curious what opinions he might have on that issue.

> In my opinion the problem must be solved at connection,
> not at shutdown.

The error messages that you observed appear to indicate that the client library was waiting (a long time) for the IP kernel to connect the TCP circuit. Therefore, since CA is based on TCP, as the code is implemented this leaves the solution to the implementers of the IP kernel (but see my comments below).

> This is normal and worked fine with 3.13.
>

However, it does occur to me now that this observation could result from fundamental CA client library design differences between R3.13 and R3.14.

o In R3.13, all TCP circuits were abruptly disconnected and restarted if they remained unresponsive for more than EPICS_CA_CONN_TMO seconds. This is the "CA knows better" approach.

o In R3.14, the CA client library simply waits for inactive TCP circuits to return to an active state, or else to be disconnected by the IP kernel. This is the "TCP knows better" approach.

R3.14 is actually much more reliable (i.e. much less inclined towards positive congestion feedback) because it does not disconnect unresponsive circuits before TCP does.

Nevertheless, I recognize that R3.13 would possibly abandon an unresponsive, still-connecting circuit more rapidly, and that might very well be perceived by users as better behavior.

Presumably, vxWorks provides tuning parameters for its IP kernel that might allow users to choose the length of this type of delay. Admittedly, this would be global for all TCP circuits, not just CA circuits.

I updated the bug entry with this message

Jeff

Jeff Hill (johill-lanl) wrote :

Hello again Dirk,

I should add that this has been a good discussion. It's certainly possible that changes could be made to address this type of issue. However, I need to be conservative about such changes because I believe that the R3.14 aggregate EPICS system is, by design, much more reliable than the R3.13 aggregate system.

How often do these types of reboot reconnection issues occur? What is the typical delay before the system reconnects in such situations? And what is the maximum delay before a system has reconnected in such situations?

I will update the bug report with your answers.

tags: added: vxworks
Changed in epics-base:
assignee: nobody → Andrew Johnson (anj)
Andrew Johnson (anj)
Changed in epics-base:
status: Incomplete → Won't Fix