radiation damaged IOC causes other IOCs to crash

Bug #541295 reported by Jeff Hill
Affects: EPICS Base
Status: Invalid
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

From Rolf Keitel,

Last night we had another event where a few IOCs went down.
Unfortunately, I again have no stack traces.
To give you some context:

The IOC with 142.90.132.24 is an MV162, which is sometimes exposed to more radiation than was anticipated (we are working on moving this IOC, but this will take a while, so we have to limp along). This IOC showed memory errors (we run a background checker looking at SRAM) and went down.

As a consequence, two other IOCs, which had CA connections to the MV162, went down. Both are Intel architecture: the first is a Pentium, the second is a PC104.
VxWorks 5.5 EPICS 3.13.10
Attached is a console output from these IOCs. Is there anything meaningful to you?

- rolf -

****************** FIRST IOC (Pentium) *************************************
May 30, 2006 02:34:59.046020138
../iocinf.c: bad UDP msg from localhost:1029
../iocinf.c: bad UDP msg from localhost:1032
../iocinf.c: bad UDP msg from localhost:1035
May 30, 2006 02:44:58.906059823
May 30, 2006 02:54:58.794915970
May 30, 2006 03:04:58.669456944
May 30, 2006 03:14:58.544155516
May 30, 2006 03:24:58.419615856
May 30, 2006 03:34:58.277683417
dbCa:exceptionCallback stat "Network connection lost" channel "unknown" context "142.90.132.24:9004"
 nativeType DBR_invalid requestType DBR_invalid nativeCount 0 requestCount 0 noReadAccess noWriteAccess
May 30, 2006 03:44:58.169008218
May 30, 2006 03:54:58.043263949
May 30, 2006 04:04:57.903054195
../iocinf.c: bad UDP msg from localhost:1029
../iocinf.c: bad UDP msg from localhost:1032
../iocinf.c: bad UDP msg from localhost:1035
May 30, 2006 04:14:57.793193687
dbCa:exceptionCallback stat "Network connection lost" channel "unknown" context "142.90.132.24:9004"
 nativeType DBR_invalid requestType DBR_invalid nativeCount 0 requestCount 0 noReadAccess noWriteAccess
May 30, 2006 04:24:57.667663959
May 30, 2006 04:34:57.530041366
dbCa:exceptionCallback stat "Network connection lost" channel "unknown" context "142.90.132.24:9004"
 nativeType DBR_invalid requestType DBR_invalid nativeCount 0 requestCount 0 noReadAccess noWriteAccess
CA.Client.Diagnostic..............................................
    Message: "Network connection lost"
    Severity: "Warning" Context: "142.90.132.24:9004"
    Source File: ../iocinf.c Line Number: 1529
..................................................................
asCa:exceptionCallback stat Network connection lost channel unknown
CAC: unexpected select fail: 851971=S_iosLib_INVALID_FILE_DESCRIPTOR
0xf9211dc (CA_UDP): CAS: UDP send to "142.90.132.24:1031" failed because "S_errno_EHOSTDOWN"

Page Fault
Page Dir Base : 0x0fef6000
Esp0 0x0f921064 : 0x0f92119c, 0x0f92118c, 0x0f92110c, 0x00003fb0
Esp0 0x0f921074 : 0x00000000, 0x00000000, 0x000000d8, 0x003ae434
Program Counter : 0x0fc409f9
Code Selector : 0x00000008
Eflags Register : 0x00010292
Error Code : 0x00000003
Page Fault Addr : 0x0fef886d
Task: 0xf9211dc "CA_UDP"

Page Fault
Page Dir Base : 0x0fef6000
Esp0 0x0fe1a454 : 0x0f91e1ec, 0x00000001, 0x0fe1a48c, 0x0034ff6f
Esp0 0x0fe1a464 : 0x7852e10f, 0x0fdc7600, 0x00000044, 0x003b5673
Program Counter : 0x003b3530
Code Selector : 0x00000008
Eflags Register : 0x00010206
Error Code : 0x00000000
Page Fault Addr : 0x7852e163
Task: 0xfe1a6ec "tNetTask"

Page Fault
Page Dir Base : 0x0fef6000
Esp0 0x0fd2c02c : 0x0fd2c06c, 0x0fc1f8e1, 0x0f97cb64, 0x0fa5a794
Esp0 0x0fd2c03c : 0x00000008, 0x0fcb1a7c, 0x0f97cb64, 0x00000000
Program Counter : 0x0fc1e400
Code Selector : 0x00000008
Eflags Register : 0x00010282
Error Code : 0x00000000
Page Fault Addr : 0xac0fd328
Task: 0xfd2c1fc "cbLow"
0xfb2254c ('eRnacvTask): callbackRequest ring buffer full

****************** SECOND IOC (PC104) *************************************

May 30, 2006 04:34:45.009088336
0x3a5ce18 (CA_UDP): CAS: UDP send to "142.90.132.24:1031" failed because "S_errno_EHOSTDOWN"

Page Fault
Page Dir Base : 0x03fd6000
Esp0 0x03a50da8 : 0x00000202, 0x0fda101b, 0x03a2c748, 0x03a50de4
Esp0 0x03a50db8 : 0x001acf2b, 0xf3c8f7bd, 0x03a347a8, 0x00000010
Program Counter : 0x00138fa9
Code Selector : 0x00000008
Eflags Register : 0x00010202
Error Code : 0x03030000
Page Fault Addr : 0xf3c8f7bd
Task: 0x3a50ed4 "RD_dbCaLink"

asCaTask: A call to "assert (sendCnt<=piiu->send.max_msg)" failed in ../iocinf.c at 956
Please send a copy of the output from "tt (0x3ac45f8)" and a copy of this message
to the author or "<email address hidden>"
This problem oc
curred in "@(#)VersPage Faultion R3.13.
10 $2004/04/15 13:3Page Dir Base : 0x06:02$"
3fd6000
Esp0 0x03a5cca0 : 0x03a5cd78, 0x03a5cdc8, 0x03a5cd48, 0x00003fb0
Esp0 0x03a5ccb0 : 0x00000000, 0x00000000, 0x00000018, 0x00183064
Program Counter : 0x03c693b8
Code Selector : 0x00000008
Eflags Register : 0x00010216
Error Code : 0x00000003
Page Fault Addr : 0x03fda000
Task: 0x3a5ce18 "CA_UDP"

Page Fault
Page Dir Base : 0x03fd6000
Esp0 0x03ae7680 : 0x03c65ffe, 0xb0030000, 0xffffffff, 0x00000000
Esp0 0x03ae7690 : 0x00000061, 0x03d08838, 0x03d08b0c, 0x00000000
Program Counter : 0x0013b9e3
Code Selector : 0x00000008
Eflags Register : 0x00010246
Error Code : 0x03c50000
Page Fault Addr : 0xb0030004
Task: 0x3ae787c "scan60"

Page Fault
Page Dir Base : 0x03fd6000
Esp0 0x03e6e014 : 0x00144f70, 0x03f88348, 0x03e6e04c, 0x00171e74
Esp0 0x03e6e024 : 0x03f8d0ec, 0x03f8c24c, 0x03f8c270, 0x5f0ca000
Program Counter : 0x001450c0
Code Selector : 0x00000008
Eflags Register : 0x00010246
Error Code : 0x03e60000
Page Fault Addr : 0xe603e605
Task: 0x3e6e4fc "tNetTask"

Original Mantis Bug: mantis-262
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=262

Tags: 3.13 cleanup
Jeff Hill (johill-lanl) wrote :

Very sick IOCs. One could speculate that a very badly damaged IP frame went undetected during CRC validation, because a CRC can detect only a limited number of damaged bits. Carrying that scenario further, perhaps the damaged frame caused the IP kernel to fail catastrophically, or a similar fate befell the CA server. At some point I added a number of robustness upgrades to R3.13 to make the CA server better at surviving that type of situation. Since you have R3.13.10 you should have that upgrade (which was produced in response to bad behavior I saw here on the LEDA project when Ethernet 10/100 auto-negotiation failed between the switch and the IOC).
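
As a rough illustration of how damaged data can pass an integrity check, here is a minimal sketch using the 16-bit one's-complement Internet checksum (RFC 1071) carried in IP and UDP headers. It is swapped in for brevity and is not the Ethernet CRC-32 mentioned above, which is considerably stronger; the payload bytes are invented for the example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = bytes.fromhex("12345678")
damaged = bytes.fromhex("56781234")  # same 16-bit words, swapped order

# Both payloads produce the same checksum, so the damage goes unnoticed.
print(hex(internet_checksum(original)), hex(internet_checksum(damaged)))
```

Because the sum is order-independent, reordered 16-bit words always collide; whether anything like this happened on the affected IOCs is pure speculation.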

The next time this occurs please run a tt <task id> on the afflicted threads. For example to look at "tNetTask" below type "tt 0x3e6e4fc" and email the result. This produces stack trace information which system programmers can frequently utilize to determine what was going on when the system failed. Understanding what exactly occurred can be essential when creating a fix.
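
If the console output is being captured to a file, the task IDs can be pulled out mechanically. The following is a small sketch, assuming the "Task: <id> "<name>"" line format seen in the dumps in this report; the function name tt_commands is made up for the example, and the resulting commands still have to be typed at the vxWorks shell of the affected IOC.

```python
import re

# Matches lines such as:  Task: 0x3e6e4fc "tNetTask"
TASK_LINE = re.compile(r'Task:\s*(0x[0-9a-fA-F]+)\s+"([^"]+)"')


def tt_commands(console_text):
    """Return (task name, 'tt <id>') pairs, one per faulted task, in log order."""
    seen = set()
    commands = []
    for task_id, name in TASK_LINE.findall(console_text):
        if task_id not in seen:  # skip tasks already listed
            seen.add(task_id)
            commands.append((name, "tt " + task_id))
    return commands


sample = '''
Task: 0x3a5ce18 "CA_UDP"
Task: 0x3ae787c "scan60"
Task: 0x3e6e4fc "tNetTask"
'''

for name, command in tt_commands(sample):
    print(name, "->", command)
```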

> Page Fault
> Page Dir Base : 0x03fd6000
> Esp0 0x03e6e014 : 0x00144f70, 0x03f88348, 0x03e6e04c, 0x00171e74
> Esp0 0x03e6e024 : 0x03f8d0ec, 0x03f8c24c, 0x03f8c270, 0x5f0ca000
> Program Counter : 0x001450c0
> Code Selector : 0x00000008
> Eflags Register : 0x00010246
> Error Code : 0x03e60000
> Page Fault Addr : 0xe603e605
> Task: 0x3e6e4fc "tNetTask"

The "tNetTask" does not, to the best of my knowledge, run any EPICS code. So my reading is that either EPICS is corrupting the IP kernel's data structures or a rogue IP message on the LAN has caused the vxWorks IP kernel to corrupt itself.
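
The "Error Code" field in these dumps can give a hint about the kind of access that faulted. Below is a minimal sketch that decodes the three low bits of an x86 page-fault error code (bit 0: present, bit 1: write, bit 2: user mode). It assumes the console prints the raw hardware error code and ignores the upper bits, some of which are non-zero in the PC104 dumps for reasons not known here.

```python
def decode_page_fault(error_code):
    """Describe the low three bits of an x86 page-fault error code."""
    present = error_code & 0x1  # 1: protection violation, 0: page not present
    write = error_code & 0x2    # 1: write access, 0: read access
    user = error_code & 0x4     # 1: user mode, 0: supervisor mode
    cause = "protection violation" if present else "page not present"
    access = "write" if write else "read"
    mode = "user" if user else "supervisor"
    return access + " in " + mode + " mode, " + cause


# Error codes copied from the CA_UDP and tNetTask dumps of the first IOC.
print(decode_page_fault(0x00000003))
print(decode_page_fault(0x00000000))
```

This does not identify the culprit; it only narrows down what the faulting instruction was trying to do.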

> CAC: unexpected select fail: 851971=S_iosLib_INVALID_FILE_DESCRIPTOR

Possibly means that the OS or CA server's data structures got clobbered.
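
For reference, the number 851971 in that message can be split by hand. This sketch assumes the usual vxWorks status layout, with a module number in the upper 16 bits and a module-specific error number in the lower 16 bits; that module 13 belongs to iosLib is inferred from the symbolic name printed in the log, not looked up in the headers.

```python
def split_vxworks_status(status):
    """Split a 32-bit status into (module number, module-specific error)."""
    return status >> 16, status & 0xFFFF


module, error = split_vxworks_status(851971)
print(hex(851971), module, error)
```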

> asCaTask: A call to "assert (sendCnt<=piiu->send.max_msg)" failed

Possibly a sign of corruption of the CA server's data structures.

> Page Fault
> Page Dir Base : 0x0fef6000
> Esp0 0x0fd2c02c : 0x0fd2c06c, 0x0fc1f8e1, 0x0f97cb64, 0x0fa5a794
> Esp0 0x0fd2c03c : 0x00000008, 0x0fcb1a7c, 0x0f97cb64, 0x00000000
> Program Counter : 0x0fc1e400
> Code Selector : 0x00000008
> Eflags Register : 0x00010282
> Error Code : 0x00000000
> Page Fault Addr : 0xac0fd328
> Task: 0xfd2c1fc "cbLow"

Here we have a thread that does not even use the network going south in response to a network initiated failure. Sounds suspiciously like generalized corruption.

> CA.Client.Diagnostic..............................................
> Message: "Network connection lost"
> Severity: "Warning" Context: "142.90.132.24:9004"
> Source File: ../iocinf.c Line Number: 1529
> ..................................................................
> asCa:exceptionCallback stat Network connection lost channel unknown

Symptom of IP kernel failure (this thread didn’t page fault yet it has detected that use of its socket produces failed status).

> 0xf9211dc (CA_UDP): CAS: UDP send to "142.90.132.24:1031" failed
> because "S_errno_EHOSTDOWN"

Symptom of IP kernel failure (this thread didn’t page fault yet it has detected that use of its socket pro...

Jeff Hill (johill-lanl) wrote :

It is interesting that both IOCs that crashed are Pentium-based. Do they have the same network interface driver? If so, upgrading to the latest driver from WRS might be a good bet.

Andrew Johnson (anj)
Changed in epics-base:
status: New → Incomplete
Andrew Johnson (anj)
Changed in epics-base:
importance: High → Low
tags: added: cleanup
Changed in epics-base:
status: Incomplete → Invalid