A call to 'assert (item.pList == this)' failed in ../../../../src/cas/generic/st/ioBlocked.cc line 112

Bug #541377 reported by Jeff Hill
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
High
Jeff Hill

Bug Description

We have been running the new gateway 2.0 on epics 3.14.8.2 for a month or more with good results. It has been stable even with large transfers.

However last evening a problem occurred with the gateway and it had to be restarted. Unfortunately the only information I have is the following gateway log.
Can anyone hazard a guess as to what was going on ?

We are running the gateway on Redhat Enterprise Linux 4.
Also we are running the alarm handler through the gateway too which I understand from previous Techtalk emails is a bad thing (DBE_VALUE v DBE_ALARM events subscription conflict) and something we have to address. Could this have caused the problem?

Pete Leicester
Diamond Light Source

------------------------------------------

Feb 27 09:01:02 PV Gateway Version 2.0.0.0 [Jan 9 2006 11:30:43]
EPICS 3.14.8.2 PID=2598 ServerPID=2597
EPICS_CA_ADDR_LIST=172.23.255.255
EPICS_CA_AUTO_ADDR_LIST=NO
EPICS_CA_SERVER_PORT=5064
EPICS_CA_MAX_ARRAY_BYTES=3000000
EPICS_CAS_INTF_ADDR_LIST=172.23.194.129
EPICS_CAS_SERVER_PORT=5064
EPICS_CAS_IGNORE_ADDR_LIST=172.23.194.129
Running as user gate on host serv0013.cs.diamond.ac.uk
Statistics PV prefix is CtoP:serv0013
Feb 28 14:42:43 Warning: Virtual circuit unresponsive 172.23.248.87:5064
Unexpected problem with CA circuit to server "172.23.248.87:5064" was "No route to host" - disconnecting

Feb 28 14:58:03 !!! Errlog message received (message is above)
Feb 28 14:58:03 Warning: Virtual circuit disconnect 172.23.248.87:5064
Feb 28 15:36:41 Warning: Virtual circuit unresponsive 172.23.248.60:5064
CAS: TCP socket send to "172.23.198.8:32805" failed because "No route to host"

Feb 28 15:50:15 !!! Errlog message received (message is above)
Unexpected problem with CA circuit to server "172.23.248.60:5064" was "No route to host" - disconnecting

Feb 28 15:52:01 !!! Errlog message received (message is above)
Feb 28 15:52:01 Warning: Virtual circuit disconnect 172.23.248.60:5064
Feb 28 16:15:46 Warning: Virtual circuit unresponsive 172.23.248.109:5064
CAS: TCP socket send to "172.23.198.8:32813" failed because "No route to host"

Feb 28 16:16:40 !!! Errlog message received (message is above)
Unexpected problem with CA circuit to server "172.23.248.109:5064" was "No route to host" - disconnecting

Feb 28 16:31:07 !!! Errlog message received (message is above)
Feb 28 16:31:07 Warning: Virtual circuit disconnect 172.23.248.109:5064
Mar 01 10:41:06 Warning: Virtual circuit unresponsive 172.23.248.25:5064
Mar 01 10:41:07 Warning: Virtual circuit unresponsive 172.23.248.27:5064
Mar 01 10:41:09 Warning: Virtual circuit unresponsive 172.23.248.83:5064
Mar 01 10:41:11 Warning: Virtual circuit unresponsive 172.23.248.195:5064
Mar 01 10:41:15 Warning: Virtual circuit unresponsive 172.23.248.164:5064
Mar 01 10:41:15 Warning: Virtual circuit unresponsive 172.23.248.24:5064
Mar 01 10:41:16 Warning: Virtual circuit unresponsive 172.23.248.144:5064
Unexpected problem with CA circuit to server "172.23.248.25:5064" was "No route to host" - disconnecting

Mar 01 10:56:27 !!! Errlog message received (message is above)
Mar 01 10:56:27 Warning: Virtual circuit disconnect 172.23.248.25:5064
Unexpected problem with CA circuit to server "172.23.248.83:5064" was "No route to host" - disconnecting

Mar 01 10:56:29 !!! Errlog message received (message is above)
Mar 01 10:56:29 Warning: Virtual circuit disconnect 172.23.248.83:5064
CAS: client 172.23.198.11:32812 disconnected because "No route to host"

Mar 01 10:56:32 !!! Errlog message received (message is above)
Unexpected problem with CA circuit to server "172.23.248.195:5064" was "No route to host" - disconnecting

Mar 01 10:56:32 !!! Errlog message received (message is above)
Mar 01 10:56:32 Warning: Virtual circuit disconnect 172.23.248.195:5064
Unexpected problem with CA circuit to server "172.23.248.164:5064" was "No route to host" - disconnecting

Mar 01 10:56:36 !!! Errlog message received (message is above)
Mar 01 10:56:36 Warning: Virtual circuit disconnect 172.23.248.164:5064
Unexpected problem with CA circuit to server "172.23.248.24:5064" was "No route to host" - disconnecting

Mar 01 10:56:36 !!! Errlog message received (message is above)
Mar 01 10:56:36 Warning: Virtual circuit disconnect 172.23.248.24:5064
Unexpected problem with CA circuit to server "172.23.248.144:5064" was "No route to host" - disconnecting

Mar 01 10:56:37 !!! Errlog message received (message is above)
Mar 01 10:56:37 Warning: Virtual circuit disconnect 172.23.248.144:5064
Unexpected problem with CA circuit to server "172.23.248.27:5064" was "No route to host" - disconnecting

Mar 01 10:56:44 !!! Errlog message received (message is above)
Mar 01 10:56:44 Warning: Virtual circuit disconnect 172.23.248.27:5064
CAS: client 172.23.200.84:1090 disconnected because "No route to host"

Mar 01 10:57:00 !!! Errlog message received (message is above)
Mar 01 16:54:11 Warning: Virtual circuit unresponsive 172.23.248.24:5064
Mar 01 16:54:32 Warning: Virtual circuit unresponsive 172.23.248.27:5064
Mar 01 16:54:40 Warning: Virtual circuit unresponsive 172.23.248.83:5064
Unexpected problem with CA circuit to server "172.23.248.27:5064" was "Connection reset by peer" - disconnecting

Mar 01 16:54:54 !!! Errlog message received (message is above)
Mar 01 16:54:54 Warning: Virtual circuit disconnect 172.23.248.27:5064
Mar 01 16:54:57 Warning: Virtual circuit unresponsive 172.23.248.195:5064
Unexpected problem with CA circuit to server "172.23.248.24:5064" was "Connection reset by peer" - disconnecting

Mar 01 16:54:58 !!! Errlog message received (message is above)
Mar 01 16:54:58 Warning: Virtual circuit disconnect 172.23.248.24:5064
Unexpected problem with CA circuit to server "172.23.248.195:5064" was "Connection reset by peer" - disconnecting

Mar 01 16:55:18 !!! Errlog message received (message is above)
Mar 01 16:55:18 Warning: Virtual circuit disconnect 172.23.248.195:5064
Unexpected problem with CA circuit to server "172.23.248.83:5064" was "Connection reset by peer" - disconnecting

Mar 01 16:55:26 !!! Errlog message received (message is above)
Mar 01 16:55:26 Warning: Virtual circuit disconnect 172.23.248.83:5064
Unexpected problem with CA circuit to server "172.23.252.107:5064" was "Connection reset by peer" - disconnecting

Mar 01 17:06:40 !!! Errlog message received (message is above)
Mar 01 17:06:40 Warning: Virtual circuit disconnect 172.23.252.107:5064
Unexpected problem with CA circuit to server "172.23.248.195:5064" was "Connection reset by peer" - disconnecting

Mar 01 17:18:05 !!! Errlog message received (message is above)
Mar 01 17:18:05 Warning: Virtual circuit disconnect 172.23.248.195:5064
Unexpected problem with CA circuit to server "172.23.248.195:5064" was "Connection reset by peer" - disconnecting

Mar 01 17:20:37 !!! Errlog message received (message is above)
Mar 01 17:20:37 Warning: Virtual circuit disconnect 172.23.248.195:5064

A call to "assert (item.pList == this)" failed in ../../../../src/cas/generic/st/ioBlocked.cc line 112.

Mar 01 18:30:04 !!! Errlog message received (message is above)
EPICS Release EPICS R3.14.8.2 $R3-14-8-2$ $2006/01/06 15:55:13$.

Mar 01 18:30:04 !!! Errlog message received (message is above)
Current time Wed Mar 01 2006 18:30:04.20652000.

Mar 01 18:30:04 !!! Errlog message received (message is above)
Please E-mail this message to Jeff Hill <email address hidden> or to <email address hidden>

Mar 01 18:30:04 !!! Errlog message received (message is above)
Calling epicsThreadSuspendSelf()

Mar 01 18:30:04 !!! Errlog message received (message is above)
Mar 01 19:28:11 PV Gateway Ending (SIGTERM)

------------------------------------------
Mr P J Leicester
Senior Software Engineer
Diamond Light Source Ltd.
Rutherford Appleton Laboratory
Chilton
Didcot
OX11 0QX

Tel: 01235 778478

Original Mantis Bug: mantis-348
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=348

Tags: cas 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

Our gateway just crashed and here is the log message, as instructed. Please let me know if you need more details.
thanks,
Judy Rock, et al

Jun 25 11:57:15 PV Gateway Version 2.0.3.0 [Mar 26 2009 15:41:28] EPICS 3.14.10-LCLS-2 PID=25341 ServerPID=25340
EPICS_CA_ADDR_LIST=172.21.40.21
EPICS_CA_AUTO_ADDR_LIST=NO
EPICS_CA_SERVER_PORT=5070
EPICS_CA_MAX_ARRAY_BYTES=80000000
EPICS_CAS_INTF_ADDR_LIST=lcls-daemon3
EPICS_CAS_SERVER_PORT=5068
EPICS_CAS_IGNORE_ADDR_LIST=Not specified Running as user laci on host lcls-daemon3 Statistics PV prefix is GWERA call to "assert (item.pList == this)" failed in ../../../../src/cas/generic/st/ioBlocked.cc line 112.

Jun 29 09:26:07 !!! Errlog message received (message is above) EPICS Release EPICS R3.14.10-LCLS-2 $R3-14-10$ $2008/10/27 19:39:04$.

Jun 29 09:26:07 !!! Errlog message received (message is above) Current time Mon Jun 29 2009 09:26:07.477543000.

Jun 29 09:26:07 !!! Errlog message received (message is above) Please E-mail this message to Jeff Hill <email address hidden> or to <email address hidden>

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Not in the Pvgateways that I ran for PEP. They might not be so useful for comparison because there were an older gateway and epics version on a different (solaris) platform. I stopped closely tracking those gateways when we turned off PEP about 15 months ago.

-----Original Message-----
From: Rock, Judith E.
Sent: Monday, June 29, 2009 10:52 AM
To: 'Jeff Hill'
Cc: Zhou, Jingchen; Lahey, Terri E.; Boyes, Matthew; Allison, Stephanie
Subject: RE: Gateway internal error and crash, and here's the log message

Hi -
We just brought this gateway up in production on 6/25 - these are its first log entries. I'm not aware of it happening with other gateways - Terri and Steph, have you seen the error before (see below)?
thanks,
Judy

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I am starting to suspect that this is the scenario of the crash
---------------------------------------------------------------
o service returns S_casApp_postponeAsyncIO from the IO interface
o casCoreClient is installed into IO blocked list of the relevant PV
o another TCP frame arrives
o casStreamOS::processInput() is called
o the request protocol is reprocessed
o this time the service unexpectedly returns S_casApp_success from the IO interface despite the fact that no asynchronous IO has completed since the last call to this interface
o the server lib goes on to a subsequent request
o service returns S_casApp_postponeAsyncIO from the IO interface of a different PV
o the casCoreClient tries to simultaneously be in the IO blocked list of two different PVs => assert fail

edited on: 2009-07-09 11:08

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I committed this fix

cvs diff -wb (in directory C:\hill\R3.14.dll_hell_fix\epics\base\src\cas\generic\)
Index: casStrmClient.cc
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/cas/generic/casStrmClient.cc,v
retrieving revision 1.92.2.20
diff -u -b -w -b -r1.92.2.20 casStrmClient.cc
--- casStrmClient.cc 9 Jul 2009 15:27:41 -0000 1.92.2.20
+++ casStrmClient.cc 9 Jul 2009 17:20:53 -0000
@@ -117,6 +117,13 @@
     epicsGuard < casClientMutex > guard ( this->mutex );
  int status = S_cas_success;

+ // protect against service returning s_casApp_success when it
+ // returned S_casApp_postponeAsyncIO before, but no
+ // asyn IO completed since the last attempt
+ if ( this->isBlocked () ) {
+ return S_casApp_postponeAsyncIO;
+ }
+
     try {

         // drain message that does not fit
Index: ioBlocked.h
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/cas/generic/ioBlocked.h,v
retrieving revision 1.1.2.1
diff -u -b -w -b -r1.1.2.1 ioBlocked.h
--- ioBlocked.h 5 Aug 2005 20:41:45 -0000 1.1.2.1
+++ ioBlocked.h 9 Jul 2009 17:22:52 -0000
@@ -36,6 +36,7 @@
 public:
  ioBlocked ();
  virtual ~ioBlocked ();
+ bool isBlocked ();
 private:
  class ioBlockedList * pList;
  virtual void ioBlockedSignal ();
@@ -52,5 +53,10 @@
  ioBlockedList & operator = ( const ioBlockedList & );
 };

+inline bool ioBlocked :: isBlocked ()
+{
+ return this->pList != NULL;
+}
+
 #endif // ioBlockedh

Index: st/casStreamOS.cc
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/cas/generic/st/casStreamOS.cc,v
retrieving revision 1.38.2.3
diff -u -b -w -b -r1.38.2.3 casStreamOS.cc
--- st/casStreamOS.cc 9 Jul 2009 15:27:41 -0000 1.38.2.3
+++ st/casStreamOS.cc 9 Jul 2009 16:21:53 -0000
@@ -217,6 +217,7 @@
 //
 epicsTimerNotify::expireStatus casStreamIOWakeup::expire ( const epicsTime & /* currentTime */ )
 {
+ assert ( this->pOS );
     casStreamOS & tmpOS = *this->pOS;
     this->pOS = 0;
  tmpOS.processInput();
Index: st/ioBlocked.cc
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/cas/generic/st/ioBlocked.cc,v
retrieving revision 1.10
diff -u -b -w -b -r1.10 ioBlocked.cc
--- st/ioBlocked.cc 12 Feb 2003 19:06:13 -0000 1.10
+++ st/ioBlocked.cc 9 Jul 2009 17:36:56 -0000
@@ -69,9 +69,7 @@
 //
 ioBlockedList::~ioBlockedList ()
 {
- ioBlocked *pB;
-
- while ( (pB = this->get ()) ) {
+ for ( ioBlocked * pB = this->get (); pB; pB = this->get () ) {
         pB->pList = NULL;
     }
 }
@@ -86,14 +84,13 @@
 void ioBlockedList::signal ()
 {
     tsDLList<ioBlocked> tmp;
- ioBlocked *pB;

     //
     // move all of the items onto tmp
     //
     tmp.add(*this);

- while ( (pB = tmp.get ()) ) {
+ for ( ioBlocked * pB = tmp.get (); pB; pB = tmp.get () ) {
         pB->pList = NULL;
         pB->ioBlockedSignal ();
     }

***** CVS exited normally with code 1 *****

Revision history for this message
Jeff Hill (johill-lanl) wrote :

fixed in r3.14.11

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.11 released.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.