local caput causes ioc crash on win32

Reported by Jeff Hill on 2011-02-11
This bug affects 2 people
Affects: EPICS Base
Importance: Medium
Assigned to: Jeff Hill

Bug Description

From Carsten Winkler:

Problem: softIoc.exe runs into a fatal exception after a caput call from the local host.
When running the test, there is a 40% chance for this exception to occur.
Waiting between starting the softIoc and caput, and/or waiting between caput calls, does not
change the behavior.

Error message: "softIoc has encountered a problem and needs to close. We are sorry for the
inconvenience."

Error details: AppName: softioc.exe; AppVer:0.0.0.0; ModName: com.dll; ModVer: 3.14.12.0;
Offset: 0000a613

System: EPICS 3.14.12 plus all published patches (8 Feb. 2011), no local changes
(compiled with Microsoft Visual Studio 2010 Professional, without any errors)

Host: Windows XP SP3 @ Pentium 4 with 3.2GHz and 3GB RAM
      AND Windows XP SP3 @ VMWARE 3.5.0
      AND Windows XP SP3 @ VIRTUALBOX 3.2.12
      AND Windows XP SP3 @ KVM

Test setup: The following configuration files have been used:

startDemo.bat:
     set EPICS_CA_ADDR_LIST=localhost
     set EPICS_CA_AUTO_ADDR_LIST=NO
     start DemoIOC.cmd
     caput demoHost:double1 3.141
     caput demoHost:double2 2.718
     caput demoHost:long 12345

DemoIOC.cmd:
     softIoc.exe -D dbd\softIoc.dbd -d db\demo.db

demo.db:
     record(ao, "demoHost:double1")
     {
         field(DESC, "Double output with range infos")
         field(EGU, "mm")
         field(HOPR, "10")
         field(LOPR, "0")
         field(HIHI, "8")
         field(HIGH, "6")
         field(LOW, "4")
         field(LOLO, "2")
         field(HHSV, "MAJOR")
         field(HSV, "MINOR")
         field(LSV, "MINOR")
         field(LLSV, "MAJOR")
     }
     record(ao, "demoHost:double2")
     {
         field(DESC, "Double output without range infos")
         field(EGU, "nm")
     }
     record(longout, "demoHost:long")
     {
         field(DESC, "Long output without range infos")
         field(EGU, "m")
     }

Call stack:
Com.dll!ellDelete(ELLLIST * pList=0x00a22ed0, ELLNODE * pNode=0x00cf0d08) Line 82 + 0xb bytes C
Com.dll!epicsParmCleanupWIN32(epicsThreadOSD * pParm=0x00cf0d08) Line 246 + 0x10 bytes C
Com.dll!epicsWin32ThreadEntry(void * lpParameter=0x00cf0d08) Line 516 + 0x9 bytes C
msvcr100d.dll!_callthreadstartex() Line 314 + 0xf bytes C
msvcr100d.dll!_threadstartex(void * ptd=0x00cf1500) Line 297 C
kernel32.dll!_BaseThreadStart@8() + 0x37 bytes

exception occurred here:
void ellDelete (ELLLIST *pList, ELLNODE *pNode)
{
     if (pList->node.previous == pNode)
         pList->node.previous = pNode->previous;
     else
         pNode->next->previous = pNode->previous; <== "pNode->next" is a NULL pointer in the error case! (see memory map)

     if (pList->node.next == pNode)
         pList->node.next = pNode->next;
     else
         pNode->previous->next = pNode->next;

     pList->count--;

     return;
}
This function was called from "static void epicsParmCleanupWIN32 ( win32ThreadParam * pParm )" of
osdThread.c

memory maps:
- pList 0x00a22ed0 {node={...} count=23 } ELLLIST *
     - node {next=0x00a24728 previous=0x00cf0c60 } ELLNODE
         - next 0x00a24728 {next=0x00a26340 previous=0x00000000 } ELLNODE *
             - next 0x00a26340 {next=0x00adbd90 previous=0x00a24728 } ELLNODE *
[... 20 thread parameter blocks in this list without address 0x00cf0d08]
                    - next 0x00cf0c60 {next=0x00000000 previous=0x00cf0de0 } ELLNODE *
                        - next 0x00000000 {next=??? previous=??? } ELLNODE *
                            next CXX0030: Error: expression cannot be evaluated
                            previous CXX0030: Error: expression cannot be evaluated
         + previous 0x00cf0de0 {next=0x00cf0c60 previous=0x00cf0b40 } ELLNODE *
             + previous 0x00cf0b40 {next=0x00cf0de0 previous=0x00af1ac0 } ELLNODE *
[... 20 thread parameter blocks in this list without address 0x00cf0d08]
     count 23 int

- pNode 0x00cf0d08 {next=0x00000000 previous=0x00000000 } ELLNODE *
     + next 0x00000000 {next=??? previous=??? } ELLNODE *
         next CXX0030: Error: expression cannot be evaluated
         previous CXX0030: Error: expression cannot be evaluated
     + previous 0x00000000 {next=??? previous=??? } ELLNODE *
         next CXX0030: Error: expression cannot be evaluated
         previous CXX0030: Error: expression cannot be evaluated

0x00CF0C80 00 00 00 00 43 41 53 2d 65 76 65 6e 74 00 fd fd fd fd dd dd dd dd dd dd 09 00 0c 00 a1
01 0c 02 a0 12 ....CAS-event.ýýýýÝÝÝÝÝÝ....¡... .
0x00CF0CA2 cf 00 e8 0c cf 00 00 00 00 00 00 00 00 00 18 00 00 00 01 00 00 00 fd 26 00 00 fd fd fd
fd 00 00 00 00 Ï.è.Ï.................ý&..ýýýý....

It looks like there is a situation (with a 40% chance) in which the thread parameter block of a
CAS-event task is not added to the thread list when the client connects. When the client
disconnects and the thread is shut down, the thread tries to remove its parameter block from the
thread list by calling ellDelete() with a node that is not in the list. ellDelete() is fragile in
this case and crashes the softIoc by dereferencing a null pointer. This exception seems to occur
only when caput has been called from the local host.
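
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the EPICS implementation: Node, List, list_add, list_contains and list_delete_checked are hypothetical names, and the node layout only mirrors the next/previous fields visible in the memory map. It shows a delete that verifies membership first, so a node that was never added is rejected instead of leading to a null pointer dereference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for ELLNODE / ELLLIST; not the EPICS types. */
typedef struct Node {
    struct Node *next;
    struct Node *previous;
} Node;

typedef struct List {
    Node *first;
    Node *last;
    int count;
} List;

/* Append a node to the tail of the list. */
static void list_add(List *list, Node *node)
{
    assert(list != NULL && node != NULL);
    node->next = NULL;
    node->previous = list->last;
    if (list->last != NULL)
        list->last->next = node;
    else
        list->first = node;
    list->last = node;
    list->count++;
}

/* Linear search: returns 1 if node is linked into list, otherwise 0. */
static int list_contains(const List *list, const Node *node)
{
    const Node *cur;
    for (cur = list->first; cur != NULL; cur = cur->next) {
        if (cur == node)
            return 1;
    }
    return 0;
}

/* Unlink node only if it really is a member of list.
 * Returns 0 on success, -1 if the node was not in the list. */
static int list_delete_checked(List *list, Node *node)
{
    if (!list_contains(list, node))
        return -1;  /* an unchecked delete would follow a NULL link here */

    if (node->previous != NULL)
        node->previous->next = node->next;
    else
        list->first = node->next;

    if (node->next != NULL)
        node->next->previous = node->previous;
    else
        list->last = node->previous;

    node->next = node->previous = NULL;
    list->count--;
    return 0;
}
```

The membership check costs a linear scan per delete, so it is better suited as a debugging aid than as a substitute for making sure the node is added in the first place.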

Jeff Hill (johill-lanl) wrote :

this bug occurs only on win32

Changed in epics-base:
assignee: nobody → Jeff Hill (johill-lanl)
importance: Undecided → Medium
status: New → Confirmed
tags: added: thread win32
Jeff Hill (johill-lanl) wrote :

committed this fix

>bzr diff
=== modified file 'src/libCom/osi/os/WIN32/osdThread.c'
--- src/libCom/osi/os/WIN32/osdThread.c 2011-01-15 01:00:02 +0000
+++ src/libCom/osi/os/WIN32/osdThread.c 2011-02-11 16:30:54 +0000
@@ -630,18 +630,21 @@
         free ( pParmWIN32 );
         return NULL;
     }
+
+    EnterCriticalSection ( & pGbl->mutex );
+    ellAdd ( & pGbl->threadList, & pParmWIN32->node );
+    LeaveCriticalSection ( & pGbl->mutex );

     wstat = ResumeThread ( pParmWIN32->handle );
     if (wstat==0xFFFFFFFF) {
+        EnterCriticalSection ( & pGbl->mutex );
+        ellDelete ( & pGbl->threadList, & pParmWIN32->node );
+        LeaveCriticalSection ( & pGbl->mutex );
         CloseHandle ( pParmWIN32->handle );
         free ( pParmWIN32 );
         return NULL;
     }

-    EnterCriticalSection ( & pGbl->mutex );
-    ellAdd ( & pGbl->threadList, & pParmWIN32->node );
-    LeaveCriticalSection ( & pGbl->mutex );
-
     return ( epicsThreadId ) pParmWIN32;
 }
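
The diff above moves the list registration ahead of ResumeThread(). One plausible reading, which the report does not spell out, is that a resumed thread can reach its cleanup path before the creator has registered it. The sketch below replays both orderings in a single thread so the outcome is deterministic. Registry and all function names are hypothetical, and run_thread_to_exit() stands in for a thread that finishes immediately after being resumed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 8

/* Hypothetical registry of live thread parameter blocks. */
typedef struct Registry {
    const void *slots[MAX_THREADS];
    int failed_removals;  /* removals of blocks that were never registered */
} Registry;

/* Register a parameter block; returns 0 on success, -1 if full. */
static int registry_add(Registry *r, const void *parm)
{
    int i;
    assert(parm != NULL);
    for (i = 0; i < MAX_THREADS; i++) {
        if (r->slots[i] == NULL) {
            r->slots[i] = parm;
            return 0;
        }
    }
    return -1;
}

/* Unregister a parameter block, counting attempts on non-members. */
static void registry_remove(Registry *r, const void *parm)
{
    int i;
    for (i = 0; i < MAX_THREADS; i++) {
        if (r->slots[i] == parm) {
            r->slots[i] = NULL;
            return;
        }
    }
    r->failed_removals++;  /* analogous to ellDelete() on a non-member */
}

/* Number of blocks currently registered. */
static int registry_count(const Registry *r)
{
    int i, n = 0;
    for (i = 0; i < MAX_THREADS; i++) {
        if (r->slots[i] != NULL)
            n++;
    }
    return n;
}

/* Stand-in for a thread that exits right after being resumed;
 * its exit path unregisters its own parameter block. */
static void run_thread_to_exit(Registry *r, const void *parm)
{
    registry_remove(r, parm);
}

/* Old ordering: resume first, register afterwards. */
static void create_resume_then_register(Registry *r, const void *parm)
{
    run_thread_to_exit(r, parm);  /* the thread wins the race */
    registry_add(r, parm);        /* leaves a stale entry behind */
}

/* Fixed ordering: register while still suspended, then resume. */
static void create_register_then_resume(Registry *r, const void *parm)
{
    registry_add(r, parm);
    run_thread_to_exit(r, parm);
}
```

With the old ordering the replay records one removal of an unregistered block and leaves one stale entry. With the fixed ordering both counters stay at zero.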

After manually patching osdThread.c (the patch file has corrupt line numbers :-( ), "softIoc.exe"
works well. But now "caput.exe" sometimes runs into an exception! This exception is very hard to
catch (1% of calls in my case).
The exception occurs in the following function in "src/ca/tcpiiu.cpp":

bool tcpiiu :: connectNotify (
    epicsGuard < epicsMutex > & guard, nciu & chan )
{
...
    if ( chan.channelNode::listMember == channelNode::cs_createRespPend ) {
        this->createRespPend.remove ( chan );
        this->subscripReqPend.add ( chan ); <==== exception occurs here!
...

"this->subscripReqPend.add ( chan );" leads to a dereference in "include/tsDLList.h":

template <class T>
inline void tsDLList<T>::remove ( T &item )
{
    tsDLNode<T> &theNode = item;

    if ( this->pLast == &item ) {
        this->pLast = theNode.pPrev;
    }
    else {
        tsDLNode<T> &nextNode = *theNode.pNext;
        nextNode.pPrev = theNode.pPrev; <====== sometimes "theNode.pNext" is NULL here, so "nextNode" is invalid!!!
...

Memory map:
- nextNode {pNext=??? pPrev=??? } tsDLNode<nciu> &
        pNext CXX0030: Error: expression cannot be evaluated
        pPrev CXX0030: Error: expression cannot be evaluated
+ this 0x003db1b4 {pFirst=0x00000000 pLast=0x00000000 itemCount=0 } tsDLList<nciu> * const
- item {eventq={...} accessRightState={...} cacCtx={...} ...} nciu &
+ cacChannel {priorityMax=99 priorityMin=0 priorityDefault=0 ...} cacChannel
+ chronIntIdRes<nciu> {...} chronIntIdRes<nciu>
+ channelNode {listMember=cs_createRespPend } channelNode
+ privateInterfaceForIO {...} privateInterfaceForIO
+ eventq {pFirst=0x00000000 pLast=0x00000000 itemCount=0 } tsDLList<baseNMIU>
+ accessRightState {f_readPermit=true f_writePermit=true f_operatorConfirmationRequest=false } caAccessRights
+ cacCtx {_refLocalHostName={...} chanTable={...} ioTable={...} ...} cac &
+ pNameStr 0x003dafd8 "demoHost:double1" char *
+ piiu 0x100543ec class noopiiu noopIIU netiiu *
        sid 4294967295 unsigned int
        count 0 unsigned int
        retry 0 unsigned int
        nameLength 17 unsigned short
        typeCode 65535 unsigned short
        priority 0 unsigned char
- theNode {pNext=0x00000000 pPrev=0x00000000 } tsDLNode<nciu> &
+ pNext 0x00000000 {eventq={...} accessRightState={...} cacCtx=??? ...} nciu *
+ pPrev 0x00000000 {eventq={...} accessRightState={...} cacCtx=??? ...} nciu *

I checked this on several machines and I don't know what to do next.

Jeff Hill (johill-lanl) on 2011-02-15
Changed in epics-base:
milestone: none → 3.14.branch
status: Confirmed → Fix Committed
Jeff Hill (johill-lanl) wrote :

It appears that this crash might already be fixed by the earlier revision 12173 (committed on 2011-01-15) to ca/tcpiiu.cpp. Unfortunately, I am having trouble finding a bug report to link to in Launchpad.

C:\hill\epicsInBazaar\R3.14\trunk\src\ca>bzr diff -c12173 tcpiiu.cpp
=== modified file 'src/ca/tcpiiu.cpp'
--- src/ca/tcpiiu.cpp 2010-09-20 21:21:50 +0000
+++ src/ca/tcpiiu.cpp 2011-01-15 00:53:33 +0000
@@ -1866,10 +1866,14 @@
     guard.assertIdenticalMutex ( this->mutex );

     while ( nciu * pChan = this->createReqPend.get () ) {
+        pChan->channelNode::listMember =
+            channelNode::cs_none;
         pChan->serviceShutdownNotify ( cbGuard, guard );
     }

     while ( nciu * pChan = this->createRespPend.get () ) {
+        pChan->channelNode::listMember =
+            channelNode::cs_none;
         // we dont yet know the server's id so we cant
         // send a channel delete request and will instead
         // trust that the server can do the proper cleanup
@@ -1878,12 +1882,16 @@
     }

     while ( nciu * pChan = this->v42ConnCallbackPend.get () ) {
+        pChan->channelNode::listMember =
+            channelNode::cs_none;
         this->clearChannelRequest ( guard,
             pChan->getSID(guard), pChan->getCID(guard) );
         pChan->serviceShutdownNotify ( cbGuard, guard );
     }

     while ( nciu * pChan = this->subscripReqPend.get () ) {
+        pChan->channelNode::listMember =
+            channelNode::cs_none;
         pChan->disconnectAllIO ( cbGuard, guard );
         this->clearChannelRequest ( guard,
             pChan->getSID(guard), pChan->getCID(guard) );
@@ -1891,6 +1899,8 @@
     }

     while ( nciu * pChan = this->connectedList.get () ) {
+        pChan->channelNode::listMember =
+            channelNode::cs_none;
         pChan->disconnectAllIO ( cbGuard, guard );
         this->clearChannelRequest ( guard,
             pChan->getSID(guard), pChan->getCID(guard) );
@@ -1898,6 +1908,8 @@
     }

     while ( nciu * pChan = this->unrespCircuit.get () ) {
+        pChan->channelNode::listMember =
+            channelNode::cs_none;
         pChan->disconnectAllIO ( cbGuard, guard );
         // if we know that the circuit is unresponsive
         // then we dont send a channel delete request and
@@ -1907,6 +1919,8 @@
     }

      while ( nciu * pChan = this->subscripUpdateReqPend.get () ) {
+        pChan->channelNode::listMember =
+            channelNode::cs_none;
         pChan->disconnectAllIO ( cbGuard, guard );
         this->clearChannelRequest ( guard,
             pChan->getSID(guard), pChan->getCID(guard) );
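
The revision above resets each channel's list tag whenever a list is drained. A minimal sketch of why that matters follows, assuming (as the diff suggests, but the report does not state) that get() unlinks a channel without touching its tag. Chan, ChanList, ListTag and all function names are hypothetical and only model the pattern. A stale tag would make connect_notify() try to remove the channel from a list it is no longer on.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CS_NONE, CS_CREATE_RESP_PEND, CS_SUBSCRIP_REQ_PEND } ListTag;

/* Hypothetical channel with intrusive links and a membership tag. */
typedef struct Chan {
    struct Chan *next;
    struct Chan *prev;
    ListTag list_member;  /* which list the channel believes it is on */
} Chan;

typedef struct ChanList {
    Chan *first;
    Chan *last;
} ChanList;

/* Append chan and record the list it now belongs to. */
static void chan_list_add(ChanList *list, Chan *chan, ListTag tag)
{
    assert(chan->list_member == CS_NONE);  /* must not be on another list */
    chan->next = NULL;
    chan->prev = list->last;
    if (list->last != NULL)
        list->last->next = chan;
    else
        list->first = chan;
    list->last = chan;
    chan->list_member = tag;
}

/* Unlink chan, which the caller guarantees is on this list. */
static void chan_list_remove(ChanList *list, Chan *chan)
{
    if (chan->prev != NULL)
        chan->prev->next = chan->next;
    else
        list->first = chan->next;
    if (chan->next != NULL)
        chan->next->prev = chan->prev;
    else
        list->last = chan->prev;
    chan->next = chan->prev = NULL;
    chan->list_member = CS_NONE;
}

/* Pop the head of the list; the tag is deliberately left untouched. */
static Chan *chan_list_get(ChanList *list)
{
    Chan *chan = list->first;
    if (chan == NULL)
        return NULL;
    list->first = chan->next;
    if (list->first != NULL)
        list->first->prev = NULL;
    else
        list->last = NULL;
    chan->next = chan->prev = NULL;
    return chan;
}

/* Drain a list on shutdown, resetting each tag as the revision does. */
static void drain(ChanList *list)
{
    Chan *chan;
    while ((chan = chan_list_get(list)) != NULL)
        chan->list_member = CS_NONE;
}

/* Move chan to the subscription list only if its tag says it is
 * waiting for a create response. Returns 1 if moved, otherwise 0. */
static int connect_notify(ChanList *create_resp, ChanList *subscrip_req,
                          Chan *chan)
{
    if (chan->list_member != CS_CREATE_RESP_PEND)
        return 0;
    chan_list_remove(create_resp, chan);
    chan_list_add(subscrip_req, chan, CS_SUBSCRIP_REQ_PEND);
    return 1;
}
```

Because drain() clears the tag, a later connect_notify() on a drained channel is a no-op rather than an unlink of a node with null links.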

Solved

After patching 'src/libCom/osi/os/WIN32/osdThread.c' and 'src/ca/tcpiiu.cpp', everything seems to
work fine now!
Thank you very much!!!

Andrew Johnson (anj) on 2011-04-26
Changed in epics-base:
status: Fix Committed → Fix Released