'Bad file number' messages

Bug #541178 reported by Jeff Hill
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
High
Jeff Hill

Bug Description

I've been working with the latest ca gateway here at the LBNL. I'm
running it on solaris with base-3.14.6 to basically alias about 85 PVs
in various IOC crates around the storage ring. It's working but it
sometimes stops responding and when I look at the gateway.log it's huge,
like 500MB. The messages seem to be related to fds. This symptom looks
like it correlates with rebooting some of the IOC crates (e.g. crioc02) .

So I'm wondering if you've seen this type of behavior before and if you
had a suggestion.

Chris Timossi
LBNL

The gateway.log looks like:
<email address hidden>:(dev1.als) more gateway.log.1089756614
Jul 13 14:28:09 PV Gateway Version 2.0.0.0Beta8 [Jun 10 2004 08:51:35]
EPICS 3.14.6 PID=8205 ServerPID=8204
Statistics PV prefix is dev1
Jul 13 14:32:04 Warning: Virtual circuit unresponsive
crioc02.als.lbl.gov:5064
Jul 13 14:32:24 Warning: Virtual circuit disconnect crioc02.als.lbl.gov:5064
Jul 13 14:55:56 Warning: Virtual circuit unresponsive
crioc02.als.lbl.gov:5064
CAS: FIONREAD for bl7-46.als.lbl.gov:33044 failed because "Bad file number"

Jul 13 14:55:56 !!! Errlog message received (message is above)
CAS: FIONREAD for bl9-31.als.lbl.gov:1111 failed because "Bad file number"

Jul 13 14:55:56 !!! Errlog message received (message is above)
CAS: FIONREAD for bl9-31.als.lbl.gov:1112 failed because "Bad file number"

Jul 13 14:55:56 !!! Errlog message received (message is above)
CAS: FIONREAD for bl9-31.als.lbl.gov:1110 failed because "Bad file number"

Jul 13 14:55:56 !!! Errlog message received (message is above)
CAS: FIONREAD for blc92-30.als.lbl.gov:1034 failed because "Bad file number"

Jul 13 14:55:56 !!! Errlog message received (message is above)
fdManager: select failed because "Bad file number"

----------- many, many, of the following lines -----------------

fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
Jul 13 13:35:32 PV Gateway Ending (SIGTERM)

Original Mantis Bug: mantis-107
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=107

Tags: cas 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

There is a constant associated with the mask used by select called FD_SETSIZE. The fdManager.cpp sets it to 4096. If there was a problem with that you would probably see this message.

        fprintf (stderr, "%s: fd > FD_SETSIZE ignored\\n", __FILE__);

Otherwise, some of the OS have built in file descriptor limits that are set by kernel configuration. I seem to recall (now) that Ken has been down this road before on Solaris. Any comments Ken?

What appears to be significant is that only select and ioctl FIONREAD are bailing out, and setting errno to "bad file number". This does not appear to be occurring when the server or client tries to make a new file descriptor or make other system calls using the file descriptor in question. Therefore, we have to conclude that either those calls are hitting a special limit, or else there is corruption that other parts of the code are not seeing. One possibility might be that an fdManager data structure was still in use after it had been destroyed although preliminary inspections of the code do not lead to this conclusion. The ioctl FIONREAD in the server for TCP circuits is also failing, and so therefore if there is this type of corruption it is occurring in the server library also.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Ken Evans

You are the only one having this problem. It is likely something with your
system.

I don't remember if you said if you were building with gcc. If so, use the
Sun compilers.

I think you said you had 3.14.6 and Gateway Beta11. If not, get them.
There is no need to debug old code.

Could permissions be a problem?

Could your system or some other application not be closing FDs. SSH?

The Gateway will try to get the maximum number of FDs available. This
maximum (hard limit) is set by system administrators. limit will show the
limits. limit -h will set it to the hard limits and show you what they are.
The Gateway sets it to the hard limits when it starts.

On Solaris only 256 FDs are available for fopen(). However, the CA socket
calls can access up to the hard limit. This is only a problem for opening
files when the sockets have used all the FDs up to 255. The Gateway works
around this by reserving one. This problem is only relevent for opening
files, such as the Gateway command file. Because of the workaround, it
should not be your problem, anyway.

In any event, you should not be near the limits unless you have lots of
connections to lots of IOCs.

lsof -p <pid> lists all the FDs for a PID. Use it on the Gateway pid. You
may have to look at the documentation for lsof.

pstack <core> | c++filt >! pstack.out is a way to get a stack trace. You
can make a core with kill -ABRT <pid>. I think it also works with <pid> in
place of <core>. Again see the documentation.

truss is useful for determining what is happening on the system level.

The best way to debug this is to build a debug version of both base and the
Gateway and try to trap what is happening in a debugger. That is what I
would do if I were there.

You may have to get a system administrator to do some of the above commands.
Depends on your sustem.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

As mentioned in previous MM, I have a strong guess now that your solo experience with this issue results from your having an old EPICS version in the IOC that disconnected.

To test this theory please modify the source code in "base/src/cas/io/bsdSocket/casStreamIO.cc" line 201 as follows.

This may result in some benign messages of the form "CAC TCP socket shutdown error was" that I will need to clean out with more extensive source code changes once you confirm that my guess resolves your issue.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

BEFORE:

// casStreamIO::forceDisconnect()
void casStreamIO::forceDisconnect ()
{
 if ( ! this->sockHasBeenClosed ) {
        this->sockHasBeenClosed = true;
        int status = ::shutdown ( this->sock, SHUT_RDWR );
        if ( status ) {
            char sockErrBuf[64];
            epicsSocketConvertErrnoToString ( sockErrBuf, sizeof ( sockErrBuf ) );
            errlogPrintf ("CAC TCP socket shutdown error was %s\\n",
                sockErrBuf );
        }
  epicsSocketDestroy ( this->sock );
        // other wakeup will be required here when we
        // switch to a threaded implementation
 }
}

AFTER:

// casStreamIO::forceDisconnect()
void casStreamIO::forceDisconnect ()
{
        int status = ::shutdown ( this->sock, SHUT_RDWR );
        if ( status ) {
            char sockErrBuf[64];
            epicsSocketConvertErrnoToString ( sockErrBuf, sizeof ( sockErrBuf ) );
            errlogPrintf ("CAC TCP socket shutdown error was %s\\n",
                sockErrBuf );
        }
}

Revision history for this message
Jeff Hill (johill-lanl) wrote :

fixed in R3.14.7

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.7 Released

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.