'Bad file number' messages
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
EPICS Base | Fix Released | High | Jeff Hill |
Bug Description
I've been working with the latest ca gateway here at the LBNL. I'm
running it on solaris with base-3.14.6 to basically alias about 85 PVs
in various IOC crates around the storage ring. It's working but it
sometimes stops responding and when I look at the gateway.log it's huge,
like 500MB. The messages seem to be related to fds. This symptom looks
like it correlates with rebooting some of the IOC crates (e.g. crioc02).
So I'm wondering if you've seen this type of behavior before and if you
had a suggestion.
Chris Timossi
LBNL
The gateway.log looks like:
<email address hidden>:(dev1.als) more gateway.
Jul 13 14:28:09 PV Gateway Version 2.0.0.0Beta8 [Jun 10 2004 08:51:35]
EPICS 3.14.6 PID=8205 ServerPID=8204
Statistics PV prefix is dev1
Jul 13 14:32:04 Warning: Virtual circuit unresponsive
crioc02.
Jul 13 14:32:24 Warning: Virtual circuit disconnect crioc02.
Jul 13 14:55:56 Warning: Virtual circuit unresponsive
crioc02.
CAS: FIONREAD for bl7-46.
Jul 13 14:55:56 !!! Errlog message received (message is above)
CAS: FIONREAD for bl9-31.
Jul 13 14:55:56 !!! Errlog message received (message is above)
CAS: FIONREAD for bl9-31.
Jul 13 14:55:56 !!! Errlog message received (message is above)
CAS: FIONREAD for bl9-31.
Jul 13 14:55:56 !!! Errlog message received (message is above)
CAS: FIONREAD for blc92-30.
Jul 13 14:55:56 !!! Errlog message received (message is above)
fdManager: select failed because "Bad file number"
----------- many, many, of the following lines -----------------
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
fdManager: select failed because "Bad file number"
Jul 13 13:35:32 PV Gateway Ending (SIGTERM)
Original Mantis Bug: mantis-107
http://
There is a constant associated with the mask used by select called FD_SETSIZE. The fdManager.cpp sets it to 4096. If there were a problem with that, you would probably see this message:
fprintf (stderr, "%s: fd > FD_SETSIZE ignored\n", __FILE__);
Otherwise, some operating systems have built-in file descriptor limits that are set by kernel configuration. I seem to recall (now) that Ken has been down this road before on Solaris. Any comments, Ken?
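For anyone wanting to check both limits on their own host, here is a minimal sketch in plain C. It is not taken from fdManager.cpp; the function names fd_fits_in_fd_set, soft_fd_limit and report_fd_limits are made up for illustration. It compares a descriptor against the compile-time FD_SETSIZE and reads the per-process limit with getrlimit(RLIMIT_NOFILE), assuming a POSIX system.

```c
#include <limits.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical helper: returns 1 if fd may be passed to FD_SET, else 0.
 * FD_SET on a descriptor outside [0, FD_SETSIZE) is undefined behavior. */
int fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical helper: returns the soft per-process limit on open
 * descriptors, LONG_MAX if it is unlimited, or -1 if getrlimit fails. */
long soft_fd_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        return -1;
    }
    if (lim.rlim_cur == RLIM_INFINITY || lim.rlim_cur > (rlim_t) LONG_MAX) {
        return LONG_MAX;
    }
    return (long) lim.rlim_cur;
}

/* Prints both limits so they can be compared side by side. */
void report_fd_limits(void)
{
    printf("FD_SETSIZE = %d, soft RLIMIT_NOFILE = %ld\n",
           (int) FD_SETSIZE, soft_fd_limit());
}
```

If the soft limit reported here is larger than FD_SETSIZE, a busy process can be handed descriptors that select cannot represent, which is the case the "fd > FD_SETSIZE ignored" message guards against.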
What appears to be significant is that only select and ioctl FIONREAD are failing and setting errno to "Bad file number". This does not appear to occur when the server or client creates a new file descriptor or makes other system calls using the file descriptor in question. We therefore have to conclude that either those two calls are hitting a special limit, or there is corruption that other parts of the code are not seeing. One possibility is that an fdManager data structure was still in use after it had been destroyed, although preliminary inspection of the code does not support this conclusion. The ioctl FIONREAD in the server for TCP circuits is also failing, so if there is corruption of this type, it is occurring in the server library as well.
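As a point of reference, the sketch below shows one simple way both calls can end up with EBADF (the errno that Solaris prints as "Bad file number"): a descriptor that has been closed but is still handed to select or ioctl. This only illustrates the symptom and is not a reproduction of the gateway problem; the function names are hypothetical, and it assumes FIONREAD is available from <sys/ioctl.h> (on Solaris it may need <sys/filio.h> instead).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical demo: select() on a descriptor that was already closed.
 * Returns the errno from select, 0 if select did not fail,
 * or -1 if the setup itself failed. */
int select_errno_on_closed_fd(void)
{
    int fds[2];
    int stale, rc, err;
    fd_set readfds;
    struct timeval timeout = {0, 0};   /* poll, do not block */

    if (pipe(fds) != 0) {
        return -1;
    }
    stale = fds[0];
    close(fds[0]);                     /* descriptor number is now stale */
    if (stale >= FD_SETSIZE) {
        close(fds[1]);
        return -1;
    }

    FD_ZERO(&readfds);
    FD_SET(stale, &readfds);           /* stale entry left in the set */
    rc = select(stale + 1, &readfds, NULL, NULL, &timeout);
    err = (rc < 0) ? errno : 0;

    close(fds[1]);
    return err;
}

/* Hypothetical demo: ioctl FIONREAD on a descriptor that was already
 * closed. Same return convention as above. */
int fionread_errno_on_closed_fd(void)
{
    int fds[2];
    int stale, rc, err;
    int nbytes = 0;

    if (pipe(fds) != 0) {
        return -1;
    }
    stale = fds[0];
    close(fds[0]);

    rc = ioctl(stale, FIONREAD, &nbytes);
    err = (rc < 0) ? errno : 0;

    close(fds[1]);
    return err;
}
```

Because a single bad entry makes select fail immediately for the whole set, a loop that keeps calling select without removing the entry will log the same error continuously, which would be consistent with a log file growing to hundreds of megabytes.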