when repeatedly running command line clients they sometimes (once in 200 tries) dont connect
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
EPICS Base |
Fix Released
|
Critical
|
Andrew Johnson |
Bug Description
Bob Soliday has reported that when repeatedly running
command line clients with circuits to several IOCs
they sometimes (once in 200 tries) dont always connect.
Additional information:
More information (from Andrew) ...
I changed the timeout on the final ca_pend_io() call to 10 minutes, and
ran the test code while I had my caClientMonitor program monitoring all
the CA searches on the network. After a while all searches from this
workstation (hydroxide) stopped while the program was still active.
This is the output from caClientMonitor over that transition:
2006-02-13 17:07:16.280346070 Clients=66 Searches=904 Rate=90.12
Client's UDP Port Age Req's Names
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
gateway437
gateway437
gateway431
2006-02-13 17:07:26.340245867 Clients=61 Searches=675 Rate=67.10
Client's UDP Port Age Req's Names
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
hydroxide:
gateway431
2006-02-13 17:07:36.358291235 Clients=52 Searches=1049 Rate=104.71
Client's UDP Port Age Req's Names
cyclops:46721 0.00 1 1
cyclops:46723 0.00 8 8
ctlapps4l:
landau:33048 58.90 700 20
cyclops:46711 0.00 1 1
cyclops:46717 0.11 142 120
cyclops:46719 0.04 126 120
cyclops:46713 0.00 1 1
cyclops:46715 0.00 3 3
gateway437
gateway436
2006-02-13 17:07:46.370780548 Clients=36 Searches=3507 Rate=350.26
Client's UDP Port Age Req's Names
landau:33048 68.15 3475 20
gateway437
gateway436
gateway431
2006-02-13 17:07:56.412832476 Clients=17 Searches=3198 Rate=318.46
Client's UDP Port Age Req's Names
landau:33048 78.96 3175 20
gateway436
echo:38210 0.21 4 1
echo:38212 0.22 9 7
At this point the program is sitting in ca_pend_io() but not sending any
search requests at all. The relevent parts of a netstat output are:
hydroxide% netstat -a
UDP: IPv4
Local Address Remote Address State
-------
*.5065 Idle
TCP: IPv4
Local Address Remote Address Swind Send-Q Rwind Recv-Q State
---------------- ----------------- ----- ------ ----- ------ -------
hydroxide.36974 iocmtime.5064 8192 0 49640 0 ESTABLISHED
hydroxide.36975 iocbramp.5064 8192 0 49640 0 ESTABLISHED
hydroxide.36976 iocinjtime.5064 8192 0 49248 0 ESTABLISHED
hydroxide.36977 iocrfsource.5064 8192 0 49640 0 ESTABLISHED
hydroxide.36978 iocbbpm3.5064 8192 0 49304 0 ESTABLISHED
hydroxide.36979 iocbbpm1.5064 8192 0 49024 0 ESTABLISHED
hydroxide.36980 iocbbpm2.5064 8192 0 49304 0 ESTABLISHED
There were no TCP sockets in TIME_WAIT state.
I took the opportunity to kill -QUIT the program and examine the
resulting core dump:
(dbx) threads
o> t@1 a l@1 ?() sleep on 0x12e410 in __lwp_park()
t@2 a l@2 start_routine() sleep on 0x12eb00 in __lwp_park()
t@3 a l@3 start_routine() sleep on 0x12ed48 in __lwp_park()
t@4 a l@4 start_routine() LWP suspended in _so_recvfrom()
t@5 a l@5 start_routine() LWP suspended in _so_recv()
t@6 a l@6 start_routine() sleep on 0x167240 in __lwp_park()
t@7 a l@7 start_routine() LWP suspended in _so_recv()
t@8 a l@8 start_routine() sleep on 0x1f1008 in __lwp_park()
t@9 a l@9 start_routine() LWP suspended in _so_recv()
t@10 a l@10 start_routine() sleep on 0x1f5290 in __lwp_park()
t@11 a l@11 start_routine() LWP suspended in _so_recv()
t@12 a l@12 start_routine() sleep on 0x1f9518 in __lwp_park()
t@13 a l@13 start_routine() LWP suspended in _so_recv()
t@14 a l@14 start_routine() sleep on 0x1fa0e0 in __lwp_park()
t@15 a l@15 start_routine() LWP suspended in _so_recv()
t@16 a l@16 start_routine() sleep on 0x203008 in __lwp_park()
t@17 a l@17 start_routine() LWP suspended in _so_recv()
t@18 a l@18 start_routine() sleep on 0x207290 in __lwp_park()
The main thread is in ca_pend_io():
(dbx) where t@1
current thread: t@1
[1] __lwp_park(0x4, 0xffbfbfd8, 0x0, 0x1, 0x0, 0x0), at 0xff065998
[2] cond_wait_
0xff078000), at 0xff062b9c
[3] cond_wait_
0x43f1112a), at 0xff063114
[4] _ti_cond_
0xffbfeed6), at 0xff0635a4
[5] _cond_timedwait
0x0, 0x0), at 0xff0635d8
[6] _ti_pthread_
0xffbfeb9b, 0xffbfc184), at 0xff063618
=>[7] condTimedwait(
0xffbfc184), line 65 in "osdEvent.c"
[8] epicsEventWaitW
line 157 in "osdEvent.c"
[9] epicsEvent:
"epicsEvent.cpp"
[10] ca_client_
0x12e340, event = CLASS, timeout = 600.0), line 607 in
"ca_client_
[11] ca_client_
line 506 in "ca_client_
[12] ca_pend_io(timeout = 600.0), line 481 in "access.cpp"
[13] main(argc = 2, argv = 0xffbff364), line 218 in "debugTest.c"
This is the ipAddrToAsciiEngine worker thread:
(dbx) where t@2
current thread: t@2
=>[1] __lwp_park(0x4, 0x0, 0x0, 0x1, 0xff078000, 0x0), at 0xff065998
[2] cond_wait_
0xff078000), at 0xff062b9c
[3] _cond_wait_
0x1f9590, 0x920), at 0xff063358
[4] _ti_pthread_
0x81010100, 0xff0000), at 0xff063394
[5] condWait(condId = 0x12eb00, mutexId = 0x12eae8), line 75 in
"osdEvent.c"
[6] epicsEventWait(
[7] epicsEvent:
[8] ipAddrToAsciiEn
"ipAddrToAsciiA
[9] epicsThreadCall
"epicsThread.cpp"
[10] start_routine(arg = 0x12ebc8), line 320 in "osdThread.c"
Here's the timerQueue thread:
(dbx) where t@3
current thread: t@3
=>[1] __lwp_park(0x4, 0xfee7fa60, 0x0, 0x1, 0x0, 0x0), at 0xff065998
[2] cond_wait_
0xff078000), at 0xff062b9c
[3] cond_wait_
0x43f1116b), at 0xff063114
[4] _ti_cond_
0x0), at 0xff0635a4
[5] _cond_timedwait
0xfee7fa74, 0x10), at 0xff0635d8
[6] _ti_pthread_
0x1e5273eb, 0xfee7fc0c), at 0xff063618
[7] condTimedwait(
0xfee7fc0c), line 65 in "osdEvent.c"
[8] epicsEventWaitW
65.534625872), line 157 in "osdEvent.c"
[9] epicsEvent:
72 in "epicsEvent.cpp"
[10] timerQueueActiv
"timerQueueActi
[11] epicsThreadCall
"epicsThread.cpp"
[12] start_routine(arg = 0x12ee10), line 320 in "osdThread.c"
At this point I imagine you might be able to extract additional
information about the state of the client library, so I have kept the
core file and executable from this.
On a later run, I ran caClientMonitor set up to monitor just searches
from this workstation, and found that during the 10 minute period there
appeared to be no searches being sent out at all (or at least none that
the CAS tool was being presented with).
Original Mantis Bug: mantis-240
http://
>
> Andrew: Do you have preemptive POSIX scheduling enabled?
% cd configure SITE:USE_ POSIX_THREAD_ PRIORITY_ SCHEDULING = NO
% grep PRIORITY * os/* CONFIG_
Thus we have preemptive scheduling, but its not priority-based.