assert (ev_que->evque[ev_que->putix] == EVENTQEMPTY) failed in dbEvent.c

Bug #541371 reported by Jeff Hill
Affects: EPICS Base
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

2009-06-04 10:00 iocsrvid2>

CAS-client: A call to "assert (ev_que->evque[ev_que->putix] == EVENTQEMPTY)" failed in ../dbEvent.c at 683
Current time THU JUN 04 2009 10:08:33.581404952.
EPICS Release EPICS R3.14.10-asd1 $R3-14-10$ $2008/10/27 19:39:04$.
Please E-mail this message and the output from "tt (0x13144d0)"
to the author or to <email address hidden>
Thread CAS-client (0x13144d0) suspended

This caused all the network buffers to get used up and all network-related activities to die, e.g.:

callbackRequest: cbHigh ring buffer full
NTPTimeSync: NTP requests failing - S_sntpcLib_TIMEOUT
DB CA Link Exception: "Virtual circuit unresponsive", context "164.54.2.121:5064"
save_restore:do_seq - '/iocinfo/saverestore/iocsrvid2/srvid2AutosaveV3.sav' not found. Writing a new one. [090604-101331]
../online_notify.c: CA beacon (send to "164.54.3.255:5065") error was "S_errno_ENOBUFS"

iocsrvid2> tt 0x13144d0
133288 vxTaskEntry +68 : 19290c0 ()
1929118 epicsThreadOnceOsd+168: camsgtask ()
1864ab4 camsgtask +260: camessage ()
1868ba8 camessage +2f0: 1867bb8 ()
1867e3c rsrvFreePutNotify+880: db_post_single_event ()
188629c db_post_single_event+38 : 1885f84 ()
1886080 db_post_extra_labor+194: epicsAssert ()
1928234 epicsAssert +f8 : epicsThreadSuspendSelf ()
1929274 epicsThreadSuspendSelf+14 : taskSuspend ()
value = 0 = 0x0

This is a PowerPC running vxWorks 5.5.2 and Base R3.14.10. The IOC had to be rebooted. This is the second or third time we've seen ENOBUFS recently, although the engineers didn't see (or look for) an assertion failure before this.

Original Mantis Bug: mantis-342
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=342

Tags: ca rsrv
Revision history for this message
Jeff Hill (johill-lanl) wrote :

This caused only one thread to stop so I doubt that the CA server is consuming an abnormal number of MBUFs. Each TCP circuit has a hard quota. Since NTP time sync is failing also it sounds like a network stack issue is the precipitating event.

Nevertheless, this could indicate some kind of bug in the event queue that occurs only when the network stack is clobbered. Another possibility: some errant code clobbered the data structures of both the CA server and the network stack.

I will need to have a careful look at the source code.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

> This caused only one thread to stop so I doubt that the CA server is
> consuming an abnormal number of MBUFs. Each TCP circuit has a hard quota.
> Since NTP time sync is failing also it sounds like a network stack
> issue is the precipitating event.

The thread was in state SUSPEND+I which means its priority was inherited due to it owning a semaphore that some higher priority task wanted. The output from inetstatShow some time later (see below) demonstrates that the network buffers were probably all being used up by requests from CA clients not being handled.

iocsrvid2> inetstatShow
Active Internet connections (including servers)
PCB Proto Recv-Q Send-Q Local Address Foreign Address (state)
-------- ----- ------ ------ ----------------- ------------------ -------
1d4c958 TCP 8056 0 164.54.3.153.5064 164.54.3.75.45183 ESTABLISHED
1d4c8d4 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45174 ESTABLISHED
1d4c850 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45164 ESTABLISHED
1d4c7cc TCP 8760 0 164.54.3.153.5064 164.54.3.75.45155 ESTABLISHED
1d4c748 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45146 ESTABLISHED
1d4c6c4 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45136 ESTABLISHED
1d4c640 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45127 ESTABLISHED
1d4c094 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45118 ESTABLISHED
1d4c118 TCP 2484 0 164.54.3.153.5064 164.54.3.75.45102 CLOSE_WAIT
1d4bf8c TCP 0 0 164.54.3.153.5064 164.54.1.62.53928 ESTABLISHED
1d4c430 TCP 0 0 164.54.3.153.5064 164.54.3.62.1067 ESTABLISHED
1d4c010 TCP 0 0 164.54.3.153.5064 164.54.2.123.1034 ESTABLISHED
1d4c4b4 TCP 0 0 164.54.3.153.5064 164.54.2.217.41003 ESTABLISHED
1d4c3ac TCP 2560 0 164.54.3.153.5064 164.54.1.62.36421 ESTABLISHED
1d4c19c TCP 0 0 164.54.3.153.5064 164.54.2.253.52591 ESTABLISHED
1d4ba64 TCP 0 0 0.0.0.0.5064 0.0.0.0.0 LISTEN
1d4bd7c UDP 0 0 0.0.0.0.673 0.0.0.0.0
1d4c328 UDP 0 0 0.0.0.0.674 0.0.0.0.0
1d4c2a4 UDP 0 0 164.54.3.153.1032 0.0.0.0.0
1d4c220 UDP 0 0 127.0.0.1.1031 127.0.0.1.1026
1d4bf08 UDP 0 0 0.0.0.0.894 0.0.0.0.0
1d4be00 UDP 0 0 0.0.0.0.1030 0.0.0.0.0
1d4bcf8 UDP 0 0 0.0.0.0.5065 0.0.0.0.0
1d4bc74 UDP 0 0 0.0.0.0.1027 0.0.0.0.0
1d4bbf0 UDP 0 0 0.0.0.0.1026 0.0.0.0.0
1d4bb6c UDP 0 0 0.0.0.0.5064 0.0.0.0.0
1d4bae8 UDP 0 0 164.54.3.153.1028 164.54.3.255.5065
1d4b9e0 UDP 0 0 0.0.0.0.1025 0.0.0.0.0
1d4b95c UDP 0 0 0.0.0.0.898 0.0.0.0.0
1d4b8d8 UDP 0 0 0.0.0.0.18322 0.0.0.0.0
value = 1 = 0x1

Revision history for this message
Jeff Hill (johill-lanl) wrote :

> The output from inetstatShow some time later (see below) demonstrates that the
> network buffers were probably all being used up by requests from CA clients not
> being handled.

Since the receive side is full this does point to some issue upstream of the IP kernel.

> PCB Proto Recv-Q Send-Q Local Address Foreign Address (state)
> 1d4c958 TCP 8056 0 164.54.3.153.5064 164.54.3.75.45183

It's significant that at least 8 of the server's receive threads are not keeping up with their requests. That might indicate that a global lock was compromised or deadlocked in the server or the database. That might also indicate that a high priority thread is using all of the CPU (possibly a device driver or tNetTask).

A stack trace from one of the stuck receive threads in the server might be very useful information.

Furthermore, one would expect that the send queue, and not the receive queue, would be full if something odd was going on with dbEvent.c. The normal quiescent condition which runs 24/7 is a more or less empty send queue. However, if the system is down on MBUFs, then perhaps the dbEvent.c issue occurred earlier, when a TCP output queue was temporarily stalled.

edited on: 2009-06-04 16:46

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I had a look at the dbEvent.c code and the basic algorithm appears to be correct; it should be exercised quite often whenever any client is slow on the uptake. There are a number of modes surrounding how this operates, depending in particular on the number of subscriptions on a given event queue, so there is the possibility of an untested mode. Nevertheless, this code is used in many situations.
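
For context, here is a minimal sketch of the ring-buffer bookkeeping that the failing assert protects. The names and structure are simplified and hypothetical, not the actual dbEvent.c source, and locking is omitted; the point is that each subscription reserves a fixed number of slots, so if the quota accounting holds, a producer should always find its target slot empty.

    /* Sketch only: simplified, hypothetical names, no locking. */
    #include <assert.h>
    #include <stddef.h>

    #define QUEUE_SIZE  (4 * 36)   /* events per subscription * subscriptions per queue */
    #define SLOT_EMPTY  NULL

    struct sketch_que {
        void          *evque[QUEUE_SIZE];  /* ring of queued events */
        unsigned short putix;              /* next slot to fill */
        unsigned short getix;              /* next slot to drain */
    };

    /* Producer side: quota accounting is supposed to guarantee the slot is free. */
    static void sketch_post(struct sketch_que *q, void *ev)
    {
        /* If capacity is handed back before the consumer has drained the
         * corresponding slots, a producer can land on an occupied slot and
         * this assert fires, as in the report. */
        assert(q->evque[q->putix] == SLOT_EMPTY);
        q->evque[q->putix] = ev;
        q->putix = (unsigned short)((q->putix + 1) % QUEUE_SIZE);
    }

    /* Consumer side (event_task in the real code): drain and mark the slot empty. */
    static void *sketch_drain(struct sketch_que *q)
    {
        void *ev = q->evque[q->getix];
        if (ev != SLOT_EMPTY) {
            q->evque[q->getix] = SLOT_EMPTY;
            q->getix = (unsigned short)((q->getix + 1) % QUEUE_SIZE);
        }
        return ev;
    }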

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Has this problem ever been seen outside of one particular IOC? I can’t help but notice that many things are misbehaving other than this assert. For example, eight of the server’s input threads are either CPU starved or unable to get a lock that they need. I also see “callbackRequest: cbHigh ring buffer full” which might happen if device support is posting IO interrupt events at a high rate.

edited on: 2009-07-01 17:09

Andrew Johnson (anj)
Changed in epics-base:
status: New → Incomplete
Andrew Johnson (anj)
Changed in epics-base:
importance: High → Low
tags: added: cleanup
Changed in epics-base:
status: Incomplete → Invalid
Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Another occurrence?

https://epics.anl.gov/tech-talk/2020/msg00894.php

callbackRequest: cbLow ring buffer full
A call to 'assert(ev_que->evque[ev_que->putix] == EVENTQEMPTY)'
    by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 769.
Dumping a stack trace of thread 'CAS-client':
[ 0x54b86b]: ../../bin/linux-x86_64/CryD(epicsStackTrace+0x4b)
[ 0x5467fa]: ../../bin/linux-x86_64/CryD(epicsAssert+0x3a)
[ 0x4daa3b]: ../../bin/linux-x86_64/CryD(db_queue_event_log+0xcb)
[ 0x4dac8d]: ../../bin/linux-x86_64/CryD(db_post_single_event+0x3d)
[ 0x4ffce7]: ../../bin/linux-x86_64/CryD(event_add_action+0x1b7)
[ 0x5005de]: ../../bin/linux-x86_64/CryD(camessage+0x9e)
[ 0x4fd00c]: ../../bin/linux-x86_64/CryD(camsgtask+0x12c)
[ 0x54789c]: ../../bin/linux-x86_64/CryD(start_routine+0xdc)
[ 0x7f3c14cdf064]: /lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xc4)
[ 0x7f3c13b9c62d]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)
EPICS Release EPICS R3.15.5.
Local time is 2020-05-03 13:45:51.091659821 EDT
Please E-mail this message to the author or to tech-talk at aps.anl.gov
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x7f3bc4016f20) suspended
callbackRequest: cbLow ring buffer full

Changed in epics-base:
status: Invalid → Incomplete
Revision history for this message
mdavidsaver (mdavidsaver) wrote :

For the two instances reached through event_add_action(), I wonder if this is a use-after-free race with a client which is disconnecting. Use of freeListLib could keep this from being detected, or from obviously crashing. Perhaps related to asComputeAsg(), and thus casAccessRightsCB(), which is one way in which RSRV clients can interact with each other directly.
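
To illustrate the freeListLib point, here is a hedged sketch using the standard freeListLib calls; the struct and scenario are hypothetical. Freed blocks go back to a private pool rather than the system heap, so a stale pointer keeps pointing at memory that looks valid to heap checkers and is very likely recycled for the next allocation.

    /* Sketch only: why freeListLib can mask a use-after-free. */
    #include <stdio.h>
    #include "freeList.h"

    struct client_event { int refs; };   /* hypothetical per-client state */

    int main(void)
    {
        void *pool;
        freeListInitPvt(&pool, sizeof(struct client_event), 16);

        struct client_event *ev = freeListCalloc(pool);    /* client A's event */
        freeListFree(pool, ev);                            /* client A disconnects */

        struct client_event *ev2 = freeListCalloc(pool);   /* client B connects */
        /* Very likely ev2 == ev: a late callback still holding 'ev' would now
         * scribble on client B's state without any heap checker noticing. */
        printf("recycled: %s\n", ev2 == ev ? "yes" : "no");

        freeListFree(pool, ev2);
        freeListCleanup(pool);
        return 0;
    }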

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Running softIoc with a dynamic ACF file, and disabling/bypassing freeListLib, turns up a candidate.

> Possible data race during write of size 8 at 0x542B6C8 by thread #19
> Locks held: 4, at addresses 0x53AFE20 0x53B9880 0x54230C0 0x542BBE0
> at 0x48FFB23: db_queue_event_log (dbEvent.c:824)
> by 0x48FFDB8: db_post_single_event (dbEvent.c:912)
> by 0x493B440: casAccessRightsCB (camessage.c:1114)
> by 0x49F111C: asComputePvt (asLibRoutines.c:1043)
> by 0x49F0E7D: asComputeAsgPvt (asLibRoutines.c:969)
> by 0x49EF50B: asComputeAsg (asLibRoutines.c:484)
> by 0x48DB897: eventCallback (asCa.c:161)
> by 0x49AD242: oldSubscription::current(epicsGuard<epicsMutex>&, unsigned int, unsigned long, void const*) (oldSubscription.cpp:67)
> by 0x490F291: dbContext::callStateNotify(dbChannel*, unsigned int, unsigned long, db_field_log const*, cacStateNotify&) (dbContext.cpp:189)
> by 0x4913209: dbChannelIO::callStateNotify(unsigned int, unsigned long, db_field_log const*, cacStateNotify&) (dbChannelIO.h:110)
> by 0x4912FB8: dbSubscriptionEventCallback (dbSubscriptionIO.cpp:118)
> by 0x490009B: event_read (dbEvent.c:997)
>
> This conflicts with a previous read of size 8 by thread #23
> Locks held: none
> at 0x4900074: event_read (dbEvent.c:998)
> by 0x4900332: event_task (dbEvent.c:1076)
> by 0x4A183CD: start_routine (osdThread.c:412)
> by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
> by 0x4F8FFA2: start_thread (pthread_create.c:486)
> by 0x4EBE4CE: clone (clone.S:95)
> Address 0x542b6c8 is 1,192 bytes inside a block of size 2,432 alloc'd
> at 0x4838BB5: calloc (vg_replace_malloc.c:752)
> by 0x4A014A6: freeListCalloc (freeListLib.c:95)
> by 0x48FE995: db_init_events (dbEvent.c:304)
> by 0x49388B0: create_tcp_client (caservertask.c:1482)
> by 0x49351B5: req_server (caservertask.c:99)
> by 0x4A183CD: start_routine (osdThread.c:412)
> by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
> by 0x4F8FFA2: start_thread (pthread_create.c:486)
> by 0x4EBE4CE: clone (clone.S:95)
> Block was alloc'd by thread #20

Revision history for this message
mdavidsaver (mdavidsaver) wrote (last edit ):

As I read it, access to event_ext::pdbev (cf. server.h) should be guarded by the client::eventqLock lock, but isn't always. I also have some questions about the validity of the event_block being pointed to. It looks like this could be invalidated by a concurrent db_cancel_event().
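
A hedged sketch of the guarded-access pattern being suggested: event_ext::pdbev and client::eventqLock are the names cited above, while the surrounding struct definitions and the helper are hypothetical, and, as noted, a snapshot under the lock alone would not protect against a concurrent db_cancel_event() freeing the event_block.

    /* Sketch only: hypothetical helper, not RSRV code. */
    #include "epicsMutex.h"

    struct event_block;                    /* opaque dbEvent subscription */

    struct event_ext_sketch {
        struct event_block *pdbev;         /* subscription, shared across threads */
    };

    struct client_sketch {
        epicsMutexId eventqLock;           /* per-client event queue lock */
    };

    /* Read pdbev under the lock so a concurrent update on another thread
     * cannot be observed half-way through. */
    static struct event_block *
    get_pdbev(struct client_sketch *client, struct event_ext_sketch *pevext)
    {
        struct event_block *pdbev;
        epicsMutexMustLock(client->eventqLock);
        pdbev = pevext->pdbev;
        epicsMutexUnlock(client->eventqLock);
        return pdbev;
    }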

Changed in epics-base:
status: Incomplete → Triaged
tags: added: rsrv
removed: 3.14 cleanup
Revision history for this message
Dirk Zimoch (dirk.zimoch) wrote :

This bug may be caused by not using spin locks in epicsRingPointer*.

I had back-ported the EPICS 7 implementation of osdSpin, epicsRingPointer* and callback.c. I got the impression that helped.
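
For reference, a minimal sketch of the distinction being drawn, assuming the EPICS 7 epicsRingPointer C API; the functions here are hypothetical. The plain ring is only safe for a single producer and a single consumer, while the locked variant serializes concurrent pushers with a spin lock.

    /* Sketch only: unlocked vs. locked ring creation. */
    #include "epicsRingPointer.h"

    static epicsRingPointerId queue;

    void init_queue(void)
    {
        /* Unlocked: single writer / single reader only. */
        /* queue = epicsRingPointerCreate(1024); */

        /* Locked: safe when several threads push concurrently. */
        queue = epicsRingPointerLockedCreate(1024);
    }

    int post(void *item)
    {
        /* Returns 0 when the ring is full, which is the condition behind the
         * "ring buffer full" messages in the log excerpts above. */
        return epicsRingPointerPush(queue, item);
    }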

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I may understand what could cause this assert() to fail.

I think the assumption behind the reservation logic (4 events per subscriber * 36 subscribers per ev_que) is violated in the window between a call to db_cancel_event(), when the quota is decremented, and event_task cleaning out the cancelled events to make that capacity actually available.

See https://github.com/epics-base/epics-base/pull/424#issuecomment-1728457392
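
A rough illustration of that window, with hypothetical names (the constants match the 4 * 36 figures above): the quota is released at cancel time, but the cancelled subscription's undrained events still occupy ring slots until event_task gets to them.

    /* Sketch only: the accounting window described above. */
    #define EVENTENTRIES  4    /* events reserved per subscription */
    #define EVENTSPERQUE  36   /* subscriptions sharing one ev_que */

    struct quota_sketch {
        unsigned nSubscriptions;   /* admitted against the EVENTSPERQUE budget */
        unsigned nUndrained;       /* events still sitting in evque[] */
    };

    /* db_cancel_event() path: capacity is handed back immediately ... */
    void cancel_sketch(struct quota_sketch *q)
    {
        q->nSubscriptions--;
        /* ... but q->nUndrained only shrinks later, when event_task drains the
         * cancelled events.  A new subscription admitted in this window can
         * post into slots that are still occupied, tripping the
         * evque[putix] == EVENTQEMPTY assert. */
    }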
