assert (ev_que->evque[ev_que->putix] == EVENTQEMPTY) failed in dbEvent.c

Bug #541371 reported by Jeff Hill
Affects: EPICS Base
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

2009-06-04 10:00 iocsrvid2>

CAS-client: A call to "assert (ev_que->evque[ev_que->putix] == EVENTQEMPTY)" failed in ../dbEvent.c at 683
Current time THU JUN 04 2009 10:08:33.581404952.
EPICS Release EPICS R3.14.10-asd1 $R3-14-10$ $2008/10/27 19:39:04$.
Please E-mail this message and the output from "tt (0x13144d0)"
to the author or to <email address hidden>
Thread CAS-client (0x13144d0) suspended

This caused all the network buffers to get used up and all network-related activities to die, e.g.:

callbackRequest: cbHigh ring buffer full
NTPTimeSync: NTP requests failing - S_sntpcLib_TIMEOUT
DB CA Link Exception: "Virtual circuit unresponsive", context "164.54.2.121:5064"
save_restore:do_seq - '/iocinfo/saverestore/iocsrvid2/srvid2AutosaveV3.sav' not found. Writing a new one. [090604-101331]
../online_notify.c: CA beacon (send to "164.54.3.255:5065") error was "S_errno_ENOBUFS"

iocsrvid2> tt 0x13144d0
133288 vxTaskEntry +68 : 19290c0 ()
1929118 epicsThreadOnceOsd+168: camsgtask ()
1864ab4 camsgtask +260: camessage ()
1868ba8 camessage +2f0: 1867bb8 ()
1867e3c rsrvFreePutNotify+880: db_post_single_event ()
188629c db_post_single_event+38 : 1885f84 ()
1886080 db_post_extra_labor+194: epicsAssert ()
1928234 epicsAssert +f8 : epicsThreadSuspendSelf ()
1929274 epicsThreadSuspendSelf+14 : taskSuspend ()
value = 0 = 0x0

This is a PowerPC running vxWorks 5.5.2 and Base R3.14.10. The IOC had to be rebooted. This is the second or third time we've seen ENOBUFS recently, although the engineers didn't see (or look for) an assertion failure before this.

Original Mantis Bug: mantis-342
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=342

Tags: ca rsrv
Revision history for this message
Jeff Hill (johill-lanl) wrote :

This caused only one thread to stop so I doubt that the CA server is consuming an abnormal number of MBUFs. Each TCP circuit has a hard quota. Since NTP time sync is failing also it sounds like a network stack issue is the precipitating event.

Nevertheless, this could indicate some kind of bug in the event queue that occurs only when the network stack is clobbered. Another possibility: some errant code clobbered the data structures of both the CA server and the network stack.

I will need to have a careful look at the source code.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

> This caused only one thread to stop so I doubt that the CA server is
> consuming an abnormal number of MBUFs. Each TCP circuit has a hard quota.
> Since NTP time sync is failing also it sounds like a network stack
> issue is the precipitating event.

The thread was in state SUSPEND+I which means its priority was inherited due to it owning a semaphore that some higher priority task wanted. The output from inetstatShow some time later (see below) demonstrates that the network buffers were probably all being used up by requests from CA clients not being handled.

iocsrvid2> inetstatShow
Active Internet connections (including servers)
PCB Proto Recv-Q Send-Q Local Address Foreign Address (state)
-------- ----- ------ ------ ----------------- ------------------ -------
1d4c958 TCP 8056 0 164.54.3.153.5064 164.54.3.75.45183 ESTABLISHED
1d4c8d4 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45174 ESTABLISHED
1d4c850 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45164 ESTABLISHED
1d4c7cc TCP 8760 0 164.54.3.153.5064 164.54.3.75.45155 ESTABLISHED
1d4c748 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45146 ESTABLISHED
1d4c6c4 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45136 ESTABLISHED
1d4c640 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45127 ESTABLISHED
1d4c094 TCP 8760 0 164.54.3.153.5064 164.54.3.75.45118 ESTABLISHED
1d4c118 TCP 2484 0 164.54.3.153.5064 164.54.3.75.45102 CLOSE_WAIT
1d4bf8c TCP 0 0 164.54.3.153.5064 164.54.1.62.53928 ESTABLISHED
1d4c430 TCP 0 0 164.54.3.153.5064 164.54.3.62.1067 ESTABLISHED
1d4c010 TCP 0 0 164.54.3.153.5064 164.54.2.123.1034 ESTABLISHED
1d4c4b4 TCP 0 0 164.54.3.153.5064 164.54.2.217.41003 ESTABLISHED
1d4c3ac TCP 2560 0 164.54.3.153.5064 164.54.1.62.36421 ESTABLISHED
1d4c19c TCP 0 0 164.54.3.153.5064 164.54.2.253.52591 ESTABLISHED
1d4ba64 TCP 0 0 0.0.0.0.5064 0.0.0.0.0 LISTEN
1d4bd7c UDP 0 0 0.0.0.0.673 0.0.0.0.0
1d4c328 UDP 0 0 0.0.0.0.674 0.0.0.0.0
1d4c2a4 UDP 0 0 164.54.3.153.1032 0.0.0.0.0
1d4c220 UDP 0 0 127.0.0.1.1031 127.0.0.1.1026
1d4bf08 UDP 0 0 0.0.0.0.894 0.0.0.0.0
1d4be00 UDP 0 0 0.0.0.0.1030 0.0.0.0.0
1d4bcf8 UDP 0 0 0.0.0.0.5065 0.0.0.0.0
1d4bc74 UDP 0 0 0.0.0.0.1027 0.0.0.0.0
1d4bbf0 UDP 0 0 0.0.0.0.1026 0.0.0.0.0
1d4bb6c UDP 0 0 0.0.0.0.5064 0.0.0.0.0
1d4bae8 UDP 0 0 164.54.3.153.1028 164.54.3.255.5065
1d4b9e0 UDP 0 0 0.0.0.0.1025 0.0.0.0.0
1d4b95c UDP 0 0 0.0.0.0.898 0.0.0.0.0
1d4b8d8 UDP 0 0 0.0.0.0.18322 0.0.0.0.0
value = 1 = 0x1

Revision history for this message
Jeff Hill (johill-lanl) wrote :

> The output from inetstatShow some time later (see below) demonstrates that the
> network buffers were probably all being used up by requests from CA clients not
> being handled.

Since the receive side is full this does point to some issue upstream of the IP kernel.

> PCB Proto Recv-Q Send-Q Local Address Foreign Address (state)
> 1d4c958 TCP 8056 0 164.54.3.153.5064 164.54.3.75.45183

It's significant that at least 8 of the server's receive threads are not keeping up with their requests. That might indicate that a global lock was compromised or deadlocked in the server or the database. That might also indicate that a high priority thread is using all of the CPU (possibly a device driver or tNetTask).

A stack trace from one of the stuck receive threads in the server might be very useful information.

Furthermore, one would expect that the send queue, and not the receive queue, would be full if something odd was going on with dbEvent.c. The normal quiescent condition which runs 24/7 is a more or less empty send queue. However, if the system is down on MBUFs, then perhaps the dbEvent.c issue occurred earlier, when a TCP output queue was temporarily stalled.

edited on: 2009-06-04 16:46

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I had a look at the dbEvent.c code and the basic algorithm appears to be correct; it should be exercised quite often whenever any client is slow on the uptake. There are a number of modes surrounding how this operates, depending in particular on the number of subscriptions on a given event queue, so there is the possibility of an untested mode. Nevertheless, this code is used in many situations.
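
For context, here is a minimal sketch of the ring-buffer bookkeeping that the failing assert protects. The names and structure are simplified and hypothetical, not the actual dbEvent.c source, and locking is omitted; the point is that each subscription reserves a fixed number of slots, so if the quota accounting holds, a producer should always find its target slot empty.

    /* Sketch only: simplified, hypothetical names, no locking. */
    #include <assert.h>
    #include <stddef.h>

    #define QUEUE_SIZE  (4 * 36)   /* events per subscription * subscriptions per queue */
    #define SLOT_EMPTY  NULL

    struct sketch_que {
        void          *evque[QUEUE_SIZE];  /* ring of queued events */
        unsigned short putix;              /* next slot to fill */
        unsigned short getix;              /* next slot to drain */
    };

    /* Producer side: quota accounting is supposed to guarantee the slot is free. */
    static void sketch_post(struct sketch_que *q, void *ev)
    {
        /* If capacity is handed back before the consumer has drained the
         * corresponding slots, a producer can land on an occupied slot and
         * this assert fires, as in the report. */
        assert(q->evque[q->putix] == SLOT_EMPTY);
        q->evque[q->putix] = ev;
        q->putix = (unsigned short)((q->putix + 1) % QUEUE_SIZE);
    }

    /* Consumer side (event_task in the real code): drain and mark the slot empty. */
    static void *sketch_drain(struct sketch_que *q)
    {
        void *ev = q->evque[q->getix];
        if (ev != SLOT_EMPTY) {
            q->evque[q->getix] = SLOT_EMPTY;
            q->getix = (unsigned short)((q->getix + 1) % QUEUE_SIZE);
        }
        return ev;
    }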

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Has this problem ever been seen outside of one particular IOC? I can’t help but notice that many things are misbehaving other than this assert. For example, eight of the server’s input threads are either CPU starved or unable to get a lock that they need. I also see “callbackRequest: cbHigh ring buffer full” which might happen if device support is posting IO interrupt events at a high rate.

edited on: 2009-07-01 17:09

Andrew Johnson (anj)
Changed in epics-base:
status: New → Incomplete
Andrew Johnson (anj)
Changed in epics-base:
importance: High → Low
tags: added: cleanup
Changed in epics-base:
status: Incomplete → Invalid
Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Another occurrence?

https://epics.anl.gov/tech-talk/2020/msg00894.php

callbackRequest: cbLow ring buffer full
A call to 'assert(ev_que->evque[ev_que->putix] == EVENTQEMPTY)'
    by thread 'CAS-client' failed in ../../../src/ioc/db/dbEvent.c line 769.
Dumping a stack trace of thread 'CAS-client':
[ 0x54b86b]: ../../bin/linux-x86_64/CryD(epicsStackTrace+0x4b)
[ 0x5467fa]: ../../bin/linux-x86_64/CryD(epicsAssert+0x3a)
[ 0x4daa3b]: ../../bin/linux-x86_64/CryD(db_queue_event_log+0xcb)
[ 0x4dac8d]: ../../bin/linux-x86_64/CryD(db_post_single_event+0x3d)
[ 0x4ffce7]: ../../bin/linux-x86_64/CryD(event_add_action+0x1b7)
[ 0x5005de]: ../../bin/linux-x86_64/CryD(camessage+0x9e)
[ 0x4fd00c]: ../../bin/linux-x86_64/CryD(camsgtask+0x12c)
[ 0x54789c]: ../../bin/linux-x86_64/CryD(start_routine+0xdc)
[ 0x7f3c14cdf064]: /lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xc4)
[ 0x7f3c13b9c62d]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)
EPICS Release EPICS R3.15.5.
Local time is 2020-05-03 13:45:51.091659821 EDT
Please E-mail this message to the author or to tech-talk at aps.anl.gov
Calling epicsThreadSuspendSelf()
Thread CAS-client (0x7f3bc4016f20) suspended
callbackRequest: cbLow ring buffer full

Changed in epics-base:
status: Invalid → Incomplete
Revision history for this message
mdavidsaver (mdavidsaver) wrote :

For the two instances reached through event_add_action(), I wonder if this is a use-after-free race with a client which is disconnecting. Use of freeListLib could keep this from being detected, or from obviously crashing. Perhaps related to asComputeAsg(), and thus casAccessRightsCB(), which is one way in which RSRV clients can interact with each other directly.
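
To illustrate the freeListLib point, here is a hedged sketch using the standard freeListLib calls; the struct and scenario are hypothetical. Freed blocks go back to a private pool rather than the system heap, so a stale pointer keeps pointing at memory that looks valid to heap checkers and is very likely recycled for the next allocation.

    /* Sketch only: why freeListLib can mask a use-after-free. */
    #include <stdio.h>
    #include "freeList.h"

    struct client_event { int refs; };   /* hypothetical per-client state */

    int main(void)
    {
        void *pool;
        freeListInitPvt(&pool, sizeof(struct client_event), 16);

        struct client_event *ev = freeListCalloc(pool);    /* client A's event */
        freeListFree(pool, ev);                            /* client A disconnects */

        struct client_event *ev2 = freeListCalloc(pool);   /* client B connects */
        /* Very likely ev2 == ev: a late callback still holding 'ev' would now
         * scribble on client B's state without any heap checker noticing. */
        printf("recycled: %s\n", ev2 == ev ? "yes" : "no");

        freeListFree(pool, ev2);
        freeListCleanup(pool);
        return 0;
    }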

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Running softIoc with a dynamic ACF file, and disabling/bypassing freeListLib, turns up a candidate.

> Possible data race during write of size 8 at 0x542B6C8 by thread #19
> Locks held: 4, at addresses 0x53AFE20 0x53B9880 0x54230C0 0x542BBE0
> at 0x48FFB23: db_queue_event_log (dbEvent.c:824)
> by 0x48FFDB8: db_post_single_event (dbEvent.c:912)
> by 0x493B440: casAccessRightsCB (camessage.c:1114)
> by 0x49F111C: asComputePvt (asLibRoutines.c:1043)
> by 0x49F0E7D: asComputeAsgPvt (asLibRoutines.c:969)
> by 0x49EF50B: asComputeAsg (asLibRoutines.c:484)
> by 0x48DB897: eventCallback (asCa.c:161)
> by 0x49AD242: oldSubscription::current(epicsGuard<epicsMutex>&, unsigned int, unsigned long, void const*) (oldSubscription.cpp:67)
> by 0x490F291: dbContext::callStateNotify(dbChannel*, unsigned int, unsigned long, db_field_log const*, cacStateNotify&) (dbContext.cpp:189)
> by 0x4913209: dbChannelIO::callStateNotify(unsigned int, unsigned long, db_field_log const*, cacStateNotify&) (dbChannelIO.h:110)
> by 0x4912FB8: dbSubscriptionEventCallback (dbSubscriptionIO.cpp:118)
> by 0x490009B: event_read (dbEvent.c:997)
>
> This conflicts with a previous read of size 8 by thread #23
> Locks held: none
> at 0x4900074: event_read (dbEvent.c:998)
> by 0x4900332: event_task (dbEvent.c:1076)
> by 0x4A183CD: start_routine (osdThread.c:412)
> by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
> by 0x4F8FFA2: start_thread (pthread_create.c:486)
> by 0x4EBE4CE: clone (clone.S:95)
> Address 0x542b6c8 is 1,192 bytes inside a block of size 2,432 alloc'd
> at 0x4838BB5: calloc (vg_replace_malloc.c:752)
> by 0x4A014A6: freeListCalloc (freeListLib.c:95)
> by 0x48FE995: db_init_events (dbEvent.c:304)
> by 0x49388B0: create_tcp_client (caservertask.c:1482)
> by 0x49351B5: req_server (caservertask.c:99)
> by 0x4A183CD: start_routine (osdThread.c:412)
> by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
> by 0x4F8FFA2: start_thread (pthread_create.c:486)
> by 0x4EBE4CE: clone (clone.S:95)
> Block was alloc'd by thread #20

Revision history for this message
mdavidsaver (mdavidsaver) wrote (last edit ):

As I read it, access to event_ext::pdbev (cf. server.h) should be guarded by the client::eventqLock lock, but isn't always. I also have some questions about the validity of the event_block being pointed to. It looks like this could be invalidated by a concurrent db_cancel_event().
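
A hedged sketch of the guarded-access pattern being suggested: event_ext::pdbev and client::eventqLock are the names cited above, while the surrounding struct definitions and the helper are hypothetical, and, as noted, a snapshot under the lock alone would not protect against a concurrent db_cancel_event() freeing the event_block.

    /* Sketch only: hypothetical helper, not RSRV code. */
    #include "epicsMutex.h"

    struct event_block;                    /* opaque dbEvent subscription */

    struct event_ext_sketch {
        struct event_block *pdbev;         /* subscription, shared across threads */
    };

    struct client_sketch {
        epicsMutexId eventqLock;           /* per-client event queue lock */
    };

    /* Read pdbev under the lock so a concurrent update on another thread
     * cannot be observed half-way through. */
    static struct event_block *
    get_pdbev(struct client_sketch *client, struct event_ext_sketch *pevext)
    {
        struct event_block *pdbev;
        epicsMutexMustLock(client->eventqLock);
        pdbev = pevext->pdbev;
        epicsMutexUnlock(client->eventqLock);
        return pdbev;
    }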

Changed in epics-base:
status: Incomplete → Triaged
tags: added: rsrv
removed: 3.14 cleanup
Revision history for this message
Dirk Zimoch (dirk.zimoch) wrote :

This bug may be caused by not using spin locks in epicsRingPointer*.

I had back-ported the EPICS 7 implementation of osdSpin, epicsRingPointer* and callback.c. I got the impression that helped.
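
For reference, a minimal sketch of the distinction being drawn, assuming the EPICS 7 epicsRingPointer C API; the functions here are hypothetical. The plain ring is only safe for a single producer and a single consumer, while the locked variant serializes concurrent pushers with a spin lock.

    /* Sketch only: unlocked vs. locked ring creation. */
    #include "epicsRingPointer.h"

    static epicsRingPointerId queue;

    void init_queue(void)
    {
        /* Unlocked: single writer / single reader only. */
        /* queue = epicsRingPointerCreate(1024); */

        /* Locked: safe when several threads push concurrently. */
        queue = epicsRingPointerLockedCreate(1024);
    }

    int post(void *item)
    {
        /* Returns 0 when the ring is full, which is the condition behind the
         * "ring buffer full" messages in the log excerpts above. */
        return epicsRingPointerPush(queue, item);
    }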

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I may understand what could cause this assert() to fail.

I think the assumption behind the reservation logic (4 events per subscriber * 36 subscribers per ev_que) is violated in the window between a call to db_cancel_event(), when the quota is decremented, and event_task cleaning out the cancelled events to make that capacity actually available.

See https://github.com/epics-base/epics-base/pull/424#issuecomment-1728457392
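
A rough illustration of that window, with hypothetical names (the constants match the 4 * 36 figures above): the quota is released at cancel time, but the cancelled subscription's undrained events still occupy ring slots until event_task gets to them.

    /* Sketch only: the accounting window described above. */
    #define EVENTENTRIES  4    /* events reserved per subscription */
    #define EVENTSPERQUE  36   /* subscriptions sharing one ev_que */

    struct quota_sketch {
        unsigned nSubscriptions;   /* admitted against the EVENTSPERQUE budget */
        unsigned nUndrained;       /* events still sitting in evque[] */
    };

    /* db_cancel_event() path: capacity is handed back immediately ... */
    void cancel_sketch(struct quota_sketch *q)
    {
        q->nSubscriptions--;
        /* ... but q->nUndrained only shrinks later, when event_task drains the
         * cancelled events.  A new subscription admitted in this window can
         * post into slots that are still occupied, tripping the
         * evque[putix] == EVENTQEMPTY assert. */
    }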
