[R2.20 Build 21] Agent crash @ <WorkQueue<Sandesh*>::ProcessLowWaterMarks

Bug #1456457 reported by chhandak
Affects             Status         Importance  Assigned to  Milestone
Juniper Openstack   Status tracked in Trunk
  R2.20             Fix Committed  Medium      chhandak
  Trunk             Fix Committed  Medium      chhandak

Bug Description

Trigger
-------------
Pumping 1 million MAC addresses at 500K PPS from the IXIA side.

Backtrace
----------------
(gdb) bt
#0 0x00007febf4a99cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007febf4a9d0d8 in __GI_abort () at abort.c:89
#2 0x00007febf4a92b86 in __assert_fail_base (fmt=0x7febf4be3830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x1dd8e1f "count <= wm_info.count_", file=file@entry=0x1dd89f8 "controller/src/base/queue_task.h",
    line=line@entry=448,
    function=function@entry=0x1dd92a0 <WorkQueue<Sandesh*>::ProcessLowWaterMarks(unsigned long)::__PRETTY_FUNCTION__> "void WorkQueue<QueueEntryT>::ProcessLowWaterMarks(size_t) [with QueueEntryT = Sandesh*; size_t = long unsigned int]") at assert.c:92
#3 0x00007febf4a92c32 in __GI___assert_fail (assertion=0x1dd8e1f "count <= wm_info.count_",
    file=0x1dd89f8 "controller/src/base/queue_task.h", line=448,
    function=0x1dd92a0 <WorkQueue<Sandesh*>::ProcessLowWaterMarks(unsigned long)::__PRETTY_FUNCTION__> "void WorkQueue<QueueEntryT>::ProcessLowWaterMarks(size_t) [with QueueEntryT = Sandesh*; size_t = long unsigned int]") at assert.c:101
#4 0x0000000001bc87f0 in WorkQueue<Sandesh*>::ProcessLowWaterMarks (this=0x7febcc00abf0, count=10002)
    at controller/src/base/queue_task.h:448
#5 0x0000000001bc7926 in WorkQueue<Sandesh*>::Dequeue (this=0x7febcc00abf0, entry=0x7febed51cb20)
    at controller/src/base/queue_task.h:267
#6 0x0000000001bc565f in QueueTaskRunner<Sandesh*, WorkQueue<Sandesh*> >::RunQueue (this=0x7febbeacf5a0)
    at controller/src/base/queue_task.h:72
#7 0x0000000001bc480a in QueueTaskRunner<Sandesh*, WorkQueue<Sandesh*> >::Run (this=0x7febbeacf5a0)
    at controller/src/base/queue_task.h:57
#8 0x0000000001ca575c in TaskImpl::execute (this=0x7febee2dbc40) at controller/src/base/task.cc:232
#9 0x00007febf5668b3a in ?? () from /usr/lib/libtbb.so.2
#10 0x00007febf5664816 in ?? () from /usr/lib/libtbb.so.2
#11 0x00007febf5663f4b in ?? () from /usr/lib/libtbb.so.2
#12 0x00007febf56600ff in ?? () from /usr/lib/libtbb.so.2
#13 0x00007febf56602f9 in ?? () from /usr/lib/libtbb.so.2
#14 0x00007febf5884182 in start_thread (arg=0x7febed51d700) at pthread_create.c:312
#15 0x00007febf4b5d47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Setup
------------
env.roledefs = {
    'all': [host1, host2, host3, host4, host5, host6],
    'cfgm': [host1, host2, host3],
    'openstack': [host1, host2, host3],
    'webui': [host2],
    'control': [host1, host3],
    'compute': [host4, host5, host6],
    'tsn': [host4, host5],
    'toragent': [host4, host5],
    'collector': [host1, host3],
    'database': [host1, host2, host3],
    'build': [host_build],
}

env.hostnames = {
    'all': ['nodei6', 'nodei7', 'nodei8', 'nodei9', 'nodei10', 'nodei19']
}

chhandak (chhandak) wrote :
Hari Prasad Killi (haripk) wrote :
Changed in juniperopenstack:
assignee: nobody → Megh Bhatt (meghb)
tags: added: analytics
Megh Bhatt (meghb) wrote :

I'm not able to get the bt from the core to confirm whether this is indeed a dup. It seems that contrail-vrouter-agent was privately built by haripk?

root@nodei10:~# strings /var/tmp/nodei10_core.contrail-vroute.12748.nodei10.1431978044 | grep build-info
{"build-info": [{"build-time": "2015-05-15 09:23:33.316896", "build-hostname": "nodeb6", "build-git-ver": "d46e40f", "build-user": "haripk", "build-version": "2.20"}]}
{"build-info":[{"build-time":"2015-05-15 09:23:33.316896","build-hostname":"nodeb6","build-git-ver":"d46e40f","build-user":"haripk","build-version":"2.20","build-id":"2.20-21","build-number":"21"}]}
{"build-info":[{"build-time":"2015-05-15 09:23:33.316896","build-hostname":"nodeb6","build-git-ver":"d46e40f","build-user":"haripk","build-version":"2.20","build-id":"2.20-21","build-number":"21"}]}

root@nodei10:~# gdb /var/tmp/contrail-vrouter-agent /var/tmp/nodei10_core.contrail-vroute.12748.nodei10.1431978044
GNU gdb (Ubuntu 7.7-0ubuntu3) 7.7
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /var/tmp/contrail-vrouter-agent...done.

warning: core file may not match specified executable file.
[New LWP 15119]
[New LWP 15139]
[New LWP 15122]
[New LWP 15137]
[New LWP 15160]
[New LWP 15127]
[New LWP 15121]
[New LWP 15171]
[New LWP 15120]
[New LWP 15154]
[New LWP 15180]
[New LWP 12748]
[New LWP 15156]
[New LWP 15117]
[New LWP 15147]
[New LWP 15181]
[New LWP 15177]
[New LWP 15183]
[New LWP 15179]
[New LWP 15123]
[New LWP 15143]
[New LWP 15126]
[New LWP 15173]
[New LWP 15169]
[New LWP 15176]
[New LWP 15116]
[New LWP 15185]
[New LWP 15178]
[New LWP 15150]
[New LWP 15144]
[New LWP 15118]
[New LWP 15170]
[New LWP 15166]
Core was generated by `/usr/bin/contrail-vrouter-agent'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007febf4a99cc9 in ?? ()
(gdb) bt
#0 0x00007febf4a99cc9 in ?? ()
#1 0x00007febf4a9d0d8 in ?? ()
#2 0x0000000000000020 in ?? ()
#3 0x0000000000000000 in ?? ()
(gdb) q

Changed in juniperopenstack:
assignee: Megh Bhatt (meghb) → chhandak (chhandak)
status: New → Incomplete
milestone: none → r2.30-fcs
importance: Undecided → Medium
Raj Reddy (rajreddy)
Changed in juniperopenstack:
status: Incomplete → Invalid
Megh Bhatt (meghb) wrote :

Seen on Harshad's setup (b3s13), where I got a proper bt. This is different from https://bugs.launchpad.net/juniperopenstack/+bug/1453236

root@b3s13:~# gdb /var/tmp/contrail-vrouter-agent /var/crashes/core.contrail-vroute.2209.b3s13.1433973699
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /var/tmp/contrail-vrouter-agent...done.

warning: core file may not match specified executable file.
[New LWP 3103]
[New LWP 3106]
[New LWP 3108]
[New LWP 3105]
[New LWP 3110]
[New LWP 6583]
[New LWP 3109]
[New LWP 3112]
[New LWP 3111]
[New LWP 3114]
[New LWP 3116]
[New LWP 3115]
[New LWP 3118]
[New LWP 3117]
[New LWP 3121]
[New LWP 3122]
[New LWP 3113]
[New LWP 3125]
[New LWP 2209]
[New LWP 3170]
[New LWP 3124]
[New LWP 3104]
[New LWP 3119]
[New LWP 3120]
[New LWP 3171]
[New LWP 3123]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-vrouter-agent'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f23bde20cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) f 5
#5 Dequeue (entry=0x7f23b74a6b10, this=0x7f2358029f20) at controller/src/base/queue_task.h:267
267 controller/src/base/queue_task.h: No such file or directory.
(gdb) p this->low_water_
$1 = {
  <std::_Vector_base<WaterMarkInfo, std::allocator<WaterMarkInfo> >> = {
    _M_impl = {
      <std::allocator<WaterMarkInfo>> = {
        <__gnu_cxx::new_allocator<WaterMarkInfo>> = {<No data fields>}, <No data fields>},
      members of std::_Vector_base<WaterMarkInfo, std::allocator<WaterMarkInfo> >::_Vector_impl:
      _M_start = 0x7f235806a210,
      _M_finish = 0x7f235806a288,
      _M_end_of_storage = 0x7f235806a2b0
    }
  }, <No data fields>}
(gdb) pvector this->low_water_
elem[0]: $2 = {
  count_ = 2500,
  cb_ = {
    <boost::function1<void, unsigned long>> = {
      <boost::function_base> = {
        vtable = 0x107e571 <void boost::function1<void, unsigned long>::assign_to<boost::_bi::bind_t<void, void (*)(unsigned long, SandeshLevel::type), boost::_bi::list2<boost::arg<1>, boost::_bi::value<SandeshLevel::type> > > >(boost::_bi::bind_t<void, void (*)(unsigned long, SandeshLevel::type), boost::_bi::list2<boost::arg<1>, boost::_bi::value<SandeshLevel::type> > >)::stored_vtable+1>,
        functor = {
          obj_ptr = 0xedbd10 <Sandesh::SetSendingLevel(unsigned long, SandeshLevel:...

Changed in juniperopenstack:
status: Invalid → New
assignee: chhandak (chhandak) → Megh Bhatt (meghb)
Megh Bhatt (meghb) wrote :

The issue seems to be that ProcessLowWaterMarks() and ProcessHighWaterMarks() in WorkQueue can run in parallel. Both of them read and can update hwater_index_ and lwater_index_, and the read and update of both indexes need to happen atomically, which currently they do not.

information type: Proprietary → Public
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/11573
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11573
Committed: http://github.org/Juniper/contrail-controller/commit/9e6722ae6f9648d44511dc6527801f5702ae7cbc
Submitter: Zuul
Branch: R2.20

commit 9e6722ae6f9648d44511dc6527801f5702ae7cbc
Author: Megh Bhatt <email address hidden>
Date: Fri Jun 12 15:22:26 2015 -0700

In WorkQueue, ProcessLowWaterMarks() and ProcessHighWaterMarks()
can be run in parallel during Dequeue() and Enqueue() respectively.
Both these functions read and update the high and low water indexes
(hwater_index_ and lwater_index_) and require that the read of and
updates to both the indexes are atomic. Add a mutex so that the
updates and read of both these indexes are atomic. Add unit tests
to test the same.
Closes-Bug: #1456457

Change-Id: I0fa89347e2bb526a36e256f754dc28c2ccede717
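
For readers who want to see the shape of such a fix, here is a minimal sketch of guarding both water mark indexes with one mutex. It is not the code from the commit: WaterMarkTracker, water_mutex_, and the threshold-matching logic are hypothetical simplifications, and only the names WaterMarkInfo, count_, cb_, hwater_index_, lwater_index_, ProcessHighWaterMarks, and ProcessLowWaterMarks come from this report. For brevity the sketch invokes callbacks while holding the lock, which the real code may not do.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the water mark state in WorkQueue.
// Field names follow the bug report; the logic is illustrative only.
struct WaterMarkInfo {
    std::size_t count_;                    // queue length threshold
    std::function<void(std::size_t)> cb_;  // invoked when the threshold is crossed
};

class WaterMarkTracker {
public:
    // Both vectors are assumed to be sorted by ascending count_.
    WaterMarkTracker(std::vector<WaterMarkInfo> high,
                     std::vector<WaterMarkInfo> low)
        : high_water_(std::move(high)), low_water_(std::move(low)),
          hwater_index_(-1), lwater_index_(-1) {}

    // Enqueue path: report the highest high water mark reached by count.
    void ProcessHighWaterMarks(std::size_t count) {
        // One lock covers the read and the update of BOTH indexes.
        std::lock_guard<std::mutex> lock(water_mutex_);
        int idx = -1;
        for (std::size_t k = 0; k < high_water_.size(); ++k) {
            if (count >= high_water_[k].count_) idx = static_cast<int>(k);
        }
        if (idx < 0 || idx == hwater_index_) return;  // nothing new crossed
        hwater_index_ = idx;
        lwater_index_ = -1;  // re-arm the low water marks
        assert(count >= high_water_[idx].count_);
        high_water_[idx].cb_(count);
    }

    // Dequeue path: report the lowest low water mark reached by count.
    void ProcessLowWaterMarks(std::size_t count) {
        std::lock_guard<std::mutex> lock(water_mutex_);
        int idx = -1;
        for (std::size_t k = low_water_.size(); k-- > 0;) {
            if (count <= low_water_[k].count_) idx = static_cast<int>(k);
        }
        if (idx < 0 || idx == lwater_index_) return;  // nothing new crossed
        lwater_index_ = idx;
        hwater_index_ = -1;  // re-arm the high water marks
        // Same form of invariant as the assertion that fired in the backtrace.
        assert(count <= low_water_[idx].count_);
        low_water_[idx].cb_(count);
    }

private:
    std::vector<WaterMarkInfo> high_water_;
    std::vector<WaterMarkInfo> low_water_;
    int hwater_index_;        // last high water mark reported, -1 if none
    int lwater_index_;        // last low water mark reported, -1 if none
    std::mutex water_mutex_;  // guards hwater_index_ and lwater_index_ together
};
```

In this sketch the enqueue side would call ProcessHighWaterMarks() with the new queue length and the dequeue side would call ProcessLowWaterMarks(). Because both take the same mutex, neither can observe one index updated and the other stale, which is the interleaving the commit message describes.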
