[3.1.0.0-6 ] Scale: control-node crash @ BgpPeer::FlushUpdate

Bug #1607617 reported by chhandak
This bug affects 2 people
Affects            Status                   Importance  Assigned to    Milestone
Juniper Openstack  Status tracked in Trunk
  R3.0             Fix Committed            Critical    Nischal Sheth
  R3.1             Fix Committed            Critical    Nischal Sheth
  Trunk            Fix Committed            Critical    Nischal Sheth

Bug Description

Observed the crash while loading the scaling config. The cluster was getting called with VN, LIF and VMI.

BackTrace
------------
(gdb) bt
#0 0x000000000086cd2a in BgpPeer::FlushUpdate (this=0x7f006ce4a840) at controller/src/bgp/bgp_peer.cc:1394
#1 0x00000000008d9048 in RibOutUpdates::UpdateFlush (this=this@entry=0x7f0004d7bb50, dst=..., blocked=blocked@entry=0x7f00567f8b20)
    at controller/src/bgp/bgp_ribout_updates.cc:463
#2 0x00000000008da9e6 in RibOutUpdates::TailDequeue (this=0x7f0004d7bb50, queue_id=1, msync=..., blocked=0x7f00567f8b20)
    at controller/src/bgp/bgp_ribout_updates.cc:235
#3 0x0000000000933fc1 in SchedulingGroup::UpdateRibOut (this=0x7f0004e52500, ribout=0x7f0004e580e0, queue_id=1)
    at controller/src/bgp/scheduling_group.cc:1087
#4 0x0000000000939145 in SchedulingGroup::Worker::Run (this=0x7efffd550600) at controller/src/bgp/scheduling_group.cc:421
#5 0x00000000006aebbf in TaskImpl::execute (this=0x7f0070b38740) at controller/src/base/task.cc:262
#6 0x00007f008a642b3a in ?? () from /usr/lib/libtbb.so.2
#7 0x00007f008a63e816 in ?? () from /usr/lib/libtbb.so.2
#8 0x00007f008a63df4b in ?? () from /usr/lib/libtbb.so.2
#9 0x00007f008a63a0ff in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f008a63a2f9 in ?? () from /usr/lib/libtbb.so.2
#11 0x00007f008a85e182 in start_thread (arg=0x7f00567f9700) at pthread_create.c:312
#12 0x00007f008992f47d in setfsuid () at ../sysdeps/unix/syscall-template.S:81
#13 0x0000000000000000 in ?? ()

chhandak (chhandak) wrote :

Core copied in
ubuntu-build04:/auto/cores/1607617> ls -lrt
total 909644
-rwxrwxrwx 1 chhandak epbg 927817728 Jul 28 22:17 core.contrail-contro.6246.5b7s3.1469768182

Changed in juniperopenstack:
importance: Undecided → Critical
assignee: nobody → Nischal Sheth (nsheth)
information type: Proprietary → Public

chhandak (chhandak)
tags: added: contrail-control

Nischal Sheth (nsheth) wrote :

Looks like session_ pointer in BgpPeer is bogus.
Access is supposed to be serialized using spin_mutex_.
Not sure why it's bogus.

Nischal Sheth (nsheth) wrote :

May have same root cause as bug 1607627, which could
be buffer overrun.

Nischal Sheth (nsheth) wrote :

Root cause identified - working on a fix.

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/22656
Submitter: Nischal Sheth (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/22657
Submitter: Nischal Sheth (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.0

Review in progress for https://review.opencontrail.org/22658
Submitter: Nischal Sheth (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/22656
Committed: http://github.org/Juniper/contrail-controller/commit/37498d0a58a6640a6f20530f9eeb87af45ea0dfd
Submitter: Zuul
Branch: R3.1

commit 37498d0a58a6640a6f20530f9eeb87af45ea0dfd
Author: Nischal Sheth <email address hidden>
Date: Fri Jul 29 09:32:26 2016 -0700

Fix buffer overrun in BgpPeer::SendUpdate/FlushUpdate

Since bgp::Send and bgp::StateMachine tasks can run concurrently, the
following sequence of events is possible:

1. SendUpdate is called a few times thus making the buffer close to full
2. State machine task runs and clears the session in the peer
3. SendUpdate is called again with a large message size
4. Causes FlushUpdate to be called as new message doesn't fit in buffer
5. FlushUpdate returns but does not clear buffer_len - this is the bug
6. SendUpdate appends the large message to buffer thus causing overrun

Change-Id: I3a3687a2a9989239c084e1005d9db0cf3398f713
Closes-Bug: 1607617
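
The six-step sequence in the commit message can be replayed with a small, single-threaded sketch. This is not the contrail-controller code: the class `PeerSketch`, the 64-byte buffer size, the message sizes, and the `clear_on_no_session` switch are all hypothetical, and resetting the length when there is no session is only one plausible reading of the fix, inferred from step 5. Where the scenario in the commit message would overrun the buffer, the sketch returns false instead of writing past the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a peer's buffered send path (illustration only).
class PeerSketch {
public:
    static constexpr size_t kBufferSize = 64;  // assumed size, not from the source

    explicit PeerSketch(bool clear_on_no_session)
        : has_session_(true),
          clear_on_no_session_(clear_on_no_session),
          buffer_len_(0) {}

    // Models the state machine task clearing the session (step 2).
    void ClearSession() { has_session_ = false; }

    // Returns false where an append would have run past the end of the buffer.
    bool SendUpdate(const uint8_t *msg, size_t len) {
        if (buffer_len_ + len > kBufferSize)
            FlushUpdate();                      // step 4: new message does not fit
        if (buffer_len_ + len > kBufferSize)
            return false;                       // step 6: this append would overrun
        std::memcpy(buffer_ + buffer_len_, msg, len);
        buffer_len_ += len;
        return true;
    }

    void FlushUpdate() {
        if (!has_session_) {
            // Step 5: the buggy variant returns with buffer_len_ untouched.
            // The assumed fix drops the pending bytes by resetting the length.
            if (clear_on_no_session_)
                buffer_len_ = 0;
            return;
        }
        // With a session, the data would be written out here; then reset.
        buffer_len_ = 0;
    }

    size_t buffer_len() const { return buffer_len_; }

private:
    bool has_session_;
    bool clear_on_no_session_;
    size_t buffer_len_;
    uint8_t buffer_[kBufferSize];
};

// Replays steps 1-6; returns true if the final large append fit in the buffer.
bool ReplaySequence(bool clear_on_no_session) {
    PeerSketch peer(clear_on_no_session);
    std::vector<uint8_t> small_msg(20, 0xAA);
    std::vector<uint8_t> large_msg(40, 0xBB);
    peer.SendUpdate(small_msg.data(), small_msg.size());  // step 1: 20 bytes used
    peer.SendUpdate(small_msg.data(), small_msg.size());  // 40 bytes used
    peer.SendUpdate(small_msg.data(), small_msg.size());  // 60 of 64 bytes used
    peer.ClearSession();                                  // step 2
    return peer.SendUpdate(large_msg.data(), large_msg.size());  // steps 3-6
}
```

Calling `ReplaySequence(false)` models the buggy behaviour and reports that the last append does not fit, while `ReplaySequence(true)` models the assumed fix and the append succeeds. The real race involves two concurrent tasks; the sketch only fixes their interleaving in the order the commit message describes.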

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/22657
Committed: http://github.org/Juniper/contrail-controller/commit/4a3da1c2a13b455626fdb16aca4119a65524288c
Submitter: Zuul
Branch: master

commit 4a3da1c2a13b455626fdb16aca4119a65524288c
Author: Nischal Sheth <email address hidden>
Date: Fri Jul 29 09:32:26 2016 -0700

Fix buffer overrun in BgpPeer::SendUpdate/FlushUpdate

Since bgp::Send and bgp::StateMachine tasks can run concurrently, the
following sequence of events is possible:

1. SendUpdate is called a few times thus making the buffer close to full
2. State machine task runs and clears the session in the peer
3. SendUpdate is called again with a large message size
4. Causes FlushUpdate to be called as new message doesn't fit in buffer
5. FlushUpdate returns but does not clear buffer_len - this is the bug
6. SendUpdate appends the large message to buffer thus causing overrun

Change-Id: I3a3687a2a9989239c084e1005d9db0cf3398f713
Closes-Bug: 1607617

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/22658
Committed: http://github.org/Juniper/contrail-controller/commit/7e21ca4997af024d2e05f791ad2d5cff84de753d
Submitter: Zuul
Branch: R3.0

commit 7e21ca4997af024d2e05f791ad2d5cff84de753d
Author: Nischal Sheth <email address hidden>
Date: Fri Jul 29 17:12:37 2016 -0700

Fix buffer overrun in BgpPeer::SendUpdate/FlushUpdate

Since bgp::Send and bgp::StateMachine tasks can run concurrently, the
following sequence of events is possible:

1. SendUpdate is called a few times thus making the buffer close to full
2. State machine task runs and clears the session in the peer
3. SendUpdate is called again with a large message size
4. Causes FlushUpdate to be called as new message doesn't fit in buffer
5. FlushUpdate returns but does not clear buffer_len - this is the bug
6. SendUpdate appends the large message to buffer thus causing overrun

Change-Id: Iaef79f622162e758503694eeaf6efb31da16c9a0
Closes-Bug: 1607617
