contrail-control crash in UpdateQueue::Dequeue from BgpExport::Export

Bug #1536729 reported by Nischal Sheth
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
R2.20
Fix Committed
High
Nischal Sheth
R2.21.x
Fix Committed
High
Nischal Sheth
R2.22.x
Fix Released
High
Nischal Sheth
Trunk
Fix Committed
High
Nischal Sheth

Bug Description

R2.20 build 64

A crash is seen occasionally when a number of vrouter-agents restart.

#0 next_node (node=<synthetic pointer>) at /usr/include/boost/intrusive/detail/tree_algorithms.hpp:446
#1 next_node (p=<synthetic pointer>) at /usr/include/boost/intrusive/rbtree_algorithms.hpp:352
#2 operator++ (this=<synthetic pointer>) at /usr/include/boost/intrusive/detail/tree_node.hpp:128
#3 erase (i=..., this=0x7f2430200770) at /usr/include/boost/intrusive/rbtree.hpp:822
#4 erase (i=..., this=0x7f2430200770) at /usr/include/boost/intrusive/set.hpp:574
#5 UpdateQueue::Dequeue (this=0x7f2430200720, rt_update=rt_update@entry=0x7f2430343050) at controller/src/bgp/bgp_update_queue.cc:64
#6 0x00000000006655f7 in RibUpdateMonitor::GetRouteUpdateAndDequeue(DBEntryBase*, RouteUpdate*, boost::function<bool (RouteUpdate const*)>, bool*) (this=this@entry=0x7f2430200a80, db_entry=db_entry@entry=0x7f24147c1780, rt_update=rt_update@entry=0x7f2430343050, cmp=..., duplicate=duplicate@entry=0x7f24517f4a10) at controller/src/bgp/bgp_update_monitor.cc:64
#7 0x000000000066579e in RibUpdateMonitor::GetDBStateAndDequeue(DBEntryBase*, boost::function<bool (RouteUpdate const*)>, bool*) (this=this@entry=0x7f2430200a80, db_entry=db_entry@entry=0x7f24147c1780, cmp=..., duplicate=duplicate@entry=0x7f24517f4a10) at controller/src/bgp/bgp_update_monitor.cc:190
#8 0x000000000071b5dd in BgpExport::Export (this=0x7f2430200800, root=0x7f243013e320, db_entry=0x7f24147c1780) at controller/src/bgp/bgp_export.cc:64
#9 0x0000000000a4ce0a in operator() (a1=0x7f24147c1780, a0=0x7f243013e320, this=0x7f24517f4ac0) at /usr/include/boost/function/function_template.hpp:767
#10 RunNotify (entry=0x7f24147c1780, tpart=0x7f243013e320, this=0x7f2430027a70) at controller/src/db/db_table.cc:114
#11 DBTableBase::RunNotify (this=<optimized out>, tpart=tpart@entry=0x7f243013e320, entry=entry@entry=0x7f24147c1780) at controller/src/db/db_table.cc:198
#12 0x0000000000a4f8f8 in DBTablePartBase::RunNotify (this=this@entry=0x7f243013e320) at controller/src/db/db_table_partition.cc:46
#13 0x0000000000a4b7cd in DBPartition::QueueRunner::Run (this=0x7f2438110a60) at controller/src/db/db_partition.cc:199
#14 0x0000000000ab5720 in TaskImpl::execute (this=0x7f245940ee40) at controller/src/base/task.cc:238

Nischal Sheth (nsheth)
information type: Proprietary → Public
Vineet Jalali (vjalali)
tags: added: att-aic
removed: contrail-control
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/16434
Submitter: Nischal Sheth (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/16435
Submitter: Nischal Sheth (<email address hidden>)

Nischal Sheth (nsheth)
tags: added: contrail-control
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/16435
Committed: http://github.org/Juniper/contrail-controller/commit/57d156c7d8f186df311804fe904edfc16393772a
Submitter: Zuul
Branch: R2.20

commit 57d156c7d8f186df311804fe904edfc16393772a
Author: Nischal Sheth <email address hidden>
Date: Fri Jan 22 11:14:44 2016 -0800

Fix corner case in join processing

This problem can happen when a number of vRouter agents restart in
quick succesion and subscribe to a table and advertise routes into
the table.

Consider the following sequence of events on a route in foo.inet.0:

- Route is added with a path with attribute A and nexthop N1
- XMPP peer P1 subscribes to the table
- As a result a RouteUpdate is created in QBULK (the join queue). The
RouteUpdate has an UpdateInfo with a bitset P1 and a RibOutAttr with
attribute A and nexthop N1
- Another path for the same route is added with nexthop N2
- This new path is ecmp eligible i.e. has same local preference as the
best path
- XMPP peer P2 subscribes to the table
- As a result the existing RouteUpdate in QBULK gets a new UpdateInfo.
A new UpdateInfo is created because RibOutAttr for P2 has attribute A
and nexthops (N1, N2). Note that a different UpdateInfo is required
for different RibOutAttr. This UpdateInfo has a bitset P2.

Notice that we now have a RouteUpdate with 2 UpdateInfos that have the
same attribute A, but different RibOutAttrs (by virtue having different
forwarding nexthops).

The UpdateQueue maintains a set (attr_set_ of type UpdatesByAttr) of
UpdateInfos keyed by BgpAttr and timestamp. The label/nexthops in the
RibOutAttr are not included in key to achieve optimal packing of bgp
updates by attribute. As a result, both the UpdateInfos are inserted
(or rather attempted to be inserted) into attr_set_ with the same key.
This causes a crash when we later try to erase both UpdateInfos from
the attr_set_ when doing export processing for the route.

Note that we run into this case only if join processing for P2 happens
before export processing for the route after the 2nd path got added.
If export processing happens before the join for P2, the RouteUpdate
would move from QBULK to QUPDATE and would have only 1 UpdateInfo with
attribute A and nexthop (N1, N2).

The fix consists of 2 parts:

1. Use the UpdateInfo pointer itself as the final tie-breaker in the
key for UpdateQueue::UpdatesByAttr to ensure we have no duplicates.

2. When traversing the UpdatesByAttr set to build update messages, fix
UpdateQueue::AttrNext to not return an UpdateInfo for same RouteUpdate
as the current UpdateInfo. Doing so invalidates the locking design in
RibOutUpdates and results in a deadlock.

Add unit tests to recreate the above scenario and verify the fix.

Change-Id: I45ce1bbd72d8b6a163a5aa61358491cc1d3f6a93
Closes-Bug: 1536729

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/16434
Committed: http://github.org/Juniper/contrail-controller/commit/3fbaeb9f053c4c894e04221a47cbb33d6a36ed3e
Submitter: Zuul
Branch: master

commit 3fbaeb9f053c4c894e04221a47cbb33d6a36ed3e
Author: Nischal Sheth <email address hidden>
Date: Fri Jan 22 11:14:44 2016 -0800

Fix corner case in join processing

This problem can happen when a number of vRouter agents restart in
quick succesion and subscribe to a table and advertise routes into
the table.

Consider the following sequence of events on a route in foo.inet.0:

- Route is added with a path with attribute A and nexthop N1
- XMPP peer P1 subscribes to the table
- As a result a RouteUpdate is created in QBULK (the join queue). The
RouteUpdate has an UpdateInfo with a bitset P1 and a RibOutAttr with
attribute A and nexthop N1
- Another path for the same route is added with nexthop N2
- This new path is ecmp eligible i.e. has same local preference as the
best path
- XMPP peer P2 subscribes to the table
- As a result the existing RouteUpdate in QBULK gets a new UpdateInfo.
A new UpdateInfo is created because RibOutAttr for P2 has attribute A
and nexthops (N1, N2). Note that a different UpdateInfo is required
for different RibOutAttr. This UpdateInfo has a bitset P2.

Notice that we now have a RouteUpdate with 2 UpdateInfos that have the
same attribute A, but different RibOutAttrs (by virtue having different
forwarding nexthops).

The UpdateQueue maintains a set (attr_set_ of type UpdatesByAttr) of
UpdateInfos keyed by BgpAttr and timestamp. The label/nexthops in the
RibOutAttr are not included in key to achieve optimal packing of bgp
updates by attribute. As a result, both the UpdateInfos are inserted
(or rather attempted to be inserted) into attr_set_ with the same key.
This causes a crash when we later try to erase both UpdateInfos from
the attr_set_ when doing export processing for the route.

Note that we run into this case only if join processing for P2 happens
before export processing for the route after the 2nd path got added.
If export processing happens before the join for P2, the RouteUpdate
would move from QBULK to QUPDATE and would have only 1 UpdateInfo with
attribute A and nexthop (N1, N2).

The fix consists of 2 parts:

1. Use the UpdateInfo pointer itself as the final tie-breaker in the
key for UpdateQueue::UpdatesByAttr to ensure we have no duplicates.

2. When traversing the UpdatesByAttr set to build update messages, fix
UpdateQueue::AttrNext to not return an UpdateInfo for same RouteUpdate
as the current UpdateInfo. Doing so invalidates the locking design in
RibOutUpdates and results in a deadlock.

Add unit tests to recreate the above scenario and verify the fix.

Change-Id: I45ce1bbd72d8b6a163a5aa61358491cc1d3f6a93
Closes-Bug: 1536729

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22.x

Review in progress for https://review.opencontrail.org/16686
Submitter: Vinay Vithal Mahuli (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged
Download full text (8.5 KiB)

Reviewed: https://review.opencontrail.org/16686
Committed: http://github.org/Juniper/contrail-controller/commit/156ad0b760f9b532572116d813d7afa695555bea
Submitter: Zuul
Branch: R2.22.x

commit 156ad0b760f9b532572116d813d7afa695555bea
Author: Atul Moghe <email address hidden>
Date: Mon Dec 21 14:29:14 2015 -0800

Cherry pick controller commits from R2.20 to R2.22.x
updating version.info from 2.22 to 2.23 in 2.20 branch
Closes-Bug:#1528370

Change-Id: Ic649422979a926cc5f5b8457c01610b848dc206b

Storage stats daemon fix

Partial-Bug: #1528327
Fixed latency monitor code based on the Ceph 0.94.3 version.
Fixed issues in OSD throughput/IOPs calculation.
Updated code based on the latest Sandesh apis.

Change-Id: I12caf951f84c8b213b1b5ec01371bb68b4c48cb3

Fix contrail-collector back pressure mechanism

contrail-collector DB queue back presssure mechanism was not
working since the DB drop level is initialized to INVALID and
even the water marks levels are INVALID and hence the defer/undefer
callbacks are not called.

Change-Id: Ib28141a69aeed3c4ad6f50abbaed2a285e3e7db2
Partial-Bug: #1528380

Fix Agent crash for flow index tree management

Issue:
------
During a flow index change vrouter-agent triggers a delete
on index tree using new flow handle instead of currently
held flow_handle resulting in flow entry getting associated
to two slots in the flow index tree, which further on flow
entry delete due to aging or eviction never releases the
slot for old flow handle, causing failures for further
insertions in the flow index tree

Fix:
----
Avoid taking flow handle as argument to DeleteByIndex and
use the currently associated flow_handle to remove from tree
Adding assert in DeleteByIndex to catch delete failure
Avoid doing delete from index tree in code paths other than
flow entry index update of flow entry delete.

Add logic for KSync Sock User to Mock vrouter behavior
returning index for an entry if it is already allocated
instead of allocating a new one.

Closes-Bug: 1527425
Change-Id: I10e77fb59650acfdd924a5f1d35d6b8dea03a3f0

Fix discovery dependency issue. Originally made in master branch
via https://review.opencontrail.org/#/c/15749

Change-Id: I5d874de3714074c66fa73bfd7c9119772dc681fd
Partial-Bug: #1530186

Avoid calling get_routing_instances on VN object

Calling get_routing_instances could trigger another read of the VN
if the VN has no routing instance. This is not only inefficient, but
could also cause exception if the VN has disappeared. We can avoid
this by calling getattr.

Change-Id: Ie5500585b9e6c578576276c2c04ec03f32c75112
Partial-Bug: 1528950

Fix Centos 65 agent compilation issues.
Closes-Bug: #1532159

Change-Id: Ia8b77619c80737000d5bd949534c9e0a16967359

Closes-Bug: #1524063, contrail-status is showing contrail-web-ui, even it is not configured, in case of SMLite

Change-Id: I55afc19140b1ce52b3b529a644124705de5ce6a8

Fix a corner case with routing instance delete

Sequence of event that causes the crash
1. Static route config deleted
2. Static Route maanger triggers resolve_trigger_ to re-evaluate static
route config
3. Before the resolve trigger is invoked routing instance is deleted

Resolve trigger calls ProcessStaticRouteConfi...

Read more...

Revision history for this message
Jeba Paulaiyan (jebap) wrote :

Verified manually on Contrail Solution test:

Traffic test with 10 tenants, ~135 VMs, ~60 BMS; all VMs/BMS sending traffic to each other; features are vDNS/LLS/FIP/snat/lbaas; lots of flow setup/teardown; high b/w long lived flows; (3-4Gbps on each vRouter).

When the above traffic is ON, continuous vRouter restarts did not produce this contrail-control crash.

Note: Currently the service restart part of the test is not automated in Contrail Solution test. Automation of this part of test scheduled for 3.1.0.0 release. For releases before that, will execute the test manually.

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/18583
Submitter: Nischal Sheth (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/18583
Committed: http://github.org/Juniper/contrail-controller/commit/3ceca229be3533febe7965b3fa2aa14ac863ea61
Submitter: Zuul
Branch: R2.21.x

commit 3ceca229be3533febe7965b3fa2aa14ac863ea61
Author: Nischal Sheth <email address hidden>
Date: Fri Jan 22 11:14:44 2016 -0800

Fix corner case in join processing

This problem can happen when a number of vRouter agents restart in
quick succesion and subscribe to a table and advertise routes into
the table.

Consider the following sequence of events on a route in foo.inet.0:

- Route is added with a path with attribute A and nexthop N1
- XMPP peer P1 subscribes to the table
- As a result a RouteUpdate is created in QBULK (the join queue). The
RouteUpdate has an UpdateInfo with a bitset P1 and a RibOutAttr with
attribute A and nexthop N1
- Another path for the same route is added with nexthop N2
- This new path is ecmp eligible i.e. has same local preference as the
best path
- XMPP peer P2 subscribes to the table
- As a result the existing RouteUpdate in QBULK gets a new UpdateInfo.
A new UpdateInfo is created because RibOutAttr for P2 has attribute A
and nexthops (N1, N2). Note that a different UpdateInfo is required
for different RibOutAttr. This UpdateInfo has a bitset P2.

Notice that we now have a RouteUpdate with 2 UpdateInfos that have the
same attribute A, but different RibOutAttrs (by virtue having different
forwarding nexthops).

The UpdateQueue maintains a set (attr_set_ of type UpdatesByAttr) of
UpdateInfos keyed by BgpAttr and timestamp. The label/nexthops in the
RibOutAttr are not included in key to achieve optimal packing of bgp
updates by attribute. As a result, both the UpdateInfos are inserted
(or rather attempted to be inserted) into attr_set_ with the same key.
This causes a crash when we later try to erase both UpdateInfos from
the attr_set_ when doing export processing for the route.

Note that we run into this case only if join processing for P2 happens
before export processing for the route after the 2nd path got added.
If export processing happens before the join for P2, the RouteUpdate
would move from QBULK to QUPDATE and would have only 1 UpdateInfo with
attribute A and nexthop (N1, N2).

The fix consists of 2 parts:

1. Use the UpdateInfo pointer itself as the final tie-breaker in the
key for UpdateQueue::UpdatesByAttr to ensure we have no duplicates.

2. When traversing the UpdatesByAttr set to build update messages, fix
UpdateQueue::AttrNext to not return an UpdateInfo for same RouteUpdate
as the current UpdateInfo. Doing so invalidates the locking design in
RibOutUpdates and results in a deadlock.

Add unit tests to recreate the above scenario and verify the fix.

Change-Id: I45ce1bbd72d8b6a163a5aa61358491cc1d3f6a93
Closes-Bug: 1536729

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.