Control node deadlock in RibOutUpdates::TailDequeue

Bug #1411855 reported by Nischal Sheth
This bug affects 1 person
Affects              Status          Importance  Assigned to    Milestone
Juniper Openstack    Status tracked in Trunk
  R1.1               Fix Committed   Critical    Nischal Sheth
  R2.0               Fix Released    Critical    Nischal Sheth
  R2.1               Fix Released    Critical    Nischal Sheth
  Trunk              Fix Committed   Critical    Nischal Sheth
OpenContrail         Fix Released    Critical    Nischal Sheth

Bug Description

The control node deadlocked in RibOutUpdates::TailDequeue. UpdatePack
tried to acquire the lock on the next RouteUpdate in the attr_set_ of
UpdateQueue, but TailDequeue already held the lock on that same
RouteUpdate because it was the previous element in the (temporal)
queue_ of UpdateQueue.

The root cause is that calling clock_gettime with CLOCK_MONOTONIC does
not guarantee strictly increasing timestamps: two calls can return the
same value, so two RouteUpdates can end up with identical timestamps.

(gdb) info threads
  Id Target Id Frame
  4 Thread 0x7f86bd323700 (LWP 25182) "contrail-contro" 0x00007f86c4ea3f2c in __lll_lock_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
  3 Thread 0x7f86bcf22700 (LWP 25183) "contrail-contro" 0x00007f86c4ea3f2c in __lll_lock_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
  2 Thread 0x7f86bcb21700 (LWP 25184) "contrail-contro" 0x00007f86c4ea1414 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
* 1 Thread 0x7f86c69487c0 (LWP 25177) "contrail-contro" 0x00007f86c3f6e5fa in epoll_ctl () from /lib/x86_64-linux-gnu/libc.so.6

Thread 4 is the lock owner

(gdb) frame 5
#5 0x00000000007622df in RibUpdateMonitor::GetAttrNext (this=0x7f86b0096b50, queue_id=1, current_uinfo=0x7f86b0093510,
    next_uinfo_p=0x7f86bd322770) at controller/src/bgp/bgp_update_monitor.cc:973
973 RouteUpdatePtr update(mp, rt_update, &mutex_, &cond_var_);

Next queue element by attribute:
(gdb) p mp
$1 = (tbb::mutex *) 0x7f86b02eedb0
(gdb) p rt_update
$2 = (RouteUpdate *) 0x7f86b02eed90

#8 0x000000000071f95d in RibOutUpdates::TailDequeue (this=0x7f86b0096780, queue_id=1, msync=..., blocked=0x7f86bd322af0)
    at controller/src/bgp/bgp_ribout_updates.cc:187
187 if (!DequeueCommon(start_marker, update.get(), blocked)) {

The update previously processed by TailDequeue (which is still locked) is:
(gdb) p next_update
$3 = {entry_mutexp_ = 0x7f86b02eedb0, rt_update_ = 0x7f86b02eed90, monitor_mutexp_ = 0x7f86b0096b50, cond_var_ = 0x7f86b0096b78}

Previous and current elements have the same timestamp:
(gdb) p update.rt_update_->tstamp_
$6 = 851624330824509
(gdb) p next_update.rt_update_->tstamp_
$7 = 851624330824509

This causes the order of the attribute queue to depend on the UpdateInfo pointer.
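
To illustrate the ordering problem, here is a minimal sketch. It is not the contrail-controller code: the Update struct, the ByTimeThenAddress comparator and the VisitsInEnqueueOrder helper are hypothetical names, and the comparator is an assumed simplification (timestamp first, address as tie-break). With distinct timestamps the set visits elements in enqueue order; with equal timestamps the visit order is decided by the addresses and may disagree with the order in which the elements were enqueued.

```cpp
#include <cstdint>
#include <functional>
#include <set>

// Hypothetical stand-in for RouteUpdate; only the timestamp matters here.
struct Update {
    uint64_t tstamp;
};

// Assumed comparator: order by timestamp, break ties by object address.
struct ByTimeThenAddress {
    bool operator()(const Update *a, const Update *b) const {
        if (a->tstamp != b->tstamp) return a->tstamp < b->tstamp;
        return std::less<const Update *>()(a, b);
    }
};

// Returns true if a set ordered by the comparator above visits `first`
// (the element enqueued earlier) before `second`.
bool VisitsInEnqueueOrder(const Update *first, const Update *second) {
    std::set<const Update *, ByTimeThenAddress> attr_set;
    attr_set.insert(first);
    attr_set.insert(second);
    return *attr_set.begin() == first;
}
```

If two updates carry the same tstamp and the later one happens to live at a lower address, VisitsInEnqueueOrder returns false: the attribute-ordered view and the temporal view disagree about which element comes next.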

Nischal Sheth (nsheth)
description: updated
Changed in opencontrail:
status: New → In Progress
Changed in juniperopenstack:
importance: Undecided → Critical
assignee: nobody → Nischal Sheth (nsheth)
status: New → In Progress
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/6290
Committed: http://github.org/Juniper/contrail-controller/commit/6bb5d55d43d1e37dec2372b21f09ee1dacbb8c42
Submitter: Zuul
Branch: R1.10

commit 6bb5d55d43d1e37dec2372b21f09ee1dacbb8c42
Author: Nischal Sheth <email address hidden>
Date: Fri Jan 16 13:08:02 2015 -0800

Fix deadlock in RibOutUpdates::TailDequeue/PeerDequeue

RibOutUpdates::TailDequeue/PeerDequeue can deadlock if 2 RouteUpdates
have the same timestamp. Calling clock_gettime with CLOCK_MONOTONIC
does not guarantee monotonically increasing timestamps.

Use atomic uint64_t to implement relative timestamp for RouteUpdate.

Change-Id: I328fc96405c51dace5b5e8a79b80026631a0bb4b
Partial-Bug: 1411855
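
The approach named in the commit message (an atomic uint64_t used as a relative timestamp) could look roughly like the sketch below. This is an illustration only, not the committed change: the SequenceClock class and its Next method are hypothetical, and std::atomic is an assumption about which atomic type is used.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper, not the actual contrail-controller code.
// Hands out relative timestamps from a shared counter instead of a clock.
class SequenceClock {
public:
    // Each call returns a value that no other call returns, so two
    // updates can never share a timestamp. Within a single thread the
    // returned values are strictly increasing.
    uint64_t Next() {
        return counter_.fetch_add(1, std::memory_order_relaxed) + 1;
    }

private:
    std::atomic<uint64_t> counter_{0};
};
```

Because every value is unique, an ordering keyed on this timestamp never needs a tie-break, so it cannot fall back to comparing pointers.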

tags: added: blocker
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/6287
Committed: http://github.org/Juniper/contrail-controller/commit/734fe0a4f2a53e527a0bebe94b0d41e5b677acad
Submitter: Zuul
Branch: R2.1

commit 734fe0a4f2a53e527a0bebe94b0d41e5b677acad
Author: Nischal Sheth <email address hidden>
Date: Fri Jan 16 13:08:02 2015 -0800

Fix deadlock in RibOutUpdates::TailDequeue/PeerDequeue

RibOutUpdates::TailDequeue/PeerDequeue can deadlock if 2 RouteUpdates
have the same timestamp. Calling clock_gettime with CLOCK_MONOTONIC
does not guarantee monotonically increasing timestamps.

Use atomic uint64_t to implement relative timestamp for RouteUpdate.

Change-Id: I328fc96405c51dace5b5e8a79b80026631a0bb4b
Partial-Bug: 1411855

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/6286
Committed: http://github.org/Juniper/contrail-controller/commit/3b7fcc22f259f12726c16496bb33658cec83fc7b
Submitter: Zuul
Branch: master

commit 3b7fcc22f259f12726c16496bb33658cec83fc7b
Author: Nischal Sheth <email address hidden>
Date: Fri Jan 16 13:08:02 2015 -0800

Fix deadlock in RibOutUpdates::TailDequeue/PeerDequeue

RibOutUpdates::TailDequeue/PeerDequeue can deadlock if 2 RouteUpdates
have the same timestamp. Calling clock_gettime with CLOCK_MONOTONIC
does not guarantee monotonically increasing timestamps.

Use atomic uint64_t to implement relative timestamp for RouteUpdate.

Change-Id: I328fc96405c51dace5b5e8a79b80026631a0bb4b
Partial-Bug: 1411855

Nischal Sheth (nsheth)
Changed in opencontrail:
status: In Progress → Fix Committed
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/6288
Committed: http://github.org/Juniper/contrail-controller/commit/98ad0f1f76c551da141730a44fc918d7acaff6d9
Submitter: Zuul
Branch: R2.0

commit 98ad0f1f76c551da141730a44fc918d7acaff6d9
Author: Nischal Sheth <email address hidden>
Date: Fri Jan 16 13:08:02 2015 -0800

Fix deadlock in RibOutUpdates::TailDequeue/PeerDequeue

RibOutUpdates::TailDequeue/PeerDequeue can deadlock if 2 RouteUpdates
have the same timestamp. Calling clock_gettime with CLOCK_MONOTONIC
does not guarantee monotonically increasing timestamps.

Use atomic uint64_t to implement relative timestamp for RouteUpdate.

Change-Id: I328fc96405c51dace5b5e8a79b80026631a0bb4b
Partial-Bug: 1411855

Nischal Sheth (nsheth)
Changed in opencontrail:
status: Fix Committed → Fix Released