Agent crash due to failed concurreny check while freeing flow entry

Bug #1714371 reported by Jingjun
26
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
R3.0
New
High
Hari Prasad Killi
R3.0.3.x
Won't Fix
High
Hari Prasad Killi
R3.1
Fix Committed
High
Hari Prasad Killi
R3.2
Fix Committed
High
Hari Prasad Killi
R3.2.3.x
Fix Committed
High
Hari Prasad Killi
R4.0
Fix Committed
High
Hari Prasad Killi
R4.1
Fix Committed
High
Hari Prasad Killi
Trunk
Fix Committed
High
Hari Prasad Killi

Bug Description

Customer found unexpected reset of vrouter agent. We setup core generation and get the core files yesterday.
Can you please analyze what is the root cause of segment fault? Thanks a lot.

Core files:
ftp://ftp.juniper.net/pub/incoming/2017-0824-0719/rdm5r16c015.tar.gz
other logs are also inside this directory.

supervisord-vrouter.log
2017-08-08 18:59:46,152 INFO supervisord started with pid 8056
2017-08-08 18:59:47,154 INFO spawned: 'contrail-vrouter-nodemgr' with pid 8137
2017-08-08 18:59:47,155 INFO spawned: 'contrail-vrouter-dpdk' with pid 8138
2017-08-08 18:59:47,156 INFO spawned: 'contrail-vrouter-agent' with pid 8139
2017-08-08 18:59:48,535 INFO success: contrail-vrouter-nodemgr entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-08-08 18:59:48,535 INFO success: contrail-vrouter-dpdk entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-08-08 18:59:49,738 INFO stopped: contrail-vrouter-agent (terminated by SIGKILL)
2017-08-08 18:59:50,742 INFO spawned: 'contrail-vrouter-agent' with pid 8279
2017-08-08 18:59:55,813 INFO success: contrail-vrouter-agent entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2017-08-24 01:11:57,357 INFO exited: contrail-vrouter-agent (terminated by SIGSEGV (core dumped); not expected)
2017-08-24 01:11:58,360 INFO spawned: 'contrail-vrouter-agent' with pid 21364
2017-08-24 01:12:03,359 INFO success: contrail-vrouter-agent entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2017-08-24 01:29:08,359 INFO exited: contrail-vrouter-agent (terminated by SIGSEGV (core dumped); not expected)
2017-08-24 01:29:08,740 INFO spawned: 'contrail-vrouter-agent' with pid 41136
2017-08-24 01:29:14,031 INFO success: contrail-vrouter-agent entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2017-08-24 01:40:39,840 INFO exited: contrail-vrouter-agent (terminated by SIGSEGV (core dumped); not expected)
2017-08-24 01:40:40,843 INFO spawned: 'contrail-vrouter-agent' with pid 7524
2017-08-24 01:40:45,984 INFO success: contrail-vrouter-agent entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2017-08-24 03:09:06,665 INFO exited: contrail-vrouter-agent (terminated by SIGSEGV (core dumped); not expected)
2017-08-24 03:09:07,668 INFO spawned: 'contrail-vrouter-agent' with pid 23434
2017-08-24 03:09:12,733 INFO success: contrail-vrouter-agent entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2017-08-24 03:49:26,608 INFO exited: contrail-vrouter-agent (terminated by SIGSEGV (core dumped); not expected)
2017-08-24 03:49:26,991 INFO spawned: 'contrail-vrouter-agent' with pid 21699

contrail-vrouter-agent-stdout.log
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
error: failed to connect to the hypervisor
error: no valid connection

contrail-vrouter-dpdk-stdout.log
2017-08-24 01:11:56,479 USOCK: Error reading FD 68: Connection reset by peer (104)
2017-08-24 01:11:59,622 DPCORE: vrouter soft reset start
2017-08-24 01:11:59,622 VROUTER: Deleting vif 0 eth device

Revision history for this message
Jingjun (jingjunzhang) wrote :
Download full text (18.0 KiB)

root@sriov:~/var/log# cd coredump/
root@sriov:~/var/log/coredump# ls
core.contrail-vroute.24365.rdm5r16c015.rdm5b.cci.att.com.1504119226 core.contrail-vroute.37795.rdm5r16c015.rdm5b.cci.att.com.1504046049 core.contrail-vroute.5744.rdm5r16c015.rdm5b.cci.att.com.1504110311
core.contrail-vroute.34655.rdm5r16c015.rdm5b.cci.att.com.1504116100 core.contrail-vroute.41156.rdm5r16c015.rdm5b.cci.att.com.1504128467
root@sriov:~/var/log/coredump# date -d @1504119226
Wed Aug 30 11:53:46 PDT 2017
root@sriov:~/var/log/coredump# date -d @1504046049
Tue Aug 29 15:34:09 PDT 2017
root@sriov:~/var/log/coredump# date -d @1504110311
Wed Aug 30 09:25:11 PDT 2017
root@sriov:~/var/log/coredump# date -d @1504116100
Wed Aug 30 11:01:40 PDT 2017
root@sriov:~/var/log/coredump# date -d @1504128467
Wed Aug 30 14:27:47 PDT 2017
root@sriov:~/var/log/coredump#

root@sriov:~/var/log/coredump# strings core.contrail-vroute.37795.rdm5r16c015.rdm5b.cci.att.com.1504046049 | grep build-info
{"build-info": [{"build-time": "2017-06-29 09:59:25.240327", "build-hostname": "ubuntu", "build-user": "contrail-builder", "build-version": "3.2.3.2"}]}
{"build-info":[{"build-time":"2017-06-29 09:59:25.240327","build-hostname":"ubuntu","build-user":"contrail-builder","build-version":"3.2.3.2","build-id":"3.2.3.2-6","build-number":"6"}]}
{"build-infoA
{"build-info":[{"build-time":"2017-06-29 09:59:25.240327","build-hostname":"ubuntu","build-user":"contrail-builder","build-version":"3.2.3.2","build-id":"3.2.3.2-6","build-number":"6"}]}
{"build-info": [{"build-time": "2017-06-29 10:47:18.086620", "build-hostname": "ubuntu", "build-user": "contrail-builder", "build-version": "3.2.3.2"}]}
root@sriov:~/var/log/coredump# contrail-version
Package Version Build-ID | Repo | Package Name
-------------------------------------- ------------------------------ ----------------------------------
contrail-analytics 3.2.3.2-6 6
contrail-config 3.2.3.2-6 6
contrail-config-openstack 3.2.3.2-6 6
contrail-control 3.2.3.2-6 6
contrail-database-common 3.2.3.2-6 6
contrail-dns 3.2.3.2-6 6
contrail-docs 3.2.3.2-6 6
contrail-f5 3.2.3.2-6 6
contrail-fabric-utils 3.2.3.2-6 6
contrail-heat 3.2.3.2-6 6
contrail-install-packages 3.2.3.2-6~mitaka 6
contrail-lib 3.2.3.2-6 6
contrail-nodemgr 3.2.3.2-6 6
contrail-nova-netw...

Revision history for this message
Jingjun (jingjunzhang) wrote :

Hi Hari, can you please have someone in your team to analyze this? Thank you so much.

Changed in juniperopenstack:
assignee: nobody → Hari Prasad Killi (haripk)
importance: Undecided → High
tags: added: vrouter
Jim Reilly (jpreilly)
information type: Proprietary → Private
tags: added: att-aic-contrail
Revision history for this message
Hari Prasad Killi (haripk) wrote :

#3 0x00007f133a6beca2 in __GI___assert_fail (assertion=0x13944d0 "table_->ConcurrencyCheck(table_->flow_task_id()) == true",
    file=0x1394448 "controller/src/vnsw/agent/pkt/flow_table.cc", line=919,
    function=0x1394940 <FlowEntryFreeList::Free(FlowEntry*)::__PRETTY_FUNCTION__> "void FlowEntryFreeList::Free(FlowEntry*)") at assert.c:101
#4 0x0000000000cec455 in FlowEntryFreeList::Free (this=this@entry=0x7f132c604a40, flow=flow@entry=0x7f1318e8f0c0) at controller/src/vnsw/agent/pkt/flow_table.cc:919
#5 0x0000000000cdd528 in intrusive_ptr_release (fe=0x7f1318e8f0c0) at controller/src/vnsw/agent/pkt/flow_entry.cc:497
#6 0x0000000000cf7199 in ~intrusive_ptr (this=0x7f1320438ae0, __in_chrg=<optimized out>) at /usr/include/boost/smart_ptr/intrusive_ptr.hpp:97
#7 ~FlowMgmtRequest (this=0x7f1320438ad0, __in_chrg=<optimized out>) at controller/src/vnsw/agent/pkt/flow_mgmt_request.h:68
#8 FlowMgmtRequest::~FlowMgmtRequest (this=0x7f1320438ad0, __in_chrg=<optimized out>) at controller/src/vnsw/agent/pkt/flow_mgmt_request.h:68
#9 0x000000000087bcce in boost::detail::sp_counted_base::release (this=0x7f1320841a30) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:146
#10 0x0000000000cfd435 in ~shared_count (this=0x7f1333b27a58, __in_chrg=<optimized out>) at /usr/include/boost/smart_ptr/detail/shared_count.hpp:371
#11 ~shared_ptr (this=0x7f1333b27a50, __in_chrg=<optimized out>) at /usr/include/boost/smart_ptr/shared_ptr.hpp:328
#12 QueueTaskRunner<boost::shared_ptr<FlowMgmtRequest>, WorkQueue<boost::shared_ptr<FlowMgmtRequest> > >::RunQueue (this=0x7f1320605630) at controller/src/base/queue_task.h:108
#13 0x000000000130560f in TaskImpl::execute (this=0x7f1333cf3e40) at controller/src/base/task.cc:262

Revision history for this message
Hari Prasad Killi (haripk) wrote :

Waiting for this issue to be hit with an instrumented agent binary.

Jim Reilly (jpreilly)
tags: added: blocker
summary: - dpdk computes vrouter crashing
+ Agent crash while freeing flow entry
summary: - Agent crash while freeing flow entry
+ Agent crash due to failed concurreny check while freeing flow entry
Revision history for this message
Sreelakshmi (ssarva) wrote :

@Hari - As per Qasim this is not a valid configuration of using /16 for AAP for this VNF - vCTS. we need to let ATT know that /16 is not allowed for AAP.

Thanks
Sree

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/36224
Submitter: Naveen N (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2.3.x

Review in progress for https://review.opencontrail.org/36225
Submitter: Naveen N (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/36224
Committed: http://github.com/Juniper/contrail-controller/commit/23f0ddaa2b26405fa368c38cfde9417386fd6e66
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit 23f0ddaa2b26405fa368c38cfde9417386fd6e66
Author: Naveen N <email address hidden>
Date: Wed Oct 4 16:08:26 2017 +0530

* Read reverse flow entry after acquiring the lock

In case of short lived TCP session using same 5 tuple we might
end up accesssing a reverse flow entry which might be already deleted
and present in free list. Below set of events lead to this issue

1. F1 and R1 flow are added to flow table
2. R1 is written to vrouter
3. F1 is written to vrouter
4. R1 flow add response is received, triggering update of
F1(not needed now as reverse flow index is not written to
kernel?)
5. In the meantime flow is evicted in vrouter, hence flow
update for F1 would result in error from vrouter resulting in short flow
6. Since F1 is shortflow Flow delete gets enqueued
7. Since R1 is evict marked, flow evict gets enqueued
8. Both event F1 and R1 delete and evict event can run in
parallel,and hence reverse flow pointer obtained before FLOW
lock could be invalid, hence read back the same

Was able to simulate the same in UT, UT is disabled as it need some
instrumentation to run it to hit the issue

Change-Id: I368f465f9d446b43dfe0e4baa547d8c8cbfd6840
Closes-bug: #1714371

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/36244
Submitter: Naveen N (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/36244
Committed: http://github.com/Juniper/contrail-controller/commit/a0cb1666537424d81d6989e87c781f241d9ef41b
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit a0cb1666537424d81d6989e87c781f241d9ef41b
Author: Naveen N <email address hidden>
Date: Wed Oct 4 16:08:26 2017 +0530

* Read reverse flow entry after acquiring the lock

In case of short lived TCP session using same 5 tuple we might
end up accesssing a reverse flow entry which might be already deleted
and present in free list. Below set of events lead to this issue

1. F1 and R1 flow are added to flow table
2. R1 is written to vrouter
3. F1 is written to vrouter
4. R1 flow add response is received, triggering update of
F1(not needed now as reverse flow index is not written to
kernel?)
5. In the meantime flow is evicted in vrouter, hence flow
update for F1 would result in error from vrouter resulting in short flow
6. Since F1 is shortflow Flow delete gets enqueued
7. Since R1 is evict marked, flow evict gets enqueued
8. Both event F1 and R1 delete and evict event can run in
parallel,and hence reverse flow pointer obtained before FLOW
lock could be invalid, hence read back the same

Was able to simulate the same in UT, UT is disabled as it need some
instrumentation to run it to hit the issue

Change-Id: I368f465f9d446b43dfe0e4baa547d8c8cbfd6840
Closes-bug: #1714371

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.1

Review in progress for https://review.opencontrail.org/36408
Submitter: Hari Prasad Killi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/36408
Committed: http://github.com/Juniper/contrail-controller/commit/b9773f1187249b041fa490dd1eaf1080e9a4dda7
Submitter: Zuul (<email address hidden>)
Branch: R4.1

commit b9773f1187249b041fa490dd1eaf1080e9a4dda7
Author: Naveen N <email address hidden>
Date: Wed Oct 4 16:08:26 2017 +0530

* Read reverse flow entry after acquiring the lock

In case of short lived TCP session using same 5 tuple we might
end up accesssing a reverse flow entry which might be already deleted
and present in free list. Below set of events lead to this issue

1. F1 and R1 flow are added to flow table
2. R1 is written to vrouter
3. F1 is written to vrouter
4. R1 flow add response is received, triggering update of
F1(not needed now as reverse flow index is not written to
kernel?)
5. In the meantime flow is evicted in vrouter, hence flow
update for F1 would result in error from vrouter resulting in short flow
6. Since F1 is shortflow Flow delete gets enqueued
7. Since R1 is evict marked, flow evict gets enqueued
8. Both event F1 and R1 delete and evict event can run in
parallel,and hence reverse flow pointer obtained before FLOW
lock could be invalid, hence read back the same

Was able to simulate the same in UT, UT is disabled as it need some
instrumentation to run it to hit the issue

Change-Id: I368f465f9d446b43dfe0e4baa547d8c8cbfd6840
Closes-bug: #1714371
(cherry picked from commit a0cb1666537424d81d6989e87c781f241d9ef41b)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/36447
Submitter: Naveen N (<email address hidden>)

information type: Private → Public
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/36447
Committed: http://github.com/Juniper/contrail-controller/commit/4bcecc0547f5886bf0a3c8645817437d87d312e4
Submitter: Zuul (<email address hidden>)
Branch: master

commit 4bcecc0547f5886bf0a3c8645817437d87d312e4
Author: Naveen N <email address hidden>
Date: Wed Oct 4 16:08:26 2017 +0530

* Read reverse flow entry after acquiring the lock

In case of short lived TCP session using same 5 tuple we might
end up accesssing a reverse flow entry which might be already deleted
and present in free list. Below set of events lead to this issue

1. F1 and R1 flow are added to flow table
2. R1 is written to vrouter
3. F1 is written to vrouter
4. R1 flow add response is received, triggering update of
F1(not needed now as reverse flow index is not written to
kernel?)
5. In the meantime flow is evicted in vrouter, hence flow
update for F1 would result in error from vrouter resulting in short flow
6. Since F1 is shortflow Flow delete gets enqueued
7. Since R1 is evict marked, flow evict gets enqueued
8. Both event F1 and R1 delete and evict event can run in
parallel,and hence reverse flow pointer obtained before FLOW
lock could be invalid, hence read back the same

Was able to simulate the same in UT, UT is disabled as it need some
instrumentation to run it to hit the issue

Change-Id: I368f465f9d446b43dfe0e4baa547d8c8cbfd6840
Closes-bug: #1714371
(cherry picked from commit a0cb1666537424d81d6989e87c781f241d9ef41b)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/36829
Submitter: Hari Prasad Killi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/36829
Committed: http://github.com/Juniper/contrail-controller/commit/1754efe51606c68aafba8ce06d280a95247c7816
Submitter: Zuul (<email address hidden>)
Branch: R3.1

commit 1754efe51606c68aafba8ce06d280a95247c7816
Author: Naveen N <email address hidden>
Date: Wed Oct 4 16:08:26 2017 +0530

* Read reverse flow entry after acquiring the lock

In case of short lived TCP session using same 5 tuple we might
end up accesssing a reverse flow entry which might be already deleted
and present in free list. Below set of events lead to this issue

1. F1 and R1 flow are added to flow table
2. R1 is written to vrouter
3. F1 is written to vrouter
4. R1 flow add response is received, triggering update of
F1(not needed now as reverse flow index is not written to
kernel?)
5. In the meantime flow is evicted in vrouter, hence flow
update for F1 would result in error from vrouter resulting in short flow
6. Since F1 is shortflow Flow delete gets enqueued
7. Since R1 is evict marked, flow evict gets enqueued
8. Both event F1 and R1 delete and evict event can run in
parallel,and hence reverse flow pointer obtained before FLOW
lock could be invalid, hence read back the same

Was able to simulate the same in UT, UT is disabled as it need some
instrumentation to run it to hit the issue

Change-Id: I368f465f9d446b43dfe0e4baa547d8c8cbfd6840
Closes-bug: #1714371
(cherry picked from commit 23f0ddaa2b26405fa368c38cfde9417386fd6e66)

Revision history for this message
Hari Prasad Killi (haripk) wrote :

Based on the theory that eviction and delete dont run in parallel in R3.0.3, not picking this fix for that branch.

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/36225
Committed: http://github.com/Juniper/contrail-controller/commit/bba3d9c87beb51b00e482eeb8a0f51e2d7916a50
Submitter: Zuul (<email address hidden>)
Branch: R3.2.3.x

commit bba3d9c87beb51b00e482eeb8a0f51e2d7916a50
Author: Naveen N <email address hidden>
Date: Wed Oct 4 16:08:26 2017 +0530

* Read reverse flow entry after acquiring the lock

In case of short lived TCP session using same 5 tuple we might
end up accesssing a reverse flow entry which might be already deleted
and present in free list. Below set of events lead to this issue

1. F1 and R1 flow are added to flow table
2. R1 is written to vrouter
3. F1 is written to vrouter
4. R1 flow add response is received, triggering update of
F1(not needed now as reverse flow index is not written to
kernel?)
5. In the meantime flow is evicted in vrouter, hence flow
update for F1 would result in error from vrouter resulting in short flow
6. Since F1 is shortflow Flow delete gets enqueued
7. Since R1 is evict marked, flow evict gets enqueued
8. Both event F1 and R1 delete and evict event can run in
parallel,and hence reverse flow pointer obtained before FLOW
lock could be invalid, hence read back the same

Was able to simulate the same in UT, UT is disabled as it need some
instrumentation to run it to hit the issue

Change-Id: I368f465f9d446b43dfe0e4baa547d8c8cbfd6840
Closes-bug: #1714371

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.