[k8s-R5.0]: Agent crash observed on k8s sanity run

Bug #1764506 reported by Pulkit Tandon
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| --- | --- | --- | --- | --- |
| Juniper Openstack | Status tracked in Trunk | | | |
| R5.0 | Fix Released | Critical | Hari Prasad Killi | |
| Trunk | Fix Committed | Critical | Nagendra E S | |

Bug Description

Configuration:
K8s 1.9.2
coat-5.0-15
Centos-7.4

Setup:
5 node setup.
1 Kube master. 3 Controller.
2 Agent+ K8s slaves

The issue was observed in a k8s sanity run:
LogsLocation : http://10.204.216.50/Docs/logs/5.0-15_2018_04_16_17_17_20_1523889114.59/logs/
Report : http://10.204.216.50/Docs/logs/5.0-15_2018_04_16_17_17_20_1523889114.59/junit-noframes.html

Agent crash observed on one of the nodes:
```
(gdb) bt full
#0 0x00007fe9e70491f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fe9e704a8e8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fe9e7042266 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fe9e7042312 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000000000c98ccf in NextHopTable::FreeInterfaceId(unsigned long) ()
No symbol table info available.
#5 0x0000000000c9efff in NextHop::~NextHop() ()
No symbol table info available.
#6 0x0000000000cad966 in CompositeNH::~CompositeNH() ()
No symbol table info available.
#7 0x0000000000ec77d8 in DBTablePartition::Remove(DBEntryBase*) ()
No symbol table info available.
#8 0x0000000000ec22e6 in DBPartition::QueueRunner::Run() ()
No symbol table info available.
#9 0x0000000000e9e44f in TaskImpl::execute() ()
No symbol table info available.
#10 0x00007fe9e7c188ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
No symbol table info available.
#11 0x00007fe9e7c145b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
No symbol table info available.
#12 0x00007fe9e7c13c8b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
No symbol table info available.
#13 0x00007fe9e7c1167f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
No symbol table info available.
#14 0x00007fe9e7c11879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
No symbol table info available.
#15 0x00007fe9e7e33e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#16 0x00007fe9e710c34d in clone () from /lib64/libc.so.6
No symbol table info available.
(gdb)
```

Collector crash observed on one of the Controllers:
```
(gdb) bt full
#0 0x00007fee763f4a37 in abort () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fee763ec266 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fee763ec312 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3 0x0000000000744b8d in OpServerProxy::DeleteUVEs(std::string const&, std::string const&, std::string const&, std::string const&) ()
No symbol table info available.
#4 0x0000000000649e46 in SandeshGenerator::DisconnectSession(VizSession*) ()
No symbol table info available.
#5 0x000000000063aeec in Collector::DisconnectSession(SandeshSession*) ()
No symbol table info available.
#6 0x000000000087863d in SandeshServerConnection::ProcessDisconnect(SandeshSession*) ()
No symbol table info available.
#7 0x0000000000875280 in ssm::Established::react(ssm::EvTcpClose const&) ()
No symbol table info available.
#8 0x00000000008755a8 in boost::statechart::simple_state<ssm::Established, SandeshStateMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*) ()
No symbol table info available.
#9 0x000000000087508b in boost::statechart::state_machine<SandeshStateMachine, ssm::Idle, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&) ()
No symbol table info available.
#10 0x000000000086d975 in SandeshStateMachine::DequeueEvent(SandeshStateMachine::EventContainer&) ()
No symbol table info available.
#11 0x0000000000874525 in QueueTaskRunner<SandeshStateMachine::EventContainer, WorkQueue<SandeshStateMachine::EventContainer> >::RunQueue() ()
No symbol table info available.
#12 0x000000000047781f in TaskImpl::execute() ()
No symbol table info available.
#13 0x00007fee777e18ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
No symbol table info available.
#14 0x00007fee777dd5b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
No symbol table info available.
#15 0x00007fee777dcc8b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
No symbol table info available.
#16 0x00007fee777da67f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
No symbol table info available.
#17 0x00007fee777da879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
No symbol table info available.
#18 0x00007fee779fce25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#19 0x00007fee764b634d in clone () from /lib64/libc.so.6
No symbol table info available.
(gdb)

```

Dumps are kept at the following path:
/home/bhushana/Documents/technical/logs/5.0-15_2018_04_16_17_17_20_1523889114.59

This core is also seen with build ocata-master-259, in the run logged in testheat.log.

/cs-shared/bugs/1764506
contrail-lbaas-haproxy-stdout.log
contrail-vrouter-agent.log
contrail-vrouter-nodemgr.log
core.contrail-vroute.6612.nodei1.1535490809
testheat.log

Pulkit Tandon (pulkitt) wrote :

Core dumps are kept at the following path:
/home/bhushana/Documents/technical/logs/5.0-15_2018_04_16_17_17_20_1523889114.59

Server:
<email address hidden>

OpenContrail Admin (ci-admin-f) wrote : [Review update] R5.0

Review in progress for https://review.opencontrail.org/41999
Submitter: Hari Prasad Killi (<email address hidden>)

Pulkit Tandon (pulkitt)
summary:
- [k8s-R5.0]: Agent and collector crash observed on k8s sanity run
+ [k8s-R5.0]: Agent crash observed on k8s sanity run
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/41999
Committed: http://github.com/Juniper/contrail-controller/commit/bfcc3706b4141fb261a9b11e3ff3132ab3230ed6
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0

commit bfcc3706b4141fb261a9b11e3ff3132ab3230ed6
Author: Hari Prasad Killi <email address hidden>
Date: Tue Apr 17 14:10:51 2018 +0530

Reverting the change done to retain NH ids upon restart

There are cases where the same NH id is being allocated to multiple composite
NHs, resulting in crash during free. This change is not really used in R5.0
and hence reverting the same. Should be added back in the next release with
appropriate changes.

Change-Id: I5986759934f323933176ea83c21fa1da22a1fc8f
closes-bug: #1764506

Pulkit Tandon (pulkitt) wrote :

No crash is observed on R5.0 sanity runs any more, hence closing this bug.

vimal (vappachan) wrote :

Same core is seen in

bt:

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/contrail-vrouter-agent...Reading symbols from /usr/bin/contrail-vrouter-agent...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 6622]
[New LWP 6625]
[New LWP 6620]
[New LWP 6626]
[New LWP 6628]
[New LWP 6627]
[New LWP 6624]
[New LWP 6612]
[New LWP 6621]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-vrouter-agent'.
Program terminated with signal 6, Aborted.
#0 0x00007fc58668b277 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install contrail-vrouter-agent-5.1.0-259.el7.x86_64
(gdb) bt
#0 0x00007fc58668b277 in raise () from /lib64/libc.so.6
#1 0x00007fc58668c968 in abort () from /lib64/libc.so.6
#2 0x00007fc586684096 in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007fc586684142 in __assert_fail () from /lib64/libc.so.6
#4 0x0000000000c852cf in NextHopTable::FreeInterfaceId(unsigned long) ()
#5 0x0000000000c8b79f in NextHop::~NextHop() ()
#6 0x0000000000c9a1d6 in CompositeNH::~CompositeNH() ()
#7 0x0000000000ea0648 in DBTablePartition::Remove(DBEntryBase*) ()
#8 0x0000000000e9b116 in DBPartition::QueueRunner::Run() ()
#9 0x0000000000e7727f in TaskImpl::execute() ()
#10 0x00007fc5872628ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) ()
   from /lib64/libtbb.so.2
#11 0x00007fc58725e5b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so
#12 0x00007fc58725dc8b in tbb::internal::market::process(rml::job&) ()
   from /lib64/libtbb.so.2
#13 0x00007fc58725b67f in tbb::internal::rml::private_worker::run() ()
   from /lib64/libtbb.so.2
#14 0x00007fc58725b879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
#15 0x00007fc58747de25 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fc586753bad in clone () from /lib64/libc.so.6

logs:
/cs-shared/bugs/1764506

vimal (vappachan) wrote :

Image : ocata-master-259

tags: added: sanityblocker
vimal (vappachan)
description: updated
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/46048
Submitter: Nagendra E S (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/46048
Committed: http://github.com/Juniper/contrail-controller/commit/0ac3c12da22267afd268916095febb8da744e9d1
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master

commit 0ac3c12da22267afd268916095febb8da744e9d1
Author: Nagendra E S <email address hidden>
Date: Fri Sep 7 09:54:09 2018 +0530

Reverting the change done to retain NH ids upon restart

There are cases where the same NH id is being allocated to multiple composite
NHs, resulting in crash during free. This change is not really used
and hence reverting the same. Can be added back in the later with
necessary changes.

Change-Id: I357c1c17c98b08f75622dfc5c6ca89096a32f96b
closes-bug: #1764506
