Healthcheck: agent crash @ InstanceTaskExecvp::ReadData

Bug #1533627 reported by Senthilnathan Murugappan
This bug affects 1 person
Affects: Juniper Openstack (status tracked in Trunk)
  Series: Trunk
  Status: Fix Committed
  Importance: High
  Assigned to: Prabhjot Singh Sethi

Bug Description

Observed the crash below with Kilo build 2696. The box had one HTTP and one Ping healthcheck instance.
The core will be copied to /cs-shared/bugs/<bug_id>
(gdb) bt
#0 0x0000000000a33eb9 in close (ec=..., impl=..., this=0x313538333038365a) at /usr/include/boost/asio/detail/impl/reactive_descriptor_service.ipp:128
#1 close (ec=..., impl=..., this=0x3135383330383632) at /usr/include/boost/asio/posix/stream_descriptor_service.hpp:129
#2 close (ec=..., this=0x7fa5fc2ca4d8) at /usr/include/boost/asio/posix/basic_descriptor.hpp:222
#3 InstanceTaskExecvp::ReadData (this=0x7fa5fc2ca450, ec=..., read_bytes=<optimized out>) at controller/src/vnsw/agent/oper/instance_task.cc:34
#4 0x0000000000a35342 in operator() (a2=<optimized out>, a1=..., p=<optimized out>, this=0x7fff13d8c960) at /usr/include/boost/bind/mem_fn_template.hpp:280
#5 operator()<boost::_mfi::mf2<void, InstanceTaskExecvp, const boost::system::error_code&, long unsigned int>, boost::_bi::list2<const boost::system::error_code&, long unsigned int const&> > (a=<synthetic pointer>, f=..., this=0x7fff13d8c970) at /usr/include/boost/bind/bind.hpp:392
#6 operator()<boost::system::error_code, long unsigned int> (a2=@0x7fff13d8c988: 0, a1=..., this=0x7fff13d8c960) at /usr/include/boost/bind/bind_template.hpp:102
#7 operator() (this=0x7fff13d8c960) at /usr/include/boost/asio/detail/bind_handler.hpp:127
#8 asio_handler_invoke<boost::asio::detail::binder2<boost::_bi::bind_t<void, boost::_mfi::mf2<void, InstanceTaskExecvp, boost::system::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<InstanceTaskExecvp*>, boost::arg<1> (*)(), boost::arg<2> (*)()> >, boost::system::error_code, unsigned long> > (function=...)
    at /usr/include/boost/asio/handler_invoke_hook.hpp:64
#9 invoke<boost::asio::detail::binder2<boost::_bi::bind_t<void, boost::_mfi::mf2<void, InstanceTaskExecvp, boost::system::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<InstanceTaskExecvp*>, boost::arg<1> (*)(), boost::arg<2> (*)()> >, boost::system::error_code, unsigned long>, boost::_bi::bind_t<void, boost::_mfi::mf2<void, InstanceTaskExecvp, boost::system::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<InstanceTaskExecvp*>, boost::arg<1> (*)(), boost::arg<2> (*)()> > > (context=..., function=...) at /usr/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#10 boost::asio::detail::descriptor_read_op<boost::asio::mutable_buffers_1, boost::_bi::bind_t<void, boost::_mfi::mf2<void, InstanceTaskExecvp, boost::system::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<InstanceTaskExecvp*>, boost::arg<1> (*)(), boost::arg<2> (*)()> > >::do_complete (owner=0x2e6d190,
    base=<optimized out>) at /usr/include/boost/asio/detail/descriptor_read_op.hpp:104
#11 0x00000000009d3e57 in complete (bytes_transferred=0, ec=..., owner=..., this=0x7fa5d404d420) at /usr/include/boost/asio/detail/task_io_service_operation.hpp:37
#12 do_run_one (ec=..., this_thread=..., lock=..., this=0x2e6d190) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:384
#13 boost::asio::detail::task_io_service::run (this=0x2e6d190, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:153
#14 0x000000000105ec51 in run (this=0x2e6d120, ec=...) at /usr/include/boost/asio/impl/io_service.ipp:66
#15 EventManager::RunWithExceptionHandling (this=0x2e6d120) at controller/src/io/event_manager.cc:51
#16 0x00000000007b437e in main (argc=<optimized out>, argv=0x7fff13d8d678) at controller/src/vnsw/agent/contrail/main.cc:115
(gdb)

Tags: vrouter
Senthilnathan Murugappan (msenthil) wrote :

This is observed frequently when a healthcheck instance is detached from the VMI.

Prabhjot Singh Sethi (prabhjot) wrote :

The issue happens due to parallel access to the health check instance from two threads.

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/16523
Submitter: Prabhjot Singh Sethi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/16523
Committed: http://github.org/Juniper/contrail-controller/commit/4a4bb3d5b8af10b15a2c4776ea077778c51284bb
Submitter: Zuul
Branch: master

commit 4a4bb3d5b8af10b15a2c4776ea077778c51284bb
Author: Prabhjot Singh Sethi <email address hidden>
Date: Tue Jan 26 23:51:34 2016 +0530

Fix Healtcheck instance parallel access & cleanup

Issue:
------
Health check instance is getting access from asio and
DBtable task context causing race condition to access
object and delete it at the same time.

Fix:
----
- move operation for READ and EXIT to a new HealthCheck
task context which runs in exclusion with DBTable task
- move cleanup of instance from DBTable to HealthCheck
task context to put events in correct sequence
- instance holds reference to service object to assure
sanity of access till cleanup is complete

Closes-Bug: 1533627
Related-Bug: 1530539
Change-Id: I2880a2c21a8a642bd6612067be5b67ba02c88fe8
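
The idea behind the fix above can be illustrated with a minimal, self-contained sketch. It is not the agent's code: SerialQueue, Instance and RunDemo are hypothetical names invented for this example, and it uses only the C++ standard library rather than Boost.Asio or the agent's task scheduler. It assumes that READ, EXIT and cleanup are all posted to one serialized context, and that each queued event holds a reference to the instance so the object cannot be freed while an event for it is still pending.

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a task context: callbacks may be posted from
// any thread, but they run one at a time, in posting order, in Drain().
class SerialQueue {
public:
    void Post(std::function<void()> fn) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(fn));
    }

    void Drain() {
        for (;;) {
            std::function<void()> fn;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (tasks_.empty()) return;
                fn = std::move(tasks_.front());
                tasks_.pop_front();
            }
            fn();  // run outside the lock; fn is destroyed after each pass
        }
    }

private:
    std::mutex mutex_;
    std::deque<std::function<void()>> tasks_;
};

// Hypothetical stand-in for the health check instance.
struct Instance {
    explicit Instance(std::vector<std::string> *log) : log_(log) {}
    ~Instance() { log_->push_back("destroyed"); }
    void OnRead() { log_->push_back("read"); }
    void OnExit() { log_->push_back("exit"); }
    std::vector<std::string> *log_;
};

// Posts READ, EXIT and cleanup to the same queue and returns the order in
// which things happened.
std::vector<std::string> RunDemo() {
    std::vector<std::string> log;
    SerialQueue queue;

    auto inst = std::make_shared<Instance>(&log);
    std::shared_ptr<Instance> table_ref = inst;  // reference held by the config side

    // I/O side: the completion handlers only enqueue work. Each closure
    // captures a shared_ptr, so the instance outlives its pending events.
    queue.Post([inst] { inst->OnRead(); });
    queue.Post([inst] { inst->OnExit(); });
    inst.reset();

    // Config side: cleanup is queued behind the pending events instead of
    // deleting the instance directly from another context.
    queue.Post([&table_ref, &log] {
        log.push_back("cleanup");
        table_ref.reset();  // last reference: the instance is destroyed here
    });

    queue.Drain();
    return log;
}
```

Calling RunDemo() returns the events in the order read, exit, cleanup, destroyed: the instance is destroyed only after every event queued for it has run, which is the ordering the commit message describes.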
