Contrail-vrouter-agent status are timeout on both TSN nodes

Bug #1715061 reported by mehul
This bug affects 2 people
Affects              Status                     Importance  Assigned to
Juniper Openstack    Status tracked in Trunk
  R3.1               Fix Committed              Critical    Hari Prasad Killi
  R3.1.1.x           Invalid                    Critical    Hari Prasad Killi
  R3.2               Fix Committed              Critical    Hari Prasad Killi
  R3.2.3.x           In Progress                Critical    Hari Prasad Killi
  R4.0               Invalid                    Critical    Hari Prasad Killi
  Trunk              Invalid                    Critical    Hari Prasad Killi

Bug Description

Hi Team,

Issue: contrail-vrouter-agent status shows timeout on both TSN nodes.
Contrail-version: 3.1.3 Build81
Setup: LAB
Impact: The customer can no longer test in the lab, so they would like this issue treated as high priority.

This issue occurred at Aug-31 00:57 UTC on both TSN nodes in customer setup.

Please find the output of contrail-status on the TSN node.

contrail-status
root@lab3adp-00004nn:/var/log/contrail# contrail-status --debug | grep -v "/var/crashes/core"
== Contrail vRouter ==
supervisor-vrouter: active
~~~ snip ~~~
contrail-tor-agent-9 active (ToR:lab3dal-00001n connection up)
contrail-vrouter-agent: DEFAULT/S.http_server_port not present
Timeout error : HTTPConnectionPool(host='localhost', port=8085): Read timed out. (read timeout=2) <<<<<<<<<<<<<<< here
contrail-vrouter-agent timeout
contrail-vrouter-nodemgr: DEFAULT/S.http_server_port not present
contrail-vrouter-nodemgr active
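
The timeout reported above can be reproduced by hand when checking whether the agent has recovered. The following is a minimal sketch, assuming the agent's HTTP server is the one on localhost:8085 shown in the error and that a 2-second read timeout is appropriate. The function name `probe_introspect` and the request path `/` are illustrative assumptions and are not taken from contrail-status itself.

```python
import http.client
import socket


def probe_introspect(host="localhost", port=8085, timeout=2.0, path="/"):
    """Classify an HTTP endpoint as 'active', 'timeout', or 'down'.

    'timeout' means the TCP connection was accepted by the kernel but no
    HTTP response arrived within `timeout` seconds.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)      # path "/" is an assumption
        conn.getresponse().read()
        return "active"
    except socket.timeout:
        return "timeout"
    except (OSError, http.client.HTTPException):
        # Connection refused, reset, or a malformed reply.
        return "down"
    finally:
        conn.close()
```

Calling `probe_introspect()` on the TSN node would be expected to return "timeout" while the agent is in the state shown above, and "active" once it answers again.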

TCP connection:

tcp connection of vrouter-agent
root@lab3adp-00004nn:/var/log/contrail# netstat -na | grep 8085
tcp 0 0 0.0.0.0:8085 0.0.0.0:* LISTEN
tcp 205 0 127.0.0.1:8085 127.0.0.1:47110 CLOSE_WAIT
tcp 205 0 127.0.0.1:8085 127.0.0.1:48204 CLOSE_WAIT
tcp 205 0 127.0.0.1:8085 127.0.0.1:47615 CLOSE_WAIT
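
Each CLOSE_WAIT line above has a Recv-Q of 205, which would be consistent with a client having sent a request and closed the connection while the agent never read the data. A minimal sketch for picking such sockets out of `netstat -na` output follows; the function name `stuck_sockets` is hypothetical, and the parser assumes the six-column layout shown above.

```python
def stuck_sockets(netstat_text, port):
    """Return (local, remote, recv_q) for CLOSE_WAIT sockets on `port`
    that still hold unread bytes in their receive queue."""
    stuck = []
    suffix = ":%d" % port
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local remote state
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        _proto, recv_q, _send_q, local, remote, state = fields
        if state == "CLOSE_WAIT" and local.endswith(suffix) and int(recv_q) > 0:
            stuck.append((local, remote, int(recv_q)))
    return stuck
```

A growing list between two runs would suggest the process is still not servicing its HTTP port.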

There is no update after 16:07 UTC in the contrail-vrouter-agent.log

2017-08-31 Thu 16:07:33:948.952 UTC lab3adp-00004nn [Thread 139830359521024, Pid 29146]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Client: Event: Tcp Connected peer ip: 10.1.135.96 ( <email address hidden> ) controller/src/xmpp/xmpp_state_machine.cc 1331
2017-08-31 Thu 16:07:33:952.634 UTC lab3adp-00004nn [Thread 139830346925824, Pid 29146]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Client: Event: Tcp Connected peer ip: 10.1.135.97 ( <email address hidden> ) controller/src/xmpp/xmpp_state_machine.cc 1331
2017-08-31 Thu 16:07:33:956.338 UTC lab3adp-00004nn [Thread 139830372116224, Pid 29146]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Client: Event: Tcp Connected peer ip: 10.1.135.98 ( <email address hidden> ) controller/src/xmpp/xmpp_state_machine.cc 1331
2017-08-31 Thu 16:07:33:959.951 UTC lab3adp-00004nn [Thread 139830367917824, Pid 29146]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Client: Event: Tcp Connected peer ip: 10.1.135.97 ( <email address hidden> ) controller/src/xmpp/xmpp_state_machine.cc 1331
2017-08-31 Thu 16:17:03:901.242 UTC lab3adp-00004nn [Thread 139830359521024, Pid 29146]: DiscoveryClient [SYS_NOTICE]: DiscoveryClientLogMsg: Message Type: subscribe Service Name: xmpp-server Message: <xmpp-server><instances> 0</instances><min-instances>2</min-instances><client-type>contrail-vrouter-agent:0</client-type> <client>lab3adp-00004nn:contrail-vrouter-agent:0</client><remote-addr>10.1.135.75</remote-addr> <service-in-use-list><publisher-id>lb3np-cocr0001n</publisher-id><publisher-id>lb3np-cocr0002n</publisher-id></service-in-use-list></xmpp-server> controller/src/discovery/client/discovery_client.cc 860

During this time, they also observed loss of BUM traffic. To recover from this issue, they restarted supervisor-config (Aug-31 00:57 UTC). Core files for the tor-agent were generated at the same time, and after that vrouter-agent core files were generated frequently. The output is below.

root@lab3adp-00004nn:/var/crashes# ls -lrt | grep "Aug 3" | grep vroute
-rw------- 1 root root 2058956800 Aug 31 00:57 core.contrail-vroute.7058.lab3adp-00004nn.1504141046
-rw------- 1 root root 1850613760 Aug 31 02:33 core.contrail-vroute.22083.lab3adp-00004nn.1504146825
-rw------- 1 root root 1766981632 Aug 31 02:54 core.contrail-vroute.33696.lab3adp-00004nn.1504148053
-rw------- 1 root root 1847115776 Aug 31 04:52 core.contrail-vroute.36165.lab3adp-00004nn.1504155126
-rw------- 1 root root 1787453440 Aug 31 06:53 core.contrail-vroute.675.lab3adp-00004nn.1504162398
-rw------- 1 root root 1743376384 Aug 31 07:14 core.contrail-vroute.14629.lab3adp-00004nn.1504163642
-rw------- 1 root root 1854148608 Aug 31 08:37 core.contrail-vroute.17139.lab3adp-00004nn.1504168653
-rw------- 1 root root 1752174592 Aug 31 09:20 core.contrail-vroute.26634.lab3adp-00004nn.1504171237
-rw------- 1 root root 1770536960 Aug 31 10:11 core.contrail-vroute.31585.lab3adp-00004nn.1504174259
-rw------- 1 root root 1739558912 Aug 31 11:57 core.contrail-vroute.37345.lab3adp-00004nn.1504180656
-rw------- 1 root root 1805520896 Aug 31 13:19 core.contrail-vroute.482.lab3adp-00004nn.1504185590
-rw------- 1 root root 1739997184 Aug 31 13:38 core.contrail-vroute.9939.lab3adp-00004nn.1504186729
-rw------- 1 root root 1770528768 Aug 31 15:05 core.contrail-vroute.12171.lab3adp-00004nn.1504191941
-rw------- 1 root root 1751678976 Aug 31 16:07 core.contrail-vroute.22170.lab3adp-00004nn.1504195645
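
The intervals between crashes can be read off the core file names. The sketch below assumes the names follow a `core.<command>.<pid>.<host>.<epoch>` pattern, with the last field being the dump time in seconds since the Unix epoch. That is an assumption, although it agrees with the `ls` timestamps above. The names `parse_core_names` and `crash_gaps` are illustrative.

```python
import re
from collections import namedtuple
from datetime import datetime, timezone

Core = namedtuple("Core", "comm pid host when")

# Assumed layout: core.<command>.<pid>.<host>.<epoch seconds>
CORE_RE = re.compile(
    r"^core\.(?P<comm>[^.]+)\.(?P<pid>\d+)\.(?P<host>.+)\.(?P<epoch>\d+)$"
)


def parse_core_names(names):
    """Parse core file names into Core tuples sorted by dump time (UTC)."""
    cores = []
    for name in names:
        m = CORE_RE.match(name)
        if not m:
            continue  # skip anything that does not follow the assumed layout
        when = datetime.fromtimestamp(int(m.group("epoch")), tz=timezone.utc)
        cores.append(Core(m.group("comm"), int(m.group("pid")), m.group("host"), when))
    return sorted(cores, key=lambda c: c.when)


def crash_gaps(cores):
    """Seconds between each pair of consecutive cores."""
    return [(b.when - a.when).total_seconds() for a, b in zip(cores, cores[1:])]
```

Feeding it the file names from `/var/crashes` gives the time between successive agent restarts, which may help correlate the crashes with config events.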

The customer has shared the backtraces from the first, second, and third occurrences of the crash.

1. First time

root@lab3adp-00004nn:~# gdb contrail-vrouter-agent /var/crashes/core.contrail-vroute.7058.lab3adp-00004nn.1504141046
GNU gdb (Ubuntu 7.7-0ubuntu3) 7.7
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from contrail-vrouter-agent...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 7074]
[New LWP 7078]
[New LWP 7076]
[New LWP 13887]
[New LWP 7077]
[New LWP 7058]
[New LWP 7081]
[New LWP 7075]
[New LWP 7592]
[New LWP 13888]
[New LWP 7080]
[New LWP 7591]
[New LWP 7079]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-vrouter-agent'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000e37b66 in VrfExport::Notify(Agent const*, AgentXmppChannel*, DBTablePartBase*, DBEntryBase*) ()
(gdb) bt
#0 0x0000000000e37b66 in VrfExport::Notify(Agent const*, AgentXmppChannel*, DBTablePartBase*, DBEntryBase*) ()
#1 0x0000000000e362d8 in ControllerRouteWalker::VrfNotifyAll(DBTablePartBase*, DBEntryBase*) ()
#2 0x00000000011d28bf in DBTableWalker::Worker::Run() ()
#3 0x00000000013039e7 in TaskImpl::execute() ()
#4 0x00007f68666b1b3a in ?? () from /usr/lib/libtbb.so.2
#5 0x00007f68666ad816 in ?? () from /usr/lib/libtbb.so.2
#6 0x00007f68666acf4b in ?? () from /usr/lib/libtbb.so.2
#7 0x00007f68666a90ff in ?? () from /usr/lib/libtbb.so.2
#8 0x00007f68666a92f9 in ?? () from /usr/lib/libtbb.so.2
#9 0x00007f68668cd182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00007f6865ba647d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) quit

2. Second time

root@lab3adp-00004nn:~# gdb contrail-vrouter-agent /var/crashes/core.contrail-vroute.22083.lab3adp-00004nn.1504146825
Reading symbols from contrail-vrouter-agent...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 22106]
[New LWP 22100]
[New LWP 22103]
[New LWP 22616]
[New LWP 22102]
[New LWP 22615]
[New LWP 22105]
[New LWP 22083]
[New LWP 22107]
[New LWP 22104]
[New LWP 22101]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-vrouter-agent'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f33b4e49cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f33b4e49cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f33b4e4d0d8 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f33b4e42b86 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f33b4e42c32 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f33b5c36510 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x0000000000d470f3 in VnUveEntry::UpdatePortBitmap(unsigned char, unsigned short, unsigned short) ()
#6 0x0000000000d36033 in VnUveTable::UpdateBitmap(std::string const&, unsigned char, unsigned short, unsigned short) ()
#7 0x0000000000c71cb3 in FlowStatsCollector::NewFlow(FlowExportInfo const&) ()
#8 0x0000000000c746b3 in FlowStatsCollector::AddFlow(FlowExportInfo) ()
#9 0x0000000000c74a6e in FlowStatsCollector::RequestHandler(boost::shared_ptr<FlowExportReq>) ()
#10 0x0000000000c775bc in boost::detail::function::function_obj_invoker1<boost::_bi::bind_t<bool, boost::_mfi::mf1<bool, FlowStatsCollector, boost::shared_ptr<FlowExportReq> >, boost::_bi::list2<boost::_bi::value<FlowStatsCollector*>, boost::arg<1> > >, bool, boost::shared_ptr<FlowExportReq> >::invoke(boost::detail::function::function_buffer&, boost::shared_ptr<FlowExportReq>) ()
#11 0x0000000000c7c614 in QueueTaskRunner<boost::shared_ptr<FlowExportReq>, WorkQueue<boost::shared_ptr<FlowExportReq> > >::RunQueue() ()
#12 0x00000000013039e7 in TaskImpl::execute() ()
#13 0x00007f33b5a18b3a in ?? () from /usr/lib/libtbb.so.2
#14 0x00007f33b5a14816 in ?? () from /usr

3. Third time

root@lab3adp-00004nn:~# gdb contrail-vrouter-agent /var/crashes/core.contrail-vroute.33696.lab3adp-00004nn.1504148053
Reading symbols from contrail-vrouter-agent...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 33716]
[New LWP 33719]
[New LWP 33696]
[New LWP 33717]
[New LWP 33807]
[New LWP 33808]
[New LWP 33718]
[New LWP 33713]
[New LWP 33715]
[New LWP 33720]
[New LWP 33714]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-vrouter-agent'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f85443f1cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f85443f1cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f85443f50d8 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f85443eab86 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f85443eac32 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f85451de510 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x0000000000d470f3 in VnUveEntry::UpdatePortBitmap(unsigned char, unsigned short, unsigned short) ()
#6 0x0000000000d36033 in VnUveTable::UpdateBitmap(std::string const&, unsigned char, unsigned short, unsigned short) ()
#7 0x0000000000c71cb3 in FlowStatsCollector::NewFlow(FlowExportInfo const&) ()
#8 0x0000000000c746b3 in FlowStatsCollector::AddFlow(FlowExportInfo) ()
#9 0x0000000000c74a6e in FlowStatsCollector::RequestHandler(boost::shared_ptr<FlowExportReq>) ()
#10 0x0000000000c775bc in boost::detail::function::function_obj_invoker1<boost::_bi::bind_t<bool, boost::_mfi::mf1<bool, FlowStatsCollector, boost::shared_ptr<FlowExportReq> >, boost::_bi::list2<boost::_bi::value<FlowStatsCollector*>, boost::arg<1> > >, bool, boost::shared_ptr<FlowExportReq> >::invoke(boost::detail::function::function_buffer&, boost::shared_ptr<FlowExportReq>) ()
#11 0x0000000000c7c614 in QueueTaskRunner<boost::shared_ptr<FlowExportReq>, WorkQueue<boost::shared_ptr<FlowExportReq> > >::RunQueue() ()
#12 0x00000000013039e7 in TaskImpl::execute() ()
#13 0x00007f8544fc0b3a in ?? () from /usr/lib/libtbb.so.2
#14 0x00007f8544fbc816 in ?? () from /usr
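
The three backtraces fall into two groups: a SIGSEGV in VrfExport::Notify, and two SIGABRTs raised under VnUveEntry::UpdatePortBitmap. With many cores to go through, a rough way to group them is to take the innermost frame that gdb does not attribute to a shared library. The following is a minimal sketch of that idea; `crash_signature` is a hypothetical helper, and the regular expression only covers the symbol-less frame format shown above.

```python
import re

# Matches frames such as:
#   #5 0x0000000000d470f3 in Foo::Bar(int) ()
#   #0 0x00007f33b4e49cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
FRAME_RE = re.compile(
    r"^#(\d+)\s+0x[0-9a-fA-F]+ in (.*?) \(\)(?: from (\S+))?\s*$"
)


def crash_signature(backtrace_text):
    """Return the innermost function that belongs to the main binary,
    or None if no such frame is found."""
    for line in backtrace_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if not m:
            continue
        _index, func, library = m.groups()
        if library is None and func != "??":
            # Drop the argument list to keep only the qualified name.
            return func.split("(", 1)[0]
    return None
```

Applied to the traces above, this would yield `VrfExport::Notify` for the first core and `VnUveEntry::UpdatePortBitmap` for the second and third.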

I also researched this and found a related bug, https://bugs.launchpad.net/juniperopenstack/+bug/1667999, but it does not appear to be exactly the same issue, since different errors appear after the line "contrail-vrouter-agent: DEFAULT/S.http_server_port not present".

The customer's questions are as follows:

1. What is the proper way to recover from this issue?
2. What is the root cause of the contrail-vrouter-agent timeout on both nodes?

The logs and core files are at the location below.

IP:10.219.48.123, root/Jtaclab123
Path: /home/mehul/2017-0831-0965

-Regards,
Mehul Patel

Tags: vrouter nttc
Changed in juniperopenstack:
importance: Undecided → Critical
tags: added: vrouter
mehul (pmehul) wrote :

Hi Team,

This issue occurred while an ISSU migration was in progress (that is, config sync from the V1 controller to the V2 controller).

According to the customer, CRUD operations were being carried out in parallel.

-Regards,
Mehul Patel

mehul (pmehul)
information type: Proprietary → Public
mehul (pmehul) wrote :

Hi Team,

Below is the sequence of events.

They implemented Contrail ISSU in a lab environment:

=============================================
-rw-r--r-- 1 root root 10984 Aug 29 05:18 issu_contrail_generate_conf_2017_08_29_05_18_25_896844.log
-rw-r--r-- 1 root root 32121 Aug 29 05:29 issu_contrail_migrate_config_2017_08_29_05_19_00_206072.log
-rw-r--r-- 1 root root 7337 Aug 29 09:53 install_pkg_node_2017_08_29_09_52_05_948152.log
-rw-r--r-- 1 root root 2984 Aug 29 09:54 create_install_repo_node_2017_08_29_09_54_05_434516.log
-rw-r--r-- 1 root root 361021 Aug 29 09:56 create_install_repo_node_2017_08_29_09_54_32_088583.log
-rw-r--r-- 1 root root 38913 Aug 29 09:59 migrate_compute_kernel_node_2017_08_29_09_57_40_635039.log
-rw-r--r-- 1 root root 5892 Aug 29 10:12 issu_contrail_switch_collector_in_compute_node_revert_2017_08_29_10_12_32_966090.log
-rw-r--r-- 1 root root 46070 Aug 29 10:21 issu_contrail_migrate_compute_node_2017_08_29_10_20_39_185370.log
-rw-r--r-- 1 root root 537 Aug 29 10:21 update_post_issu_vrouter_param_node_custom_2017_08_29_10_21_53_900064.log
-rw-r--r-- 1 root root 448 Aug 29 10:25 reboot_node_2017_08_29_10_22_17_384617.log
-rw-r--r-- 1 root root 38913 Aug 29 10:34 migrate_compute_kernel_node_2017_08_29_10_33_07_187075.log
-rw-r--r-- 1 root root 5154 Aug 29 10:42 issu_contrail_switch_collector_in_compute_node_revert_2017_08_29_10_42_16_737673.log
-rw-r--r-- 1 root root 82139 Aug 29 10:47 issu_contrail_migrate_compute_node_2017_08_29_10_46_26_017647.log
-rw-r--r-- 1 root root 537 Aug 29 10:47 update_post_issu_vrouter_param_node_custom_2017_08_29_10_47_21_828548.log
-rw-r--r-- 1 root root 456 Aug 29 10:51 reboot_node_2017_08_29_10_47_34_063729.log
=============================================

After the ISSU implementation, many tor-agent processes crashed. During this time, vrouter-agent also crashed.

According to the customer, there was an issue with the ISSU synchronization process, which went into a loop. At that time, they repeatedly restarted supervisor-config on all controller nodes.

They then applied the countermeasure patch, which resolved the supervisor-config loop.

This was recovered around Aug 31 00:44. The supervisor-config process was up around Aug 31 00:51.

After that, at 00:57 (UTC), tor-agent crashed again, vrouter-agent then crashed repeatedly, and its status eventually became timeout.

-rw------- 1 root root 192475136 Aug 30 08:27 core.contrail-tor-ag.3867.lab3adp-00004nn.1504081664
-rw------- 1 root root 389111808 Aug 30 08:27 core.contrail-tor-ag.3871.lab3adp-00004nn.1504081666
-rw------- 1 root root 240373760 Aug 30 08:27 core.contrail-tor-ag.3869.lab3adp-00004nn.1504081671
~~~ snip ~~~
-rw------- 1 root root 43044864 Aug 30 08:28 core.contrail-tor-ag.6328.lab3adp-00004nn.1504081680
-rw------- 1 root root 43003904 Aug 30 08:28 core.contrail-tor-ag.6472.lab3adp-00004nn.1504081680
-rw------- 1 root root 42926080 Aug 30 08:28 core.contrail-tor-ag.6502.lab3adp-00004nn.1504081680
-rw------- 1 root root 2315395072 Aug 30 08:28 core.contrail-vroute.3874.lab3adp-00004nn.1504081678
-rw------- 1 root root 112717824 Aug 31 00:56 core.contrail-tor-ag.6602.lab3adp-00004nn.1504140985
-rw------- 1 roo...


mehul (pmehul) wrote :

One addition: they are in the process of reverting to v2.21.

mehul (pmehul) wrote :

Hi Team,

They expect an RCA for the issues below.

- contrail-vrouter-agent timeout
- contrail-vrouter-agent crashing continuously

Because of these issues, they cannot proceed with further testing on Contrail 3.1.

-Regards,
Mehul Patel

mehul (pmehul) wrote :

Hi Team,

I have also copied the contrail-control process core files from all the control nodes.

The location is the same as before.

-Regards,
Mehul Patel

mehul (pmehul) wrote :

Hi Hari,

The core files have been copied to the server below.

IP:10.219.48.123, root/Jtaclab123
Path:/home/mehul/2017-0831-0965

The core file core.29146 was taken when they executed contrail-status and vrouter-agent showed timeout.

[root@LocalStorage 2017-0831-0965]# pwd
/home/mehul/2017-0831-0965
[root@LocalStorage 2017-0831-0965]# ls -l core.29146
-rw-r--r--. 1 8378 8000 1820038456 Sep  1 08:40 core.29146

The vrouter-agent core files below were generated repeatedly after they restarted supervisor-config.

[root@LocalStorage 2017-0831-0965]# ls -l core.contrail-vroute.*
-rw-r--r--. 1 root root 289137919 Sep  7 14:34 core.contrail-vroute.12171.lab3adp-00004nn.1504191941.tar.gz
-rw-r--r--. 1 root root 292363076 Sep  7 14:35 core.contrail-vroute.22170.lab3adp-00004nn.1504195645.tar.gz
-rw-r--r--. 1 root root 285355175 Sep  7 14:30 core.contrail-vroute.26634.lab3adp-00004nn.1504171237.tar.gz
-rw-r--r--. 1 root root 300173644 Sep  7 14:31 core.contrail-vroute.31585.lab3adp-00004nn.1504174259.tar.gz
-rw-r--r--. 1 root root 287186571 Sep  7 14:32 core.contrail-vroute.37345.lab3adp-00004nn.1504180656.tar.gz
-rw-r--r--. 1 root root 295194691 Sep  7 14:32 core.contrail-vroute.482.lab3adp-00004nn.1504185590.tar.gz
-rw-r--r--. 1 root root 283980720 Sep  7 14:33 core.contrail-vroute.9939.lab3adp-00004nn.1504186729.tar.gz

-Regards,
Mehul Patel

Hari Prasad Killi (haripk) wrote :

Timeout scenario:
There are 6486 VRFs and 43,022 interfaces among the operational DB entries in the core dump. When config is restarted, all of the config is downloaded to the agent again, and the agent has to process it afresh (including all objects, their adjacency links, and so on). Other objects, such as the routes in all these VRFs, are added as well. The agent would be busy during this time and may not respond to contrail-status requests in time. We need to check how long the timeout status was seen. From the gcore and the data provided, I can't tell whether the agent was in any error state. Considering the scale involved, I have no way to conclude that everything is in order, but the things I checked (such as dumping some of the operational DB entries and objects) seemed fine. Such tests can be tried during our scale tests to establish the behavior.
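
To answer how long the timeout status lasted, the status can be sampled periodically and the longest unbroken run of timeouts measured. The following is a minimal sketch, assuming the samples are (unix time, status) pairs in chronological order and that the status strings are "active" and "timeout". Both the function name `longest_timeout` and the input format are hypothetical.

```python
def longest_timeout(samples):
    """Return the longest span, in seconds, covered by consecutive
    'timeout' samples. `samples` is an iterable of (unix_time, status)
    pairs in chronological order."""
    longest = 0.0
    run_start = None
    for when, status in samples:
        if status == "timeout":
            if run_start is None:
                run_start = when       # a new run of timeouts begins
            longest = max(longest, when - run_start)
        else:
            run_start = None           # any other status ends the run
    return longest
```

A run of a few minutes after a config restart would fit the "agent is busy" explanation, while a run that never ends would point to a hang instead.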

Crash reason:
The config being received by the agent is inconsistent, which causes the crashes (for example, a link is named "physical-router-physical-interface" but its left and right nodes are not of those types). Since the test wasn't done in a sane state (they had an ISSU failure, which was then patched, and the test continued), we should ignore these crashes and get the tests redone.
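
The kind of inconsistency described above can be expressed as a simple check: a link's name should be the two node types joined by a hyphen. The sketch below is only an illustration of that rule. The function `link_is_consistent` and its plain-string inputs are hypothetical and do not reflect how the agent represents config links internally; accepting either node order is also an assumption.

```python
def link_is_consistent(link_name, left_type, right_type):
    """Return True if `link_name` is the two node types joined by '-',
    in either order."""
    return link_name in (
        left_type + "-" + right_type,
        right_type + "-" + left_type,
    )
```

For example, a link named "physical-router-physical-interface" passes when its nodes are a physical-router and a physical-interface, and fails when the nodes are of any other types.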

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/35760
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/35762
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1.1.x

Review in progress for https://review.opencontrail.org/35763
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2.3.x

Review in progress for https://review.opencontrail.org/35764
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/35766
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/35767
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/35762
Committed: http://github.com/Juniper/contrail-controller/commit/58195a9c4b51892f62a162f5bc6a73b89915599c
Submitter: Zuul (<email address hidden>)
Branch: R3.1

commit 58195a9c4b51892f62a162f5bc6a73b89915599c
Author: Manish <email address hidden>
Date: Wed Sep 20 12:25:18 2017 +0530

Agent crash @VrfExport::Notify

Issue here is that notification is coming for deleted peer.
There was a check to handle route notify, same has to be present for vrf
notification as well.

Change-Id: Ib81ffb58ed8c3e0e1fa77dd7d6047aa0c9993c8f
Closes-bug: #1715061
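
The fix described in the commit message follows a common pattern: apply the same deleted-peer guard to VRF notifications that route notifications already had. The sketch below illustrates that pattern only. It is not the agent's C++ code, and the `Peer` and `Exporter` classes and their methods are hypothetical.

```python
class Peer:
    """Stand-in for a control-node peer; `deleted` marks a stale peer."""

    def __init__(self, name):
        self.name = name
        self.deleted = False


class Exporter:
    """Records notifications that would be sent to a peer."""

    def __init__(self):
        self.sent = []

    def route_notify(self, peer, route):
        # Guard that, per the commit message, already existed for routes.
        if peer is None or peer.deleted:
            return False
        self.sent.append(("route", peer.name, route))
        return True

    def vrf_notify(self, peer, vrf):
        # Same guard applied to VRF notifications, so a walk that
        # outlives the peer does not touch stale state.
        if peer is None or peer.deleted:
            return False
        self.sent.append(("vrf", peer.name, vrf))
        return True
```

With the guard in place, a notification that arrives after the peer is marked deleted is dropped instead of being processed.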

Manish Singh (manishs) wrote :

Headless mode is not present in 4.0, so the fix is not needed there.

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/35760
Committed: http://github.com/Juniper/contrail-controller/commit/040732520696dc2c502a86429b13d83b78de3382
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit 040732520696dc2c502a86429b13d83b78de3382
Author: Manish <email address hidden>
Date: Wed Sep 20 12:25:18 2017 +0530

Agent crash @VrfExport::Notify

Issue here is that notification is coming for deleted peer.
There was a check to handle route notify, same has to be present for vrf
notification as well.

Change-Id: Ib81ffb58ed8c3e0e1fa77dd7d6047aa0c9993c8f
Closes-bug: #1715061

mehul (pmehul) wrote :

Hi Manish,

Just need one clarification.

Hari mentioned that "All the crashes are due to same issue reported earlier in Launchpad. (https://bugs.launchpad.net/juniperopenstack/+bug/1700517)". Launchpad issue 1700517 was fixed in v3.1.3 Build77, is that correct? If so, was 1715061 caused by the same issue as 1700517, or does it have a different cause?

tags: added: nttc