Comment 2 for bug 1715061

mehul (pmehul) wrote :

Hi Team,

Below is the sequence of events.

The customer performed a Contrail ISSU in a lab environment. The log files from that run:

=============================================
-rw-r--r-- 1 root root 10984 Aug 29 05:18 issu_contrail_generate_conf_2017_08_29_05_18_25_896844.log
-rw-r--r-- 1 root root 32121 Aug 29 05:29 issu_contrail_migrate_config_2017_08_29_05_19_00_206072.log
-rw-r--r-- 1 root root 7337 Aug 29 09:53 install_pkg_node_2017_08_29_09_52_05_948152.log
-rw-r--r-- 1 root root 2984 Aug 29 09:54 create_install_repo_node_2017_08_29_09_54_05_434516.log
-rw-r--r-- 1 root root 361021 Aug 29 09:56 create_install_repo_node_2017_08_29_09_54_32_088583.log
-rw-r--r-- 1 root root 38913 Aug 29 09:59 migrate_compute_kernel_node_2017_08_29_09_57_40_635039.log
-rw-r--r-- 1 root root 5892 Aug 29 10:12 issu_contrail_switch_collector_in_compute_node_revert_2017_08_29_10_12_32_966090.log
-rw-r--r-- 1 root root 46070 Aug 29 10:21 issu_contrail_migrate_compute_node_2017_08_29_10_20_39_185370.log
-rw-r--r-- 1 root root 537 Aug 29 10:21 update_post_issu_vrouter_param_node_custom_2017_08_29_10_21_53_900064.log
-rw-r--r-- 1 root root 448 Aug 29 10:25 reboot_node_2017_08_29_10_22_17_384617.log
-rw-r--r-- 1 root root 38913 Aug 29 10:34 migrate_compute_kernel_node_2017_08_29_10_33_07_187075.log
-rw-r--r-- 1 root root 5154 Aug 29 10:42 issu_contrail_switch_collector_in_compute_node_revert_2017_08_29_10_42_16_737673.log
-rw-r--r-- 1 root root 82139 Aug 29 10:47 issu_contrail_migrate_compute_node_2017_08_29_10_46_26_017647.log
-rw-r--r-- 1 root root 537 Aug 29 10:47 update_post_issu_vrouter_param_node_custom_2017_08_29_10_47_21_828548.log
-rw-r--r-- 1 root root 456 Aug 29 10:51 reboot_node_2017_08_29_10_47_34_063729.log
=============================================

After the ISSU, many tor-agent processes crashed. During this time, vrouter-agent also crashed.

According to the customer, the ISSU synchronization process was stuck in a loop. At that time, they repeatedly restarted supervisor-config on all controller nodes.

They then applied the countermeasure patch, which resolved the supervisor-config loop.

This recovered at around 00:44 on Aug 31. The supervisor-config process was up at around 00:51 on Aug 31.

After that, at 00:57 (UTC), tor-agent crashed again, and vrouter-agent crashed repeatedly and eventually timed out.

-rw------- 1 root root 192475136 Aug 30 08:27 core.contrail-tor-ag.3867.lab3adp-00004nn.1504081664
-rw------- 1 root root 389111808 Aug 30 08:27 core.contrail-tor-ag.3871.lab3adp-00004nn.1504081666
-rw------- 1 root root 240373760 Aug 30 08:27 core.contrail-tor-ag.3869.lab3adp-00004nn.1504081671
~~~ snip ~~~
-rw------- 1 root root 43044864 Aug 30 08:28 core.contrail-tor-ag.6328.lab3adp-00004nn.1504081680
-rw------- 1 root root 43003904 Aug 30 08:28 core.contrail-tor-ag.6472.lab3adp-00004nn.1504081680
-rw------- 1 root root 42926080 Aug 30 08:28 core.contrail-tor-ag.6502.lab3adp-00004nn.1504081680
-rw------- 1 root root 2315395072 Aug 30 08:28 core.contrail-vroute.3874.lab3adp-00004nn.1504081678
-rw------- 1 root root 112717824 Aug 31 00:56 core.contrail-tor-ag.6602.lab3adp-00004nn.1504140985
-rw------- 1 root root 50020352 Aug 31 00:56 core.contrail-tor-ag.6620.lab3adp-00004nn.1504140990
-rw------- 1 root root 53112832 Aug 31 00:56 core.contrail-tor-ag.6613.lab3adp-00004nn.1504140990
~~~ snip ~~~
-rw------- 1 root root 44822528 Aug 31 00:57 core.contrail-tor-ag.6587.lab3adp-00004nn.1504141049
-rw------- 1 root root 43237376 Aug 31 00:57 core.contrail-tor-ag.21168.lab3adp-00004nn.1504141049
-rw------- 1 root root 48709632 Aug 31 00:57 core.contrail-tor-ag.21271.lab3adp-00004nn.1504141050

-rw------- 1 root root 62373888 Aug 30 08:27 core.contrail-tor-ag.3918.lab3adp-00005nn.1504081672
-rw------- 1 root root 53563392 Aug 30 08:27 core.contrail-tor-ag.3919.lab3adp-00005nn.1504081672
-rw------- 1 root root 60043264 Aug 30 08:27 core.contrail-tor-ag.3929.lab3adp-00005nn.1504081672
-rw------- 1 root root 53788672 Aug 30 08:27 core.contrail-tor-ag.3928.lab3adp-00005nn.1504081672
~~~ snip ~~~
-rw------- 1 root root 43008000 Aug 30 08:28 core.contrail-tor-ag.10771.lab3adp-00005nn.1504081680
-rw------- 1 root root 43036672 Aug 30 08:28 core.contrail-tor-ag.10819.lab3adp-00005nn.1504081680
-rw------- 1 root root 43270144 Aug 30 08:28 core.contrail-tor-ag.10410.lab3adp-00005nn.1504081684
-rw------- 1 root root 99872768 Aug 31 00:56 core.contrail-tor-ag.11297.lab3adp-00005nn.1504140990
-rw------- 1 root root 48185344 Aug 31 00:57 core.contrail-tor-ag.10901.lab3adp-00005nn.1504141046
-rw------- 1 root root 48037888 Aug 31 00:57 core.contrail-tor-ag.10843.lab3adp-00005nn.1504141046
~~~ snip ~~~
-rw------- 1 root root 42999808 Aug 31 00:57 core.contrail-tor-ag.25303.lab3adp-00005nn.1504141049
-rw------- 1 root root 43163648 Aug 31 00:57 core.contrail-tor-ag.25029.lab3adp-00005nn.1504141049
-rw------- 1 root root 43028480 Aug 31 00:57 core.contrail-tor-ag.25299.lab3adp-00005nn.1504141051
=========================
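
To line the crashes up against the recovery times above, the core file names can be decoded. This is only a rough sketch: it assumes the names follow the pattern core.<process>.<pid>.<host>.<epoch>, with the last field in seconds since the Unix epoch (UTC). That layout is inferred from the listings, not confirmed from the node's core_pattern setting. The function names are hypothetical helpers, not part of Contrail.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_core_name(name):
    """Split an assumed 'core.<process>.<pid>.<host>.<epoch>' file name."""
    parts = name.split(".")
    if len(parts) != 5 or parts[0] != "core":
        raise ValueError(f"unexpected core file name: {name!r}")
    _, process, pid, host, epoch = parts
    return {
        "process": process,
        "pid": int(pid),
        "host": host,
        # Assumption: the last field is seconds since the Unix epoch (UTC).
        "time": datetime.fromtimestamp(int(epoch), tz=timezone.utc),
    }


def crashes_per_minute(names):
    """Count core files per (host, process, minute), in UTC."""
    counts = Counter()
    for name in names:
        core = parse_core_name(name)
        minute = core["time"].strftime("%Y-%m-%d %H:%M")
        counts[(core["host"], core["process"], minute)] += 1
    return counts


if __name__ == "__main__":
    # A few names copied from the listings above.
    sample = [
        "core.contrail-tor-ag.3867.lab3adp-00004nn.1504081664",
        "core.contrail-tor-ag.3871.lab3adp-00004nn.1504081666",
        "core.contrail-vroute.3874.lab3adp-00004nn.1504081678",
        "core.contrail-tor-ag.6602.lab3adp-00004nn.1504140985",
    ]
    for key, count in sorted(crashes_per_minute(sample).items()):
        print(key, count)
```

Under that assumption, the trailing field 1504140985 on the first Aug 31 core from lab3adp-00004nn decodes to 2017-08-31 00:56:25 UTC, which agrees with the 00:56 modification time in the listing.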

The customer restarted supervisor-vrouter on the TSN yesterday. The issue (vrouter-agent timeouts and crashes) recovered temporarily. However, the regular vrouter-agent crashes have recurred.

For now, they have rolled back to the v2.21 environment.

-Regards,
Mehul Patel