RCA required - Contrail-vrouter-agent status timeout on the controller nodes for the compute node XMPP hold timer message are printed

Bug #1740267 reported by vijaya kumar shankaran on 2017-12-27
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
Trunk
High
Hari Prasad Killi

Bug Description

Contrail-vrouter-agent reported status timeout on the controller node we notice that the XMPP hold timer expired for peer. Vrouter-agent was restarted to resolve the issue. Customer is running 2.21.3-57 contrail version on the compute node. PP Holdtimer expired between controller (kw1np-coct0002n, kw1np-coct0003n) and compute (kw1bp-vscp0040n)

At the time of the issue
kw1bp-vscp0040n contrail]$ date; sudo contrail-status
Thu Dec 21 18:47:05 UTC 2017
== Contrail vRouter ==
supervisor-vrouter: active
contrail-vrouter-agent timeout
contrail-vrouter-nodemgr active

From the contrail-vrouter-agent logs there is nothing much logged in seems the process was not responding

2017-12-21 Thu 17:02:29:863.012 UTC kw1bp-vscp0040n [Thread 47598182344448, Pid 13594]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Client: Event: Tcp Connected peer ip: 10.3.135.73 ( <email address hidden> ) controller/src/xmpp/xmpp_state_machine.cc 1313
2017-12-21 Thu 19:21:20:023.090 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: vhost interface name : vhost0
2017-12-21 Thu 19:21:20:023.187 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: vhost IP Address : 10.2.128.111/26
2017-12-21 Thu 19:21:20:023.197 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: vhost gateway : 10.2.128.65
2017-12-21 Thu 19:21:20:023.201 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: Ethernet port : bond1.61
2017-12-21 Thu 19:21:20:023.208 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: Xmpp Authentication : 0
2017-12-21 Thu 19:21:20:023.215 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: XMPP Server-1 : 0.0.0.0
2017-12-21 Thu 19:21:20:023.220 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: XMPP Server-2 : 0.0.0.0

Output of Process status at the time of the issue
kw1bp-vscp0040n contrail]$ date; ps auxw | grep contrail | grep -v grep
Thu Dec 21 18:48:52 UTC 2017
root 7151 0.0 0.0 230324 13240 ? Ss Apr13 77:11 /usr/bin/python /usr/bin/supervisord -c /etc/contrail/supervisord_vrouter.conf
root 13593 0.0 0.0 335140 28196 ? Sl Apr13 244:28 python /usr/bin/contrail-nodemgr --nodetype=contrail-vrouter
root 13594 19.2 0.1 1437088 341028 ? Sl Apr13 69952:24 /usr/bin/contrail-vrouter-agent
--------------------------------------------

From the message log on the affected compute node we notice
Dec 21 17:57:01 kw1bp-vscp0040n-jp1 log_monitor.py[5832]: [/var/log/ha-log] Dec 21 17:55:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[45035]: WARNING: xmpp_connection is False.Dec 21 17:55:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[45035]: WARNING: xmpp_status is False.Dec 21 17:55:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[45035]: ERROR: monitor_status is FalseDec 21 17:56:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[4007]: WARNING: xmpp_connection is False.Dec 21 17:56:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[4007]: WARNING: xmpp_status is False.Dec 21 17:56:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[4007]: ERROR: monitor_status is False

Dec 21 18:00:01 kw1bp-vscp0040n-jp1 log_monitor.py[28687]: [/var/log/ha-log] Dec 21 17:57:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[11824]: WARNING: xmpp_connection is False.Dec 21 17:57:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[11824]: WARNING: xmpp_status is False.Dec 21 17:57:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[11824]: ERROR: monitor_status is FalseDec 21 17:58:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[19828]: WARNING: xmpp_connection is False.Dec 21 17:58:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[19828]: WARNING: xmpp_status is False.Dec 21 17:58:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[19828]: ERROR: monitor_status is FalseDec 21 17:59:47 kw1bp-vscp0040n-jp1 contrail(contrail10)[27396]: WARNING: xmpp_connection is False.Dec 21 17:59:47 kw1bp-vscp0040n-jp1 contrail(contrail10)[27396]: WARNING: xmpp_status is False.Dec 21 17:59:47 kw1bp-vscp0040n-jp1 contrail(contrail10)[27396]: ERROR: monitor_status is False

Dec 21 19:21:01 kw1bp-vscp0040n-jp1 log_monitor.py[510]: [/var/log/ha-log] Dec 21 19:18:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[25833]: WARNING: xmpp_connection is False.Dec 21 19:18:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[25833]: WARNING: xmpp_status is False.Dec 21 19:18:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[25833]: ERROR: monitor_status is FalseDec 21 19:19:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[33548]: WARNING: xmpp_connection is False.Dec 21 19:19:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[33548]: WARNING: xmpp_status is False.Dec 21 19:19:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[33548]: ERROR: monitor_status is FalseDec 21 19:20:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[48093]: WARNING: xmpp_connection is False.Dec 21 19:20:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[48093]: WARNING: xmpp_status is False.Dec 21 19:20:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[48093]: ERROR: monitor_status is False

Vrouter agent was restarted at Thu Dec 21 19:21:16 and the vrouter-agent came back to active
From supervisord-Vrouter.log
2017-12-21 19:21:16,702 INFO stopped: contrail-vrouter-nodemgr (terminated by SIGTERM)
2017-12-21 19:21:16,744 INFO stopped: contrail-vrouter-agent (terminated by SIGKILL)
2017-12-21 19:21:18,750 INFO spawned: 'contrail-vrouter-nodemgr' with pid 3708
2017-12-21 19:21:18,752 INFO spawned: 'contrail-vrouter-agent' with pid 3709
2017-12-21 19:21:19,930 INFO success: contrail-vrouter-nodemgr entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-12-21 19:21:24,010 INFO success: contrail-vrouter-agent entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)

kw1bp-vscp0040n contrail]$ date; sudo contrail-status;sudo supervisorctl -c /etc/contrail/supervisord_vrouter.conf restart all; sleep 1 ; sudo contrail-status
Thu Dec 21 19:21:14 UTC 2017
== Contrail vRouter ==
supervisor-vrouter: active
contrail-vrouter-agent timeout
contrail-vrouter-nodemgr active

========Run time service failures=============
/var/crashes/core.ipmitool.38666.kw1bp-vscp0040n.1490327718
/var/crashes/core.ipmitool.10952.kw1bp-vscp0040n.1493190274
contrail-vrouter-nodemgr: stopped
contrail-vrouter-agent: stopped
contrail-vrouter-nodemgr: started
contrail-vrouter-agent: started
== Contrail vRouter ==
supervisor-vrouter: active
contrail-vrouter-agent active
contrail-vrouter-nodemgr active

At the control nodes XMPP connection status was reported as expired for this compute node
< kw1np-coct0002n /var/log/contrail/contrail-control.log >
--------------------------------------------
2017-12-21 Thu 17:55:07:275.465 UTC kw1np-coct0002n [Thread 140517761337088, Pid 45236]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Server: Event: HoldTimer Expired peer ip: 10.2.128.111 ( kw1bp-vscp0040n ) controller/src/xmpp/xmpp_state_machine.cc 1299
2017-12-21 Thu 17:55:07:303.237 UTC kw1np-coct0002n [Thread 140517895554816, Pid 45236]: SANDESH: Ratelimit Drop (100 messages/second): BGP [SYS_DEBUG]: XmppPeerTableLog: XMPP Peer kw1np-coct0002n:10.2.128.111 in table default-domain:ArchTest-05:4043ea15-17ef-4fba-ba08-acd80924c107:4043ea15-17ef-4fba-ba08-acd80924c107.inet6.0 in instance default-domain:ArchTest-05:4043ea15-17ef-4fba-ba08-acd80924c107:4043ea15-17ef-4fba-ba08-acd80924c107 : Unregister routing-table for RibInDelete, RibOutDelete, controller/src/bgp/bgp_peer_membership.cc 1056
--------------------------------------------

< kw1np-coct0003n /var/log/contrail/contrail-control.log >
--------------------------------------------
2017-12-21 Thu 17:55:10:321.834 UTC kw1np-coct0003n [Thread 139650406352640, Pid 15291]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Server: Event: HoldTimer Expired peer ip: 10.2.128.111 ( kw1bp-vscp0040n ) controller/src/xmpp/xmpp_state_machine.cc 1299
2017-12-21 Thu 17:55:10:468.395 UTC kw1np-coct0003n [Thread 139650397955840, Pid 15291]: SANDESH: Ratelimit Drop (100 messages/second): BGP [SYS_DEBUG]: XmppPeerTableLog: XMPP Peer kw1np-coct0003n:10.2.128.111 in table default-domain:ArchTest-05:4043ea15-17ef-4fba-ba08-acd80924c107:4043ea15-17ef-4fba-ba08-acd80924c107.inet6.0 in instance default-domain:ArchTest-05:4043ea15-17ef-4fba-ba08-acd80924c107:4043ea15-17ef-4fba-ba08-acd80924c107 : Unregister routing-table for RibInDelete, RibOutDelete, controller/src/bgp/bgp_peer_membership.cc 1056

Looks related to https://bugs.launchpad.net/juniperopenstack/+bug/1715061

Customer needs an RCA.

information type: Proprietary → Public

Hi Team,

can we have an updates on this issue?

Bets Regards,
Vijay Kumar

Jeba Paulaiyan (jebap) on 2018-01-13
tags: added: config contrail-control
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers