RCA required - Contrail-vrouter-agent status timeout on the controller nodes for the compute node XMPP hold timer message are printed

Bug #1740267 reported by vijaya kumar shankaran
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
Trunk
Incomplete
High
vijaya kumar shankaran

Bug Description

Contrail-vrouter-agent reported status timeout on the controller node we notice that the XMPP hold timer expired for peer. Vrouter-agent was restarted to resolve the issue. Customer is running 2.21.3-57 contrail version on the compute node. PP Holdtimer expired between controller (kw1np-coct0002n, kw1np-coct0003n) and compute (kw1bp-vscp0040n)

At the time of the issue
kw1bp-vscp0040n contrail]$ date; sudo contrail-status
Thu Dec 21 18:47:05 UTC 2017
== Contrail vRouter ==
supervisor-vrouter: active
contrail-vrouter-agent timeout
contrail-vrouter-nodemgr active

From the contrail-vrouter-agent logs there is nothing much logged in seems the process was not responding

2017-12-21 Thu 17:02:29:863.012 UTC kw1bp-vscp0040n [Thread 47598182344448, Pid 13594]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Client: Event: Tcp Connected peer ip: 10.3.135.73 ( <email address hidden> ) controller/src/xmpp/xmpp_state_machine.cc 1313
2017-12-21 Thu 19:21:20:023.090 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: vhost interface name : vhost0
2017-12-21 Thu 19:21:20:023.187 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: vhost IP Address : 10.2.128.111/26
2017-12-21 Thu 19:21:20:023.197 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: vhost gateway : 10.2.128.65
2017-12-21 Thu 19:21:20:023.201 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: Ethernet port : bond1.61
2017-12-21 Thu 19:21:20:023.208 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: Xmpp Authentication : 0
2017-12-21 Thu 19:21:20:023.215 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: XMPP Server-1 : 0.0.0.0
2017-12-21 Thu 19:21:20:023.220 UTC kw1bp-vscp0040n [Thread 46985195744064, Pid 3709]: XMPP Server-2 : 0.0.0.0

Output of Process status at the time of the issue
kw1bp-vscp0040n contrail]$ date; ps auxw | grep contrail | grep -v grep
Thu Dec 21 18:48:52 UTC 2017
root 7151 0.0 0.0 230324 13240 ? Ss Apr13 77:11 /usr/bin/python /usr/bin/supervisord -c /etc/contrail/supervisord_vrouter.conf
root 13593 0.0 0.0 335140 28196 ? Sl Apr13 244:28 python /usr/bin/contrail-nodemgr --nodetype=contrail-vrouter
root 13594 19.2 0.1 1437088 341028 ? Sl Apr13 69952:24 /usr/bin/contrail-vrouter-agent
--------------------------------------------

From the message log on the affected compute node we notice
Dec 21 17:57:01 kw1bp-vscp0040n-jp1 log_monitor.py[5832]: [/var/log/ha-log] Dec 21 17:55:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[45035]: WARNING: xmpp_connection is False.Dec 21 17:55:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[45035]: WARNING: xmpp_status is False.Dec 21 17:55:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[45035]: ERROR: monitor_status is FalseDec 21 17:56:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[4007]: WARNING: xmpp_connection is False.Dec 21 17:56:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[4007]: WARNING: xmpp_status is False.Dec 21 17:56:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[4007]: ERROR: monitor_status is False

Dec 21 18:00:01 kw1bp-vscp0040n-jp1 log_monitor.py[28687]: [/var/log/ha-log] Dec 21 17:57:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[11824]: WARNING: xmpp_connection is False.Dec 21 17:57:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[11824]: WARNING: xmpp_status is False.Dec 21 17:57:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[11824]: ERROR: monitor_status is FalseDec 21 17:58:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[19828]: WARNING: xmpp_connection is False.Dec 21 17:58:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[19828]: WARNING: xmpp_status is False.Dec 21 17:58:46 kw1bp-vscp0040n-jp1 contrail(contrail10)[19828]: ERROR: monitor_status is FalseDec 21 17:59:47 kw1bp-vscp0040n-jp1 contrail(contrail10)[27396]: WARNING: xmpp_connection is False.Dec 21 17:59:47 kw1bp-vscp0040n-jp1 contrail(contrail10)[27396]: WARNING: xmpp_status is False.Dec 21 17:59:47 kw1bp-vscp0040n-jp1 contrail(contrail10)[27396]: ERROR: monitor_status is False

Dec 21 19:21:01 kw1bp-vscp0040n-jp1 log_monitor.py[510]: [/var/log/ha-log] Dec 21 19:18:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[25833]: WARNING: xmpp_connection is False.Dec 21 19:18:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[25833]: WARNING: xmpp_status is False.Dec 21 19:18:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[25833]: ERROR: monitor_status is FalseDec 21 19:19:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[33548]: WARNING: xmpp_connection is False.Dec 21 19:19:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[33548]: WARNING: xmpp_status is False.Dec 21 19:19:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[33548]: ERROR: monitor_status is FalseDec 21 19:20:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[48093]: WARNING: xmpp_connection is False.Dec 21 19:20:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[48093]: WARNING: xmpp_status is False.Dec 21 19:20:52 kw1bp-vscp0040n-jp1 contrail(contrail10)[48093]: ERROR: monitor_status is False

Vrouter agent was restarted at Thu Dec 21 19:21:16 and the vrouter-agent came back to active
From supervisord-Vrouter.log
2017-12-21 19:21:16,702 INFO stopped: contrail-vrouter-nodemgr (terminated by SIGTERM)
2017-12-21 19:21:16,744 INFO stopped: contrail-vrouter-agent (terminated by SIGKILL)
2017-12-21 19:21:18,750 INFO spawned: 'contrail-vrouter-nodemgr' with pid 3708
2017-12-21 19:21:18,752 INFO spawned: 'contrail-vrouter-agent' with pid 3709
2017-12-21 19:21:19,930 INFO success: contrail-vrouter-nodemgr entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-12-21 19:21:24,010 INFO success: contrail-vrouter-agent entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)

kw1bp-vscp0040n contrail]$ date; sudo contrail-status;sudo supervisorctl -c /etc/contrail/supervisord_vrouter.conf restart all; sleep 1 ; sudo contrail-status
Thu Dec 21 19:21:14 UTC 2017
== Contrail vRouter ==
supervisor-vrouter: active
contrail-vrouter-agent timeout
contrail-vrouter-nodemgr active

========Run time service failures=============
/var/crashes/core.ipmitool.38666.kw1bp-vscp0040n.1490327718
/var/crashes/core.ipmitool.10952.kw1bp-vscp0040n.1493190274
contrail-vrouter-nodemgr: stopped
contrail-vrouter-agent: stopped
contrail-vrouter-nodemgr: started
contrail-vrouter-agent: started
== Contrail vRouter ==
supervisor-vrouter: active
contrail-vrouter-agent active
contrail-vrouter-nodemgr active

At the control nodes XMPP connection status was reported as expired for this compute node
< kw1np-coct0002n /var/log/contrail/contrail-control.log >
--------------------------------------------
2017-12-21 Thu 17:55:07:275.465 UTC kw1np-coct0002n [Thread 140517761337088, Pid 45236]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Server: Event: HoldTimer Expired peer ip: 10.2.128.111 ( kw1bp-vscp0040n ) controller/src/xmpp/xmpp_state_machine.cc 1299
2017-12-21 Thu 17:55:07:303.237 UTC kw1np-coct0002n [Thread 140517895554816, Pid 45236]: SANDESH: Ratelimit Drop (100 messages/second): BGP [SYS_DEBUG]: XmppPeerTableLog: XMPP Peer kw1np-coct0002n:10.2.128.111 in table default-domain:ArchTest-05:4043ea15-17ef-4fba-ba08-acd80924c107:4043ea15-17ef-4fba-ba08-acd80924c107.inet6.0 in instance default-domain:ArchTest-05:4043ea15-17ef-4fba-ba08-acd80924c107:4043ea15-17ef-4fba-ba08-acd80924c107 : Unregister routing-table for RibInDelete, RibOutDelete, controller/src/bgp/bgp_peer_membership.cc 1056
--------------------------------------------

< kw1np-coct0003n /var/log/contrail/contrail-control.log >
--------------------------------------------
2017-12-21 Thu 17:55:10:321.834 UTC kw1np-coct0003n [Thread 139650406352640, Pid 15291]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Server: Event: HoldTimer Expired peer ip: 10.2.128.111 ( kw1bp-vscp0040n ) controller/src/xmpp/xmpp_state_machine.cc 1299
2017-12-21 Thu 17:55:10:468.395 UTC kw1np-coct0003n [Thread 139650397955840, Pid 15291]: SANDESH: Ratelimit Drop (100 messages/second): BGP [SYS_DEBUG]: XmppPeerTableLog: XMPP Peer kw1np-coct0003n:10.2.128.111 in table default-domain:ArchTest-05:4043ea15-17ef-4fba-ba08-acd80924c107:4043ea15-17ef-4fba-ba08-acd80924c107.inet6.0 in instance default-domain:ArchTest-05:4043ea15-17ef-4fba-ba08-acd80924c107:4043ea15-17ef-4fba-ba08-acd80924c107 : Unregister routing-table for RibInDelete, RibOutDelete, controller/src/bgp/bgp_peer_membership.cc 1056

Looks related to https://bugs.launchpad.net/juniperopenstack/+bug/1715061

Customer needs an RCA.

information type: Proprietary → Public
Revision history for this message
vijaya kumar shankaran (vijayks) wrote :

Hi Team,

can we have an updates on this issue?

Bets Regards,
Vijay Kumar

Jeba Paulaiyan (jebap)
tags: added: config contrail-control
Revision history for this message
Hari Prasad Killi (haripk) wrote :

From the above description, it looks like the vrouter-agent process wasnt responding to anything. Need info as to whether it was responding to introspect requests etc. At this point, we can only hypothesize if the agent process went into a loop or its task scheduling stopped (TBB issue). We need a few gcores of the agent in this situation to validate the state.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.