contrail-vrouter-nodemgr fails with Exited status after deleting tor-agent

Bug #1566123 reported by kalagesan
This bug affects 1 person
Affects             Status           Importance  Assigned to    Milestone
Juniper Openstack   Status tracked in Trunk
  R2.20             Fix Committed    Medium      Nikhil Bansal
  R2.21.x           Fix Committed    Medium      Nikhil Bansal
  R2.22.x           Fix Committed    Medium      Nikhil Bansal
  R3.0              Fix Committed    Medium      Nikhil Bansal
  Trunk             Fix Committed    Medium      Nikhil Bansal

Bug Description

contrail-vrouter-nodemgr fails after a tor-agent is deleted.

The customer tested adding and deleting tor-agents by running the add_tor_agent_by_id and delete_tor_agent_by_id fabric tasks.

After deleting a tor-agent, they found that the contrail-vrouter-nodemgr service
on a TSN had failed with "EXITED" status.
--------------------------------------------
root@openc-14:~# contrail-status
== Contrail vRouter ==
supervisor-vrouter: active
contrail-tor-agent-1 initializing (ToR:QFX1 connection down)
~snip~
contrail-tor-agent-10057 active
contrail-tor-agent-10058 active
contrail-tor-agent-10121 initializing (ToR:tor0121 connection down)
contrail-tor-agent-11 initializing (ToR:QFX11 connection down)
contrail-tor-agent-4 active
contrail-tor-agent-6 active
contrail-vrouter-agent active
contrail-vrouter-nodemgr EXITED
--------------------------------------------

The following traceback was logged in contrail-vrouter-nodemgr-stderr.log.
--------------------------------------------
process:contrail-tor-agent-10059,groupname:contrail-tor-agent-10059,eventname:PROCESS_STATE_STOPPING
wokeup and found a line
process:contrail-tor-agent-10059,groupname:contrail-tor-agent-10059,eventname:PROCESS_STATE_STOPPED
Sending UVE:NodeStatusUVE(_context='', _scope='', _category='', _send_queue_enabled=True, _seqnum=0, _versionsig=2778367443, _source='openc-14', _instance_id='0', _client=None, _type=6, _hints=1, _http_server=None, _logger=None, _more=False, _node_type='Compute', data=NodeStatus(status=None, name='openc-14-10059', deleted=False, disk_usage_info=None, process_status=None, all_core_file_list=['core.contrail-tor-ag.10636.openc-14.1455020583',...'], _table='ObjectVRouter', process_info=[ProcessInfo(process_name='contrail-tor-agent-10059', process_state='PROCESS_STATE_STOPPED', last_stop_time='1459128985206855', start_count=2, core_file_list=[], last_start_time='1459128470895420', stop_count=2, last_exit_time='', exit_count=0)], description=None), _module='contrail-vrouter-nodemgr', _level=2147483647, _timestamp=1459128985208537, _client_context='', _connect_to_collector=True, _role=0)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/nodemgr/vrouter_event_manager.py", line 142, in runforever
    self.update_current_process()
  File "/usr/lib/python2.7/dist-packages/nodemgr/event_manager.py", line 90, in update_current_process
    process_state_db = self.get_current_process()
  File "/usr/lib/python2.7/dist-packages/nodemgr/event_manager.py", line 72, in get_current_process
    process_stat_ent = self.get_process_stat_object(proc_name)
  File "/usr/lib/python2.7/dist-packages/nodemgr/vrouter_event_manager.py", line 110, in get_process_stat_object
    return VrouterProcessStat(pname)
  File "/usr/lib/python2.7/dist-packages/nodemgr/vrouter_process_stat.py", line 17, in __init__
    (self.group, self.name) = self.get_vrouter_process_info(pname)
  File "/usr/lib/python2.7/dist-packages/nodemgr/vrouter_process_stat.py", line 27, in get_vrouter_process_info
    for line in open(filename)))
IOError: [Errno 2] No such file or directory: '/etc/contrail/supervisord_vrouter_files/contrail-tor-agent-10059.ini'
<Greenlet at 0x7f3a335e0730: <bound method VrouterEventManager.runforever of <nodemgr.vrouter_event_manager.VrouterEventManager object at 0x7f3a340e8610>>> failed with IOError
--------------------------------------------

The customer believes the cause of this issue is a missing check or missing error handling in the vrouter_process_stat.py module.

The module periodically opens the .ini file of each tor-agent to check the process status. Depending on timing, however, the file may already have been deleted by the fabric task, in which case opening it raises an IOError.
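The failure mode described above can be illustrated with a minimal sketch. It is written for Python 3 (the traceback comes from Python 2.7) and is not the code in vrouter_process_stat.py: the function names `read_program_name` and `list_programs` are hypothetical, and the only assumption about the files is the standard supervisor `[program:<name>]` section header. The point is that a file listed a moment ago may be gone by the time it is opened, so a missing file is treated as "this process was removed" rather than as a fatal error.

```python
import configparser
import errno
import os


def read_program_name(ini_path):
    """Return the program name from a supervisor-style .ini file.

    Returns None if the file vanished before it could be opened
    (hypothetical helper, not part of nodemgr).
    """
    try:
        with open(ini_path) as f:
            text = "\n".join(line.strip() for line in f)
    except (IOError, OSError) as exc:
        if exc.errno == errno.ENOENT:
            # The file was deleted between listing and opening.
            return None
        raise  # Any other I/O error is still unexpected.

    # interpolation=None so '%' in command lines is not interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    for section in parser.sections():
        kind, _, name = section.partition(":")
        if kind == "program" and name:
            return name
    return None


def list_programs(ini_dir):
    """Collect program names, skipping .ini files that disappear mid-scan."""
    names = []
    for entry in sorted(os.listdir(ini_dir)):
        if entry.endswith(".ini"):
            name = read_program_name(os.path.join(ini_dir, entry))
            if name is not None:
                names.append(name)
    return names
```

With this shape, a tor-agent whose .ini file is removed during the scan simply drops out of the result instead of terminating the polling loop.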

Restarting the supervisor-vrouter service is necessary to recover the node status. Issue timestamp: Mar 25 11:30:00

Restarting only contrail-vrouter-nodemgr did not solve this, because process_info in the NodeStatus UVE was not recovered by the restart, and the status of the vrouter and tor-agents in the Web GUI remained "Process States unavailable". The customer tested this with Contrail 2.21.2-28.

Logs are uploaded as attachments.

Steps to Reproduce

1. Execute the following script on the build server.
----------------------------------------------------
#!/bin/bash
cd /opt/contrail/utils
while [ 1 ]; do
   fab add_tor_agent_by_id:19,root@10.194.20.166
   fab delete_tor_agent_by_id:19,root@10.194.20.166
done
----------------------------------------------------

2. Then watch contrail-vrouter-nodemgr-stderr.log.
   If the problem occurs, the same traceback appears in it.
   In the customer's test environment, it usually occurs within 3 hours.
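Since the reproduction can take hours, checking the log by eye is tedious. The following is a small sketch (Python 3) of scanning log text for the IOError signature shown in the traceback above. The function name `find_missing_ini_errors` is hypothetical, and the pattern is derived only from the single error line quoted in this report, so it may need adjusting for other environments.

```python
import re

# Pattern built from the IOError line quoted in this report.
SIGNATURE = re.compile(
    r"IOError: \[Errno 2\] No such file or directory: "
    r"'(?P<path>[^']*supervisord_vrouter_files/[^']+\.ini)'"
)


def find_missing_ini_errors(log_text):
    """Return the .ini paths named in matching IOError lines, in order."""
    return [m.group("path") for m in SIGNATURE.finditer(log_text)]
```

For example, the function could be applied to the contents of contrail-vrouter-nodemgr-stderr.log read with `open(...).read()`; a non-empty result indicates the problem has reproduced.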

Tags: analytics
Changed in juniperopenstack:
assignee: nobody → Raj Reddy (rajreddy)
tags: added: analytics
Raj Reddy (rajreddy)
Changed in juniperopenstack:
importance: Undecided → Medium
assignee: Raj Reddy (rajreddy) → Nikhil Bansal (nikhilb-u)
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/19577
Submitter: Nikhil Bansal (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/19853
Submitter: Nikhil Bansal (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/19853
Committed: http://github.org/Juniper/contrail-controller/commit/f641b1166ce847baa56748045399a386b9f965c4
Submitter: Zuul
Branch: master

commit f641b1166ce847baa56748045399a386b9f965c4
Author: Nikhil B <email address hidden>
Date: Tue May 3 21:52:49 2016 +0530

Checking for error in file open

There was a gap between checking for file existence and reading it. The file
could get deleted during that time. Added check for such cases so that file
deletion can be handled
Closes-Bug: 1566123

Change-Id: If9825744a8b89d74e96315c1d6982cee9efe256c
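
The gap the commit message describes is a check-then-use race. The sketch below (Python 3) is not the merged patch; it only illustrates the idea under stated assumptions. Both function names are hypothetical, and the `between` callback is an artificial hook that stands in for the fabric task deleting the file at the worst possible moment.

```python
import errno
import os


def read_check_then_open(path, between=lambda: None):
    """Existence check followed by open: fails if the file vanishes in between."""
    if not os.path.exists(path):
        return None
    between()  # stand-in for another process deleting the file here
    with open(path) as f:
        return f.read()


def read_tolerant(path, between=lambda: None):
    """Same check, but a file deleted before the open is handled, not fatal."""
    if not os.path.exists(path):
        return None
    between()  # stand-in for another process deleting the file here
    try:
        with open(path) as f:
            return f.read()
    except (IOError, OSError) as exc:
        if exc.errno == errno.ENOENT:
            return None  # deleted after the check: treat as absent
        raise
```

Calling `read_check_then_open(p, between=lambda: os.remove(p))` on an existing file raises the "No such file or directory" error, while `read_tolerant` with the same arguments returns `None`. The existence check alone cannot close the window, so the open itself has to tolerate the missing file.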

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.0

Review in progress for https://review.opencontrail.org/20052
Submitter: Nikhil Bansal (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22.x

Review in progress for https://review.opencontrail.org/20053
Submitter: Nikhil Bansal (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/20054
Submitter: Nikhil Bansal (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/20083
Submitter: Nikhil Bansal (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/20053
Committed: http://github.org/Juniper/contrail-controller/commit/d2577c8addd13ff304d9b1ca0e9778888d94697b
Submitter: Zuul
Branch: R2.22.x

commit d2577c8addd13ff304d9b1ca0e9778888d94697b
Author: Nikhil B <email address hidden>
Date: Tue May 3 21:52:49 2016 +0530

Checking for error in file open

There was a gap between checking for file existence and reading it. The file
could get deleted during that time. Added check for such cases so that file
deletion can be handled
Closes-Bug: 1566123

Change-Id: If9825744a8b89d74e96315c1d6982cee9efe256c
(cherry picked from commit f641b1166ce847baa56748045399a386b9f965c4)

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/20083
Committed: http://github.org/Juniper/contrail-controller/commit/b5f4e5c764cd8ac66779c5c1b6bb0088262c8497
Submitter: Zuul
Branch: R2.20

commit b5f4e5c764cd8ac66779c5c1b6bb0088262c8497
Author: Nikhil B <email address hidden>
Date: Tue May 3 21:52:49 2016 +0530

Checking for error in file open

There was a gap between checking for file existence and reading it. The file
could get deleted during that time. Added check for such cases so that file
deletion can be handled
Closes-Bug: 1566123

Change-Id: If9825744a8b89d74e96315c1d6982cee9efe256c
(cherry picked from commit f641b1166ce847baa56748045399a386b9f965c4)

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/20052
Committed: http://github.org/Juniper/contrail-controller/commit/a447b5d66cee1a941a7712c1c62a9fbdeea1a264
Submitter: Zuul
Branch: R3.0

commit a447b5d66cee1a941a7712c1c62a9fbdeea1a264
Author: Nikhil B <email address hidden>
Date: Tue May 3 21:52:49 2016 +0530

Checking for error in file open

There was a gap between checking for file existence and reading it. The file
could get deleted during that time. Added check for such cases so that file
deletion can be handled
Closes-Bug: 1566123

Change-Id: If9825744a8b89d74e96315c1d6982cee9efe256c
(cherry picked from commit f641b1166ce847baa56748045399a386b9f965c4)

information type: Proprietary → Public
kalagesan (kalagesan) wrote :

log & testbed file are uploaded in root@10.204.74.250 pwd:netscreen

file upload path: /root/kannan/1566123

Regards,
Kannan

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/20054
Committed: http://github.org/Juniper/contrail-controller/commit/c17d724f1f8a7f991132d49ae69b64eeaf88c381
Submitter: Zuul
Branch: R2.21.x

commit c17d724f1f8a7f991132d49ae69b64eeaf88c381
Author: Nikhil B <email address hidden>
Date: Tue May 3 21:52:49 2016 +0530

Checking for error in file open

There was a gap between checking for file existence and reading it. The file
could get deleted during that time. Added check for such cases so that file
deletion can be handled
Closes-Bug: 1566123

Change-Id: If9825744a8b89d74e96315c1d6982cee9efe256c
(cherry picked from commit f641b1166ce847baa56748045399a386b9f965c4)
