vrouter-nodemgr stopped receiving process events

Bug #1461119 reported by Hari Prasad Killi
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
R2.20
Fix Committed
Medium
Sundaresan Rajangam
Trunk
Fix Committed
Medium
Sundaresan Rajangam

Bug Description

Sometime, vrouter-nodemgr stopped receiving process events. It looked like the communication between supervisor and nodemgr went down. When it was in this state, there was no update to the contrail-vrouter-nodemgr-stderr.log file, even when vrouter-agent was stopped and restarted.

It required a restart of supervisor-vrouter to get this back to normal state. Not clear what the trigger to enter this state is.

Tags: analytics
Revision history for this message
Hari Prasad Killi (haripk) wrote :
Changed in juniperopenstack:
milestone: none → r2.30-fcs
importance: Undecided → Medium
Raj Reddy (rajreddy)
Changed in juniperopenstack:
assignee: nobody → c mishra (cdmishra)
c mishra (cdmishra)
Changed in juniperopenstack:
status: New → In Progress
Revision history for this message
c mishra (cdmishra) wrote :
Download full text (5.3 KiB)

Email thread
=============

Begin forwarded message:

From: Hari Prasad Killi <email address hidden>
Subject: Re: Regarding bug #1461119 (https://bugs.launchpad.net/juniperopenstack/r2.20/+bug/1461119)
Date: June 6, 2015 at 9:48:43 PM PDT
To: Chandan Mishra <email address hidden>
Cc: Megh Bhatt <email address hidden>, Raj Reddy <email address hidden>, Sundaresan Rajangam <email address hidden>

Ni Chandan, I was referring to the following:

Traceback (most recent call last):
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 86, in run
    self.finish_response()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 128, in finish_response
    self.write(data)
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 212, in write
    self.send_headers()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 270, in send_headers
    self.send_preamble()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 194, in send_preamble
    'Date: %s\r\n' % format_date_time(time.time())
  File "/usr/lib/python2.7/socket.py", line 324, in write
    self.flush()
  File "/usr/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
    data_sent += self.send(_get_memory(data, data_sent), flags)
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
    return sock.send(data, flags)
error: [Errno 32] Broken pipe
127.0.0.1 - - [02/Jun/2015 17:11:06] "GET /Snh_SandeshUVECacheReq?x=NodeStatus HTTP/1.1" 500 59
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 651, in __init__
    self.finish()
  File "/usr/lib/python2.7/SocketServer.py", line 710, in finish
    self.wfile.close()
  File "/usr/lib/python2.7/socket.py", line 279, in close
    self.flush()
  File "/usr/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
    data_sent += self.send(_get_memory(data, data_sent), flags)
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
    return sock.send(data, flags)
error: [Errno 32] Broken pipe

Regards,
Hari

From: Chandan Mishra <email address hidden>
Date: Sunday, June 7, 2015 7:33 AM
To: Hari <email address hidden>
Cc: Megh Bhatt <email address hidden>, Raj Reddy <email address hidden>, Sundaresan Rajangam <email address hidden>
Subject: Re: Regarding bug #1461119 (https://bugs.launchpad.net/juniperopenstack/r2.20/+bug/1461119)

Are you referring to:

"Exception AssertionError: AssertionError() in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored”?
On Jun 5, 2015, at 8:13 PM, Hari Prasad Killi <<email address hidden>...

Read more...

Revision history for this message
c mishra (cdmishra) wrote :

My guess is that this bug is happening because nodemgr builds a backlog of events when bombarded with lot of events in a short span of time. Root cause of that is nodemgr's processing of events at the rate of one event per sec (even though more events may be queued).

Revision history for this message
Sundaresan Rajangam (srajanga) wrote :
Download full text (3.2 KiB)

I see the following error in contrail-vrouter-nodemgr-stdout.log

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 46438)
----------------------------------------
RESULT 2
OKREADY

since supervisord didn't expect/recognize the above message in stdout, it stopped sending further notifications to the event listener (nodemgr).

This exception seem to have occurred when the client closed the socket connection before the response is sent for the introspect request. I have pasted the traceback of this exception from contrail-vrouter-nodemgr-stderr.log.

Traceback (most recent call last):
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 86, in run
    self.finish_response()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 128, in finish_response
    self.write(data)
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 212, in write
    self.send_headers()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 270, in send_headers
    self.send_preamble()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 194, in send_preamble
    'Date: %s\r\n' % format_date_time(time.time())
  File "/usr/lib/python2.7/socket.py", line 324, in write
    self.flush()
  File "/usr/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
    data_sent += self.send(_get_memory(data, data_sent), flags)
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
    return sock.send(data, flags)
error: [Errno 32] Broken pipe
127.0.0.1 - - [02/Jun/2015 17:11:06] "GET /Snh_SandeshUVECacheReq?x=NodeStatus HTTP/1.1" 500 59
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 651, in __init__
    self.finish()
  File "/usr/lib/python2.7/SocketServer.py", line 710, in finish
    self.wfile.close()
  File "/usr/lib/python2.7/socket.py", line 279, in close
    self.flush()
  File "/usr/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 458, in sendall
    data_sent += self.send(_get_memory(data, data_sent), flags)
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 435, in send
    return sock.send(data, flags)
error: [Errno 32] Broken pipe

The above trace is printed by handle_error() in SocketServer.py

I could recreate the issue with less timeout value (say 0.001) in contrail-status

Fix:
-----

we use wsgiref module for introspect connection (pysandesh/sandesh_http.py)
WSGIServer is derived from BaseHTTPServer.HTTPServer which is SocketServer.TCPServer
To alleviate this issue, we need to override handl...

Read more...

c mishra (cdmishra)
Changed in juniperopenstack:
status: In Progress → Confirmed
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/11656
Submitter: c mishra (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11656
Committed: http://github.org/Juniper/contrail-controller/commit/af1bb5e08d23570af791edef1c7ffe39ddddd1a0
Submitter: Zuul
Branch: master

commit af1bb5e08d23570af791edef1c7ffe39ddddd1a0
Author: Chandan Mishra <email address hidden>
Date: Mon Jun 15 18:05:49 2015 -0700

This commit makes sure that nodemgr handles all pending events
before going to sleep.
Current nodemgr design is such that it sleeps for 1 second between
processing of each event. This fix takes care of that.
We removed all the gevent sleeps from the nodemgr code. Instead
nodemgr just waits in the select call.

Overall it will make processing of events by nodemgr more efficient.

Change-Id: I0566d605b0b83d01b0965c358f786d49a9104ae9
Closes-Bug: 1461119

Revision history for this message
c mishra (cdmishra) wrote :

Following fix is pending

Fix:
-----

we use wsgiref module for introspect connection (pysandesh/sandesh_http.py)
WSGIServer is derived from BaseHTTPServer.HTTPServer which is SocketServer.TCPServer
To alleviate this issue, we need to override handle_error() function (refer class BaseServer in SocketServer.py)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/12091
Submitter: c mishra (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/13205
Submitter: Sundaresan Rajangam (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22-dev

Review in progress for https://review.opencontrail.org/13209
Submitter: Sundaresan Rajangam (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/13205
Committed: http://github.org/Juniper/contrail-sandesh/commit/0a6de0758cc55884b52912ec729148396e25c9d0
Submitter: Zuul
Branch: master

commit 0a6de0758cc55884b52912ec729148396e25c9d0
Author: Sundaresan Rajangam <email address hidden>
Date: Thu Aug 20 00:22:30 2015 -0700

Patch handle_error() method in WSGIServer class

handle_error() method in BaseServer class spews exception message
in the stdout that interferes with the functioning of the supervisor.

def handle_error(self, request, client_address):
"""Handle an error gracefully. May be overridden.
The default is to print a traceback and continue.
"""
print '-'*40
print 'Exception happened during processing of request from',
print client_address
import traceback
traceback.print_exc() # XXX But this goes to stderr!
print '-'*40

In this case, the exception message was logged when the client closed
the socket connection before the response was sent.
Since the supervisor couldn't recognize this exception log, it stopped
sending futher notification to the nodemgr (event-listener).

Change-Id: Idf31eaf976720b7f991726fc73121ef518c42ccd
Closes-Bug: #1461119

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/13209
Committed: http://github.org/Juniper/contrail-sandesh/commit/6e4a88d1259fb0671db00a9b9df03cc67b0afdf6
Submitter: Zuul
Branch: R2.22-dev

commit 6e4a88d1259fb0671db00a9b9df03cc67b0afdf6
Author: Sundaresan Rajangam <email address hidden>
Date: Thu Aug 20 00:22:30 2015 -0700

Patch handle_error() method in WSGIServer class

handle_error() method in BaseServer class spews exception message
in the stdout that interferes with the functioning of the supervisor.

def handle_error(self, request, client_address):
"""Handle an error gracefully. May be overridden.
The default is to print a traceback and continue.
"""
print '-'*40
print 'Exception happened during processing of request from',
print client_address
import traceback
traceback.print_exc() # XXX But this goes to stderr!
print '-'*40

In this case, the exception message was logged when the client closed
the socket connection before the response was sent.
Since the supervisor couldn't recognize this exception log, it stopped
sending futher notification to the nodemgr (event-listener).

Change-Id: Idf31eaf976720b7f991726fc73121ef518c42ccd
Closes-Bug: #1461119
(cherry picked from commit 0a6de0758cc55884b52912ec729148396e25c9d0)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/15089
Submitter: Sundaresan Rajangam (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/15089
Committed: http://github.org/Juniper/contrail-sandesh/commit/6e4fcf1e16a7b19ad8ed360ee39c71b1706a91b3
Submitter: Zuul
Branch: R2.20

commit 6e4fcf1e16a7b19ad8ed360ee39c71b1706a91b3
Author: Sundaresan Rajangam <email address hidden>
Date: Thu Aug 20 00:22:30 2015 -0700

Patch handle_error() method in WSGIServer class

handle_error() method in BaseServer class spews exception message
in the stdout that interferes with the functioning of the supervisor.

def handle_error(self, request, client_address):
"""Handle an error gracefully. May be overridden.
The default is to print a traceback and continue.
"""
print '-'*40
print 'Exception happened during processing of request from',
print client_address
import traceback
traceback.print_exc() # XXX But this goes to stderr!
print '-'*40

In this case, the exception message was logged when the client closed
the socket connection before the response was sent.
Since the supervisor couldn't recognize this exception log, it stopped
sending futher notification to the nodemgr (event-listener).

Change-Id: Idf31eaf976720b7f991726fc73121ef518c42ccd
Closes-Bug: #1461119
(cherry picked from commit 0a6de0758cc55884b52912ec729148396e25c9d0)
(cherry picked from commit 6e4a88d1259fb0671db00a9b9df03cc67b0afdf6)

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.