Event streamer eventually gets in error state consuming 100% of a cpu

Bug #1516820 reported by Mikhail S Medvedev
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
ciwatch
Fix Released
High
Mikhail S Medvedev

Bug Description

After running for some time (between 2 and 10 hours), ci-watch-stream-events
(events.py) gets into a state where it continuously outputs in a quick
succession:

...
2015-11-16 14:04:17,406 [ERROR] Failed json.loads on event:
2015-11-16 14:04:17,406 [ERROR] No JSON object could be decoded
Traceback (most recent call last):
  File "/opt/ciwatch/ciwatch/events.py", line 111, in parse_json_event
    event = json.loads(event)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
...

That means that one of the GerritEventStream.next() returns an empty event.
And because there is no sleep before fetching next event, this results
in a very busy loop. Why does next() return an empty event is unclear, but
could be because paramiko ssh connection did get disconnected.

We might want to borrow, or at least look at, gerrit listener implementation
from zuul, which is relatively stable:
https://github.com/openstack-infra/zuul/blob/52f0ef1a94c1aec4f83ed0e7d2f7ad425e9ad821/zuul/lib/gerrit.py#L26-L81

Revision history for this message
Mikhail S Medvedev (msmedved) wrote :

Was able to easily reproduce by forcefully killing gerrit ssh connection with tcpkill.

Steps:

Start listening to stream events:

root@vagrant:~# ci-watch-stream-events
2015-11-17 00:33:21,603 [DEBUG] Connecting to review.openstack.org:29418 as msmedved using /opt/ciwatch/vagrant_gerrit_rsa

In a separate terminal:
root@vagrant:~# lsof -i |grep review
ci-watch- 14982 root 3u IPv4 158372 0t0 TCP 10.0.2.15:43932->review.openstack.org:29418 (ESTABLISHED)

Note the port, and do
root@vagrant:~# tcpkill -i eth0 port 43932

After the connection is killed, the ci-watch-stream-events would start dumping lots of error log messages, and unable to recover.

Changed in ciwatch:
assignee: nobody → Mikhail S Medvedev (msmedved)
Changed in ciwatch:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ciwatch (master)

Fix proposed to branch: master
Review: https://review.openstack.org/248919

Changed in ciwatch:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ciwatch (master)

Reviewed: https://review.openstack.org/248919
Committed: https://git.openstack.org/cgit/openstack-infra/ciwatch/commit/?id=56e474d7259e5b496f36930caaeb9104bb2f5bb0
Submitter: Jenkins
Branch: master

commit 56e474d7259e5b496f36930caaeb9104bb2f5bb0
Author: Mikhail S Medvedev <email address hidden>
Date: Mon Nov 23 15:26:35 2015 -0600

    Use zuul gerrit event listener implementation

    Current event listener implementation does not handle network problems
    well. E.g. if ssh stream connection is lost, it would not reconnect or
    recover on its own. Instead of fixing the implementation, use
    well-tested gerrit listener used by zuul. Explicitly specify version of
    zuul to be 2.1.0 to avoid accidental breakages due to changes in zuul
    lib.

    Downside is that we need to install zuul and its dependencies just to
    use gerrit listener.

    parse_json_event function had to be changed, because zuul gerrit event
    listener provides object, not json string. We still need to create
    event from json in populate_db.py, so that part of the function has
    been moved there.

    Closes-Bug: #1516820

    Change-Id: I8aa7a18460b58998f6c378e9d9c0d783032eca21

Changed in ciwatch:
status: In Progress → Fix Committed
Revision history for this message
Mikhail S Medvedev (msmedved) wrote :

Have confirmed that the problem has gone away after the fix. Now if the event stream connection is compromised with tcpkill, event listener recovers without any problem, and starts using a new connection.

Changed in ciwatch:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.