gear.Server locking when processing socket input

Bug #1381565 reported by Antoine "hashar" Musso
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Gear
Invalid
Undecided
Unassigned

Bug Description

The Wikimedia Zuul setup stop triggering gearman functions entirely because it can not connect to the embedded Gearman daemon on port 4370.

== symptoms ==

- Zuul status page shows triggered jobs that never finish
- New changes shows up but their jobs are stuck to 'queued' status
- connecting to Zuul embed Gearman server on TCP port 4370 results in a TCP connection timeout

== hypothesis ==

The gear.Server poll sockets opened by the Jenkins German plugin for each executors. It seems sometime an input is received on the socket which triggers poll processing, but there no data can be read from the buffer.

Reading from the buffer is done the sequence:

 gear.Connection.readPacket()
 gear.Connection._readRawBytes(1)
 while True:
    socket.conn.recv(1)

From there the whole Gearmand is unresponsive to new connection, I believe the socket is blocking and recv() idle waiting for a byte to be received.

== fix ==

1) unlock

- In Jenkins configuration, disable the Gearman plugin
- The Gearman plugins close all opened sockets. I believe that cause socket.conn.recv(1) to abort resuming the Gear server work
- connection to Gearman server on TCP port 4370 works again
- gear admin command 'status' shows all the Gearman function have 0 available workers

2) reregister

- In Jenkins configuration, enable the Gearman plugin
- Jobs are reregistered again
- Use zuul-promote on the first change of each queue to resume the processing of the whole queue

== Some traces ==

I have a python logger for gear.Server set at debug level which is handy. Today incident started around 11:53, a relevant annotated log:

2014-10-15 11:53:31,788 DEBUG gear.Server: Polling 81 connections
2014-10-15 11:53:31,788 DEBUG gear.Server: Processing input on <gear.ServerConnection 0x7f18642b1650 name: lanthanum_exec-3 host: 127.0.0.1 port: 10168>
2014-10-15 11:53:31,789 INFO gear.Server: Received packet from <gear.ServerConnection 0x7f18642b1650 name: lanthanum_exec-3 host: 127.0.0.1 port: 10168>: <gear.Packet 0x7f18653a3fd0 type: CAN_DO function: build:mwext-PostEdit-phpcs-HEAD:hasSlaveScripts>
2014-10-15 11:53:31,789 DEBUG gear.Server: Adding function build:mwext-PostEdit-phpcs-HEAD:hasSlaveScripts to <gear.ServerConnection 0x7f18642b1650 name: lanthanum_exec-3 host: 127.0.0.1 port: 10168>

And following up:

2014-10-15 11:53:31,789 DEBUG gear.Server: Processing input on <gear.ServerConnection 0x7f18642b1150 name: lanthanum_exec-5 host: 127.0.0.1 port: 10169>

Note there is no INFO message stating it 'Received packet'. Seems to me the conn.recv(1) is idling on the socket.

2014-10-15 11:54:32,896 DEBUG gear.Server: Accepting new connection
2014-10-15 11:54:32,896 INFO gear.Server: Accepted connection <gear.ServerConnection 0x7f18653af090 name: None host: 127.0.0.1 port: 20414>
2014-10-15 11:54:38,522 DEBUG gear.Server: Accepting new connection
... more new connections
Polling resume with twice the number of connections :-/

I eventually disable the Gearman plugin in Jenkins. That probably closes all sockets and I ended up with:

2014-10-15 12:38:11,565 DEBUG gear.Server: Received no data on <gear.ServerConnection 0x7f18642b1150 name: lanthanum_exec-5 host: 127.0.0.1 port: 10169>
2014-10-15 12:38:11,566 INFO gear.Server: Marking <gear.ServerConnection 0x7f18642b1150 name: lanthanum_exec-5 host: 127.0.0.1 port: 10169> as disconnected

Then several:
2014-10-15 12:38:11,567 DEBUG gear.Server: Preparing to poll
2014-10-15 12:38:11,567 DEBUG gear.Server: Polling 142 connections
2014-10-15 12:38:11,567 DEBUG gear.Server: Woken by pipe

From there gear.Server send GET_STATUS request. Since the sockets have been closed by the Gearman plugin that ends up marking the related gear.ServerConnection as disconnected until there is only 4 connections left (I guess the single Zuul merged and some set description job).

Reenabling the Gearman plugin resume operations.

== Context ==

- Zuul @ master
- Embedded Zuul gearman server (i.e. based on python-gear)
- Jenkins 1.565.3 (previous LTS versions were affected anyway)
- Gearman plugin 0.7.0 as well as ongoing patch "update to work with Jenkins LTS ver 1.565.3
" https://review.openstack.org/#/c/125755/

== Next? ==

Can even.POLLIN be triggered without an actual input waiting on the socket? If so since we while True: conn.recv(1) , that would explain the lock and that corner case need to be handled.

Could it be some weird issue in which the buffer is discarded between the POLLIN event and the recv(1) call. I am wondering whether setting a timeout on the socket would solve it by raising an error after attempting to receive.

Revision history for this message
Antoine "hashar" Musso (hashar) wrote :

Two things came to mind:

maybe POLLIN is triggered because the socket is closed / corrupted and socket.conn.recv(1) ends up waiting for nothing.

an EOF is sent which would trigger the POLLIN but cause recv() to still look for 1 bytes when EOF is none

Revision history for this message
Antoine "hashar" Musso (hashar) wrote :

After disconnecting/reconnecting I noticed in Zuul log a stacktrace (gear 0.5.5):

2014-10-31 17:14:18,757 ERROR gear.Server: Exception in poll loop:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 762, in _doPollLoop
    self._pollLoop()
  File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 802, in _pollLoop
    self.handlePacket(p)
  File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 830, in handlePacket
    self.handleWorkComplete(packet)
  File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 2748, in handleWorkComplete
    self.handlePassthrough(packet, True)
  File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 2775, in handlePassthrough
    raise UnknownJobError()
UnknownJobError

Revision history for this message
Antoine "hashar" Musso (hashar) wrote :

I think this one is solved nowadays. At least I haven't encountered it in ages.

Changed in gear:
status: New → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.