gear.Server locking when processing socket input
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Gear |
Invalid
|
Undecided
|
Unassigned |
Bug Description
The Wikimedia Zuul setup stop triggering gearman functions entirely because it can not connect to the embedded Gearman daemon on port 4370.
== symptoms ==
- Zuul status page shows triggered jobs that never finish
- New changes shows up but their jobs are stuck to 'queued' status
- connecting to Zuul embed Gearman server on TCP port 4370 results in a TCP connection timeout
== hypothesis ==
The gear.Server poll sockets opened by the Jenkins German plugin for each executors. It seems sometime an input is received on the socket which triggers poll processing, but there no data can be read from the buffer.
Reading from the buffer is done the sequence:
gear.Connectio
gear.Connectio
while True:
socket.
From there the whole Gearmand is unresponsive to new connection, I believe the socket is blocking and recv() idle waiting for a byte to be received.
== fix ==
1) unlock
- In Jenkins configuration, disable the Gearman plugin
- The Gearman plugins close all opened sockets. I believe that cause socket.conn.recv(1) to abort resuming the Gear server work
- connection to Gearman server on TCP port 4370 works again
- gear admin command 'status' shows all the Gearman function have 0 available workers
2) reregister
- In Jenkins configuration, enable the Gearman plugin
- Jobs are reregistered again
- Use zuul-promote on the first change of each queue to resume the processing of the whole queue
== Some traces ==
I have a python logger for gear.Server set at debug level which is handy. Today incident started around 11:53, a relevant annotated log:
2014-10-15 11:53:31,788 DEBUG gear.Server: Polling 81 connections
2014-10-15 11:53:31,788 DEBUG gear.Server: Processing input on <gear.ServerCon
2014-10-15 11:53:31,789 INFO gear.Server: Received packet from <gear.ServerCon
2014-10-15 11:53:31,789 DEBUG gear.Server: Adding function build:mwext-
And following up:
2014-10-15 11:53:31,789 DEBUG gear.Server: Processing input on <gear.ServerCon
Note there is no INFO message stating it 'Received packet'. Seems to me the conn.recv(1) is idling on the socket.
2014-10-15 11:54:32,896 DEBUG gear.Server: Accepting new connection
2014-10-15 11:54:32,896 INFO gear.Server: Accepted connection <gear.ServerCon
2014-10-15 11:54:38,522 DEBUG gear.Server: Accepting new connection
... more new connections
Polling resume with twice the number of connections :-/
I eventually disable the Gearman plugin in Jenkins. That probably closes all sockets and I ended up with:
2014-10-15 12:38:11,565 DEBUG gear.Server: Received no data on <gear.ServerCon
2014-10-15 12:38:11,566 INFO gear.Server: Marking <gear.ServerCon
Then several:
2014-10-15 12:38:11,567 DEBUG gear.Server: Preparing to poll
2014-10-15 12:38:11,567 DEBUG gear.Server: Polling 142 connections
2014-10-15 12:38:11,567 DEBUG gear.Server: Woken by pipe
From there gear.Server send GET_STATUS request. Since the sockets have been closed by the Gearman plugin that ends up marking the related gear.ServerConn
Reenabling the Gearman plugin resume operations.
== Context ==
- Zuul @ master
- Embedded Zuul gearman server (i.e. based on python-gear)
- Jenkins 1.565.3 (previous LTS versions were affected anyway)
- Gearman plugin 0.7.0 as well as ongoing patch "update to work with Jenkins LTS ver 1.565.3
" https:/
== Next? ==
Can even.POLLIN be triggered without an actual input waiting on the socket? If so since we while True: conn.recv(1) , that would explain the lock and that corner case need to be handled.
Could it be some weird issue in which the buffer is discarded between the POLLIN event and the recv(1) call. I am wondering whether setting a timeout on the socket would solve it by raising an error after attempting to receive.
Two things came to mind:
maybe POLLIN is triggered because the socket is closed / corrupted and socket.conn.recv(1) ends up waiting for nothing.
an EOF is sent which would trigger the POLLIN but cause recv() to still look for 1 bytes when EOF is none