NullPointerException from GearmanJobImpl.getHandle

Bug #708153 reported by Omry Yadan on 2011-01-26
This bug affects 1 person
Affects: Gearman Java
Importance: High
Assigned to: Eric Lambert

Bug Description

Sometimes, for no obvious reason, I get this NPE:

java.lang.NullPointerException
    at org.gearman.client.GearmanJobImpl.getHandle(GearmanJobImpl.java:101)
    at org.gearman.client.GearmanClientImpl$JobHandle.<init>(GearmanClientImpl.java:103)
    at org.gearman.client.GearmanClientImpl.handleSessionEvent(GearmanClientImpl.java:411)
    at org.gearman.common.GearmanJobServerSession.handleResSessionEvent(GearmanJobServerSession.java:310)
    at org.gearman.common.GearmanJobServerSession.handleSessionEvent(GearmanJobServerSession.java:241)
    at org.gearman.common.GearmanJobServerSession.driveSessionIO(GearmanJobServerSession.java:204)
    at org.gearman.client.GearmanClientImpl.driveClientIO(GearmanClientImpl.java:570)
    at org.gearman.client.GearmanClientImpl.driveRequestTillState(GearmanClientImpl.java:623)
    at org.gearman.client.GearmanClientImpl.submit(GearmanClientImpl.java:275)

Eric Lambert (elambert) on 2012-05-05
Changed in gearman-java:
milestone: none → 0.05
Eric Lambert (elambert) wrote :

This issue appears to be caused by a mismatch in how the session and the client handle a submit_job that times out. The client clears the job from the jobAwaitingCreation set, but the session still has a task for this submit in its tasksAwaitingAck set. So when the session is reused to submit a new job and that job receives a response from the server, the task for the original submit (which still holds a reference to the original job) handles the response and extracts the job_handle from it. But when the client then attempts to handle the response, it only holds a reference to the newer job (via jobAwatingCreation). Since this newer job has not seen the response from the server, its handle is null, hence the NPE. Put simply, the session is corrupted by the initial timeout.
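
The sequence above can be reproduced in a small model. This is only a sketch: SketchJob, SketchSession and SketchClient are hypothetical stand-ins, not the real gearman-java classes, and the assumption that the oldest pending task claims each job_created is mine, made to mirror the behaviour described.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified stand-ins; not the real gearman-java classes.
class SketchJob {
    final String name;
    byte[] handle; // stays null until a job_created response is applied

    SketchJob(String name) { this.name = name; }

    byte[] getHandle() {
        return handle.clone(); // throws NullPointerException while handle is null
    }
}

class SketchSession {
    // Submits still waiting for job_created, oldest first (assumed ordering).
    final Deque<SketchJob> tasksAwaitingAck = new ArrayDeque<>();

    void submit(SketchJob job) { tasksAwaitingAck.addLast(job); }

    // job_created carries no token, so the oldest pending task claims it.
    SketchJob onJobCreated(byte[] jobHandle) {
        SketchJob oldest = tasksAwaitingAck.pollFirst();
        if (oldest != null) oldest.handle = jobHandle;
        return oldest;
    }
}

class SketchClient {
    final SketchSession session = new SketchSession();
    SketchJob jobAwaitingCreation;

    void submit(SketchJob job) {
        jobAwaitingCreation = job;
        session.submit(job);
    }

    // Client-side timeout: only the client forgets the job; the session does not.
    void onSubmitTimeout() { jobAwaitingCreation = null; }

    // The client reads the handle of the job it is tracking, which may not be
    // the job the session just updated.
    byte[] onJobCreated(byte[] jobHandle) {
        session.onJobCreated(jobHandle);
        return jobAwaitingCreation.getHandle();
    }
}
```

Submitting job A, timing out, submitting job B and then delivering one job_created leaves A with the handle and B without one, so the client's read of B's handle fails with the NPE.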

Changed in gearman-java:
status: New → Confirmed
importance: Undecided → High
milestone: 0.05 → 0.06
assignee: nobody → Eric Lambert (elambert)
Eric Lambert (elambert) wrote :

As described above, the root cause of this issue is that the session gets corrupted by a submit that times out waiting for the server to send back an ack (job_created) message for that submit.

Once a timeout occurs on a particular session, that session should no longer be used to submit new jobs. If and when the server sends back a job_created, there is no way to determine whether that message is a delayed response to the initial submit or a response to the new submit, because the submit* and job_created messages do not carry a token that could be used to reconcile a particular job_created with a specific submit.

But while the session can no longer reliably be used to submit new jobs to the server, it can still successfully handle messages for existing jobs (that is, jobs that were successfully submitted before the timeout), since messages related to these jobs carry a job_handle token that identifies which job a message relates to.

Given that the session is corrupted (at least partially), one solution would be to close the session when a timeout occurs. This approach has the unfortunate side effect of unnecessarily shutting down communication with existing jobs.

Another approach would be to make the session unavailable for new job submissions but leave it available for existing jobs; once all existing jobs have completed (or the connection has dropped), the session will be disposed of. This is the approach I plan to implement to resolve this bug.
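
As a rough illustration of this second approach, here is a sketch of a session that stops accepting submits after a timeout and disposes of itself once drained. DrainingSession and all of its method names are hypothetical and are not taken from gearman-java; the actual fix may be structured quite differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the "drain, then dispose" idea; names are illustrative.
class DrainingSession {
    private final Set<String> activeJobHandles = new HashSet<>();
    private boolean acceptingSubmits = true;
    private boolean closed = false;

    // New submits are allowed only while no submit has timed out.
    boolean canSubmit() { return acceptingSubmits && !closed; }

    // Called once a submit has been acknowledged with a job handle.
    void jobCreated(String jobHandle) {
        if (closed) throw new IllegalStateException("session is closed");
        activeJobHandles.add(jobHandle);
    }

    // A submit timed out: stop taking new submits, keep serving known jobs.
    void submitTimedOut() {
        acceptingSubmits = false;
        closeIfDrained();
    }

    // A previously submitted job has finished.
    void jobFinished(String jobHandle) {
        activeJobHandles.remove(jobHandle);
        closeIfDrained();
    }

    // The connection dropped, so nothing further can arrive for any job.
    void connectionDropped() {
        activeJobHandles.clear();
        closed = true;
    }

    boolean isClosed() { return closed; }

    // Dispose only when submits are disabled and no existing job remains.
    private void closeIfDrained() {
        if (!acceptingSubmits && activeJobHandles.isEmpty()) closed = true;
    }
}
```

A caller would check canSubmit() before reusing a session and open a fresh one when it returns false, while still routing job_handle-tagged messages to the old session until isClosed() is true.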

Also, the timeout value is currently hardcoded. It is conceivable that the hardcoded value is too low for certain environments, in which case this problem could be avoided simply by increasing the timeout. As part of the fix for this bug, it will be possible to set this timeout value.
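
A settable timeout could look roughly like the sketch below. SubmitTimeout is a hypothetical class, and PLACEHOLDER_DEFAULT_MILLIS is an arbitrary value chosen for illustration only; it is not the value hardcoded in the library.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical holder for a configurable submit timeout; not part of gearman-java.
class SubmitTimeout {
    // Arbitrary placeholder for this sketch; not the library's hardcoded value.
    static final long PLACEHOLDER_DEFAULT_MILLIS = 1000L;

    private long timeoutMillis = PLACEHOLDER_DEFAULT_MILLIS;

    void setTimeoutMillis(long millis) {
        if (millis <= 0) throw new IllegalArgumentException("timeout must be positive");
        timeoutMillis = millis;
    }

    long getTimeoutMillis() { return timeoutMillis; }

    // True once more than timeoutMillis separates two System.nanoTime() readings.
    boolean hasExpired(long startNanos, long nowNanos) {
        return nowNanos - startNanos > TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }
}
```

The submit loop would record System.nanoTime() before sending and poll hasExpired while waiting for job_created, so slower environments can raise the limit instead of hitting the timeout path.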
