Job reduction reduces against jobs being worked on.

Bug #1054378 reported by Eskil Heyn Olsen on 2012-09-21
Affects: gearmand · Assigned to: Brian Aker (brianaker)

Bug Description

Not necessarily a bug, but a semantically questionable detail.

Job reduction (using the unique id) checks a new submission's unique against existing jobs. However, if the matching job is actively being worked on, the new submission still gets reduced against it.

Depending on the worker semantics, that can be troublesome. Example:
   * We transfer DB primary-key as job parameters to a worker that reads the row and updates a cache.
   * We set unique='-'
   * If the worker has gotten the job, read the data, but not written to the cache yet...
   * ... and a table update happens, causing a new job to be issued with the same parameters.
   * The second job gets reduced because its unique matches.
   * However, the cache now contains stale data.

We could set a unique that's a hash of the data, so it'll be different, but to compute that hash would be very expensive and cause user-facing slowdowns. Which is exactly why we use gearmand to asynchronously do the heavy lifting.

Instead I propose this change, which optionally lets you run gearmand in a mode where job reduction is only done against jobs in the queue, and NOT against ones already handed to workers.
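To make the difference concrete, here is a minimal sketch of the two behaviours. This is a Python toy model with hypothetical names, not gearmand's actual C++ internals: `reduce_running=True` models the current behaviour, `reduce_running=False` the proposed mode.

```python
class JobServer:
    """Toy model of unique-based job reduction (coalescing)."""

    def __init__(self, reduce_running=True):
        self.queue = []        # uniques waiting for a worker
        self.running = set()   # uniques currently being worked on
        self.reduce_running = reduce_running

    def submit(self, unique):
        if unique in self.queue:
            return "reduced"   # coalesced into a still-queued job: safe
        if self.reduce_running and unique in self.running:
            return "reduced"   # coalesced into an in-flight job: stale-cache risk
        self.queue.append(unique)
        return "queued"

    def grab_job(self):
        # A worker takes the oldest queued job.
        unique = self.queue.pop(0)
        self.running.add(unique)
        return unique
```

With the default mode, resubmitting "row:42" while a worker holds it returns "reduced" and the cache refresh is lost; in the proposed mode the resubmission is queued again.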

Patch attached and at

Eskil Heyn Olsen (eskil) wrote :
Brian Aker (brianaker) wrote :

One quick note: are you aware that '-' has a special meaning as a unique?

Changed in gearmand:
assignee: nobody → Brian Aker (brianaker)
Brian Aker (brianaker) wrote :

Ok, I got a chance to walk through what you are doing, and you are aware and understand its meaning (I know that feature is poorly documented).

How do you handle the race condition where a slow worker updates the cache entry out of order?

Changed in gearmand:
importance: Undecided → Wishlist
Eskil Heyn Olsen (eskil) wrote :

Hi Brian,

Yes, we're aware of the '-' id. I had an earlier patch that added a '+' id to accomplish what I wanted: it acted like '-', but did not reduce if a job was already being worked on. I dropped that because we now have a rising need to put the time the job was submitted into the job arguments (long story short: we need to deal with DB replication across datacenters and want to pause workers until replication has caught up). So even "identical" jobs will always have different timestamps, but can still reduce on the remaining arguments.

We deal with race conditions in the worker code. Workers read the current generation of the cached data. When a worker writes, it does an atomic test-and-set to bump the generation by one. If the test-and-set fails, the entire job is redone.
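That generation scheme can be sketched as follows. This is a Python illustration with hypothetical names (the real cache is an external store, not an in-process object); it only shows the test-and-set shape being described.

```python
import threading

class GenerationCache:
    """Cache entry guarded by a generation counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self.generation = 0
        self.value = None

    def read(self):
        with self._lock:
            return self.generation, self.value

    def test_and_set(self, expected_generation, value):
        # Atomically write and bump the generation, but only if no
        # other writer has bumped it since we read it.
        with self._lock:
            if self.generation != expected_generation:
                return False
            self.value = value
            self.generation += 1
            return True

def run_job(cache, compute):
    # If the test-and-set fails, another writer won; redo the whole job.
    while True:
        generation, _ = cache.read()
        result = compute()
        if cache.test_and_set(generation, result):
            return result
```

A slow worker holding a stale generation simply fails its write and recomputes, so out-of-order cache updates cannot clobber newer data.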

However, in the common case, job reduction on the unique id is not reliable in our SQL-based setup: a worker might be just about to COMMIT already-stale data to a denormalised row (our cache). Hence the need for this patch.

Changed in gearmand:
status: New → Fix Committed
status: Fix Committed → New