status command gives weird results

Bug #700608 reported by timuckun on 2011-01-09
This bug affects 4 people
Affects: Gearman
Importance: Medium
Assigned to: Unassigned

Bug Description

Sometimes the status command gives weird results for the "number of workers active" parameter. See example below.

screenshot 0 0 2
reddit 1 1 3
trending 0 0 3
fast_track 0 0 2
consolidated_url 396 3 2
digg 70 70 2
facebook 68 68 3
twitter 22 22 4
url_expire 2891 2553 4
search_result 4967 7 7

As you can see the third param is supposed to be the number of active workers but sometimes is the backlog count and sometimes just a random number.

Something is not right here.

Clint Byrum (clint-fewbar) wrote :

Tim, the scrape doesn't really show me anything without context. What I see are functions that each have between 2 and 7 workers.

Can you update the description to explain what you would expect, and what you see in error? Also, a way to repeat it would be super helpful, but I understand that's not always possible.

Marking Incomplete pending response from Tim.

Changed in gearmand:
status: New → Incomplete
timuckun (timuckun) wrote :

According to the documentation on the web the status command works like this.

status

    This sends back a list of all registered functions. Next to
    each function is the number of jobs in the queue, the number of
    running jobs, and the number of capable workers. The columns are
    tab separated, and the list is terminated with a line containing
    a single '.' (period). The format is:

    FUNCTION\tTOTAL\tRUNNING\tAVAILABLE_WORKERS

    Arguments:
    - None.

So the third item is supposed to be number of jobs running.
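
A minimal sketch of how a response in the documented format could be parsed, assuming tab-separated columns and a terminating "." line as quoted above. The names FunctionStatus and parse_status are made up for illustration and are not part of gearmand or any client library.

```python
from typing import List, NamedTuple


class FunctionStatus(NamedTuple):
    """One row of the status response (hypothetical helper type)."""
    name: str
    total: int              # jobs in the queue
    running: int            # jobs currently running
    available_workers: int  # workers capable of running this function


def parse_status(response: str) -> List[FunctionStatus]:
    """Parse FUNCTION\\tTOTAL\\tRUNNING\\tAVAILABLE_WORKERS lines up to the '.' line."""
    rows = []
    for line in response.splitlines():
        if line == ".":
            break  # a single period terminates the list
        if not line:
            continue  # skip blank lines defensively
        name, total, running, workers = line.split("\t")
        rows.append(FunctionStatus(name, int(total), int(running), int(workers)))
    return rows


# Two rows taken from the data posted in this report.
sample = "reddit\t1\t1\t3\ndigg\t70\t70\t2\n.\n"
for row in parse_status(sample):
    print(row.name, row.total, row.running, row.available_workers)
```

With the columns named this way, the third field is the one that should hold the running job count.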

If you look at the data I posted the following queues are reporting invalid data.

consolidated_url is reporting three running out of two workers.
facebook is reporting 68 running out of 3 workers.
url_expire is reporting 2553 running out of 4 workers.

In my case each worker takes one job at a time so there should be no instances where the running should exceed the number of workers.
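
The check described above can be sketched as follows, assuming (as I stated) that each worker takes one job at a time, so RUNNING should never exceed AVAILABLE_WORKERS. The rows are copied from the data in the bug description; the function name over_capacity is only illustrative.

```python
# (function, total, running, available_workers) rows from the bug description.
ROWS = [
    ("screenshot", 0, 0, 2),
    ("reddit", 1, 1, 3),
    ("trending", 0, 0, 3),
    ("fast_track", 0, 0, 2),
    ("consolidated_url", 396, 3, 2),
    ("digg", 70, 70, 2),
    ("facebook", 68, 68, 3),
    ("twitter", 22, 22, 4),
    ("url_expire", 2891, 2553, 4),
    ("search_result", 4967, 7, 7),
]


def over_capacity(rows):
    """Return the functions whose running count exceeds their worker count."""
    return [name for name, _total, running, workers in rows if running > workers]


print(over_capacity(ROWS))
# → ['consolidated_url', 'digg', 'facebook', 'twitter', 'url_expire']
```

By the same rule, digg and twitter are also flagged in addition to the three queues listed above.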

Brian Aker (brianaker) wrote :

Hi!

If a job fails to be deleted from the persistent queue, or a failure happens in gearman_server_io_packet_add(), the number can become out of sync.

A call to GEARMAN_COMMAND_RESET_ABILITIES will also cause issues with the number. If a worker dies halfway through a job, the number can also become out of sync.

The number is just not all that accurate. In the "best of cases" it should be accurate, but any sort of failures/etc can cause the number to go screwy.

Do you have any numbers on worker/etc failures?

Cheers,
   -Brian

Changed in gearmand:
status: Incomplete → Confirmed
importance: Undecided → Medium
timuckun (timuckun) wrote :

Sorry I don't have any metrics like that.

All I know is that something seems screwy with my combination of gearman 0.14 and the Ruby client worker.

You can see my other tickets about gearman locking up; the problem may be related.

I am now running gearman with the following flags

opt/sbin/gearmand --pid-file=/var/run/gearman/gearmand.pid --user=gearman --file-descriptors=49152 --threads=4 --daemon --log-file=/var/log/gearman-job-server/gearman.log -vvv -q libdrizzle --libdrizzle-table=momentumweb_queue --libdrizzle-host=my.host.com --libdrizzle-port=3306 --libdrizzle-user=gearman --libdrizzle-password=xxxxxxxxxx --libdrizzle-db=gearman_db_user --libdrizzle-mysql

If I run it without the threads option it does lock up occasionally.

Everything has to be restarted once that happens.

timuckun (timuckun) wrote :

I just wanted to report that this issue is also present in 0.16.

I cleared the queue, stopped all workers, restarted gearman with an empty queue and then started all workers. Within two minutes of starting workers I checked the stats and it was showing screwy values.

Might this be related to a server / client timeout? We see these quite often but have not had the time nor resources to track it down (it happens mostly in the production environment - which means less screwing around). We've seen it under higher load operations and operations that take a substantial amount of time (PHP based workers that is).

There are clearly still some very bad bugs in gearman. The 100% CPU
bug has been around for a while now and been reported by many people
(including me) but apparently it's a bear to track down and every time
they thought they fixed it the thing seems to come back.

I wonder if the java version is more stable.

Have you checked that out?

I haven't checked out the java version... we are actually debating about moving to a different solution due to some of the issues we have faced. It pains me to say it but unfortunately I haven't been able to locate the issue or a repeatable instance for many of these things.

Brian Aker (brianaker) wrote :

Hi!

On Aug 1, 2011, at 9:03 PM, timuckun wrote:

> The 100% CPU
> bug has been around for a while now and been reported by many people
> (including me) but apparently it's a bear to track down and every time
> they thought they fixed it the thing seems to come back.

We have not been given a repeatable test case by anyone. The one bug that existed for this is solved by setting the number of workers during startup (and adjusting timeouts).

Twice now we have looked into this problem for people and found that it was a matter of tuning.

Without a repeatable test case (or a customer who has a server we can examine) we can't tell whether the problem exists or not. Dozens of Linux versions, people with compiled kernels,... just too many variables.

Cheers,
 -Brian

Hi Brian,

> We have not been given a repeatable test case by anyone. The one bug that
> existed for this is solved by setting the number of workers during startup (and adjusting timeouts).

Are you talking about the --worker-wakeup argument of the server?

Secondly on the timeout - I do not see an option on gearmand specifically... are you saying in the client to specify a higher timeout? Right now we have our clients set to 1s in order to handle a custom timeout in our worker (which is calculated by code). Reasoning for the custom timeout is to avoid an issue we had with PHP and ensuring that it would listen to signals such as a SIGKILL without getting stuck in the loop waiting forever. Seems odd but it wouldn't work any other way...

> Without a repeatable test case (or a customer who has a server we can examine)
> we can't understand if the problem exists or not. Dozens of linux versions, people
> with compiled kernels,... just too many variables.

I could likely provide one the next time it spins out of control in one of our lower tiered environments. The hardest part has really been finding a way to repeat it. We have not found a single repeatable test case, but we do notice it happens more on 0.2x versions than on 0.1x.

Yonah Russ (c94cs2xkf) wrote :

I have this issue on gearmand 0.33

# (echo status ; sleep 0.1) | nc 10.100.105.140 4730
stdCalculateSegment 280 280 12
batchExactTarget 10 0 0
stdCompletedOrder 0 0 2
stdVisitedSite 9 9 11
batchVisitedSite 461 0 0
batchUserSlurps 400 0 0
batchEmails 0 0 0
batchPlayerRegistered 1 0 0
slurpFriends 2 2 24
dispatchNotifications 0 0 2
processNotificationResult 0 0 16
amazonMailer 0 0 9
sendNotification 2 2 8
slurpUsers 10 10 36

palik (1-infe-w) wrote :

In my case the error occurs when using libmemcached. See https://bugs.launchpad.net/gearmand/+bug/1417151
