Mysterious celeryd restart on worker nodes

Bug #879234 reported by Muharem Hrnjadovic
This bug affects 1 person
Affects: OpenQuake (deprecated)
Status: Won't Fix
Importance: Medium
Assigned to: Muharem Hrnjadovic

Bug Description

nohup: ignoring input
[2011-10-20 21:46:01,538: WARNING/MainProcess] discard: Erased 0 messages from the queue.
[2011-10-20 21:46:01,539: WARNING/MainProcess] -------------- celery@gemsun01 v2.3.1
---- **** -----
--- * *** * -- [Configuration]
-- * - **** --- . broker: amqplib://<email address hidden>:5672/
- ** ---------- . loader: celery.loaders.default.Loader
- ** ---------- . logfile: [stderr]@WARNING
- ** ---------- . concurrency: 32
- ** ---------- . events: OFF
- *** --- * --- . beat: OFF
-- ******* ----
--- ***** ----- [Queues]
 -------------- . celery: exchange:celery (direct) binding:celery
[2011-10-20 21:46:01,778: WARNING/MainProcess] celery@gemsun01 has started.
[2011-10-20 21:50:30,224: INFO/PoolWorker-28] WARNING: heartbeat is disabled.
[2011-10-20 21:50:30,226: INFO/PoolWorker-31] WARNING: heartbeat is disabled.
[2011-10-20 21:50:30,665: INFO/PoolWorker-32] WARNING: heartbeat is disabled.
[2011-10-20 21:50:31,364: INFO/PoolWorker-26] WARNING: heartbeat is disabled.
[2011-10-20 21:50:31,770: INFO/PoolWorker-25] WARNING: heartbeat is disabled.
[2011-10-20 21:50:32,285: INFO/PoolWorker-27] WARNING: heartbeat is disabled.
[2011-10-20 21:50:32,848: INFO/PoolWorker-21] WARNING: heartbeat is disabled.
[2011-10-20 21:50:33,363: INFO/PoolWorker-19] WARNING: heartbeat is disabled.
[2011-10-20 21:50:34,499: INFO/PoolWorker-20] WARNING: heartbeat is disabled.
[2011-10-20 21:50:35,187: INFO/PoolWorker-18] WARNING: heartbeat is disabled.
[2011-10-20 21:50:35,534: INFO/PoolWorker-17] WARNING: heartbeat is disabled.
[2011-10-20 21:50:36,095: INFO/PoolWorker-29] WARNING: heartbeat is disabled.
[2011-10-20 21:50:36,653: INFO/PoolWorker-23] WARNING: heartbeat is disabled.
[2011-10-21 00:01:40,264: WARNING/MainProcess] Restarting celeryd (/usr/bin/celeryd --purge)
[2011-10-21 00:01:52,150: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[2b06d928-7bb9-4297-8537-d0ea858723b7] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,188: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[a7bf91a6-20cf-4a57-83b5-2b21670717c3] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,208: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[fb420554-4d7d-46a0-b90d-fa42a60458ff] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,230: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[8acbaae8-6f28-41fb-ab24-addfd23bff63] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,246: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[23d1bd6c-896f-46ad-b300-8f4104c74490] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,270: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[4e120d69-eada-41fb-86f4-d261ea0d7ec7] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,295: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[cf42d389-3807-4d97-9d04-f1440274bb66] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,319: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[bf0f85ef-0b70-4912-a62e-5d6e2b6e0466] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,345: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[6c83db80-7cbe-4246-9d92-0babe93a59c8] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,369: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[8cd2fc05-93bb-421a-abfa-03c4f8e0aeb6] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,387: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[157ca435-691c-428e-a1c7-722f7263d2d8] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,403: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[8328702d-a405-49a8-b5d9-1032cd27a47c] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:52,429: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[41ff86b1-4bee-43c3-8504-2f140e7bed93] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None

------------------------------------------------------------------------

nohup: ignoring input
[2011-10-20 21:47:42,977: WARNING/MainProcess] discard: Erased 0 messages from the queue.
[2011-10-20 21:47:42,980: WARNING/MainProcess] -------------- celery@gemsun03 v2.3.1
---- **** -----
--- * *** * -- [Configuration]
-- * - **** --- . broker: amqplib://<email address hidden>:5672/
- ** ---------- . loader: celery.loaders.default.Loader
- ** ---------- . logfile: [stderr]@WARNING
- ** ---------- . concurrency: 32
- ** ---------- . events: OFF
- *** --- * --- . beat: OFF
-- ******* ----
--- ***** ----- [Queues]
 -------------- . celery: exchange:celery (direct) binding:celery
[2011-10-20 21:47:43,053: WARNING/MainProcess] celery@gemsun03 has started.
[2011-10-20 21:52:09,434: INFO/PoolWorker-31] WARNING: heartbeat is disabled.
[2011-10-20 21:52:09,956: INFO/PoolWorker-20] WARNING: heartbeat is disabled.
[2011-10-20 21:52:10,427: INFO/PoolWorker-12] WARNING: heartbeat is disabled.
[2011-10-20 21:52:11,105: INFO/PoolWorker-5] WARNING: heartbeat is disabled.
[2011-10-20 21:52:11,537: INFO/PoolWorker-4] WARNING: heartbeat is disabled.
[2011-10-20 21:52:12,096: INFO/PoolWorker-27] WARNING: heartbeat is disabled.
[2011-10-20 21:52:12,627: INFO/PoolWorker-14] WARNING: heartbeat is disabled.
[2011-10-20 21:52:13,153: INFO/PoolWorker-6] WARNING: heartbeat is disabled.
[2011-10-20 21:52:14,262: INFO/PoolWorker-21] WARNING: heartbeat is disabled.
[2011-10-20 21:52:14,939: INFO/PoolWorker-2] WARNING: heartbeat is disabled.
[2011-10-20 21:52:15,348: INFO/PoolWorker-19] WARNING: heartbeat is disabled.
[2011-10-20 21:52:15,960: INFO/PoolWorker-25] WARNING: heartbeat is disabled.
[2011-10-20 21:52:16,514: INFO/PoolWorker-29] WARNING: heartbeat is disabled.
[2011-10-21 00:03:19,680: WARNING/MainProcess] Restarting celeryd (/usr/bin/celeryd --purge)
[2011-10-21 00:03:28,432: WARNING/MainProcess] discard: Erased 0 messages from the queue.
[2011-10-21 00:03:28,434: WARNING/MainProcess] -------------- celery@gemsun03 v2.3.1
---- **** -----
--- * *** * -- [Configuration]
-- * - **** --- . broker: amqplib://<email address hidden>:5672/
- ** ---------- . loader: celery.loaders.default.Loader
- ** ---------- . logfile: [stderr]@WARNING
- ** ---------- . concurrency: 32
- ** ---------- . events: OFF
- *** --- * --- . beat: OFF
-- ******* ----
--- ***** ----- [Queues]
 -------------- . celery: exchange:celery (direct) binding:celery
[2011-10-21 00:03:28,507: WARNING/MainProcess] celery@gemsun03 has started.

------------------------------------------------------------------------

nohup: ignoring input
[2011-10-20 21:45:28,187: WARNING/MainProcess] discard: Erased 0 messages from the queue.
[2011-10-20 21:45:28,191: WARNING/MainProcess] -------------- celery@gemsun04 v2.3.1
---- **** -----
--- * *** * -- [Configuration]
-- * - **** --- . broker: amqplib://<email address hidden>:5672/
- ** ---------- . loader: celery.loaders.default.Loader
- ** ---------- . logfile: [stderr]@WARNING
- ** ---------- . concurrency: 32
- ** ---------- . events: OFF
- *** --- * --- . beat: OFF
-- ******* ----
--- ***** ----- [Queues]
 -------------- . celery: exchange:celery (direct) binding:celery
[2011-10-20 21:45:28,265: WARNING/MainProcess] celery@gemsun04 has started.
[2011-10-20 21:49:48,891: INFO/PoolWorker-6] WARNING: heartbeat is disabled.
[2011-10-20 21:49:49,405: INFO/PoolWorker-32] WARNING: heartbeat is disabled.
[2011-10-20 21:49:49,958: INFO/PoolWorker-28] WARNING: heartbeat is disabled.
[2011-10-20 21:49:50,584: INFO/PoolWorker-7] WARNING: heartbeat is disabled.
[2011-10-20 21:49:51,003: INFO/PoolWorker-18] WARNING: heartbeat is disabled.
[2011-10-20 21:49:51,530: INFO/PoolWorker-3] WARNING: heartbeat is disabled.
[2011-10-20 21:49:52,113: INFO/PoolWorker-12] WARNING: heartbeat is disabled.
[2011-10-20 21:49:53,194: INFO/PoolWorker-1] WARNING: heartbeat is disabled.
[2011-10-20 21:49:53,769: INFO/PoolWorker-19] WARNING: heartbeat is disabled.
[2011-10-20 21:49:54,408: INFO/PoolWorker-20] WARNING: heartbeat is disabled.
[2011-10-20 21:49:54,865: INFO/PoolWorker-15] WARNING: heartbeat is disabled.
[2011-10-20 21:49:55,356: INFO/PoolWorker-2] WARNING: heartbeat is disabled.
[2011-10-21 00:01:04,245: WARNING/MainProcess] Restarting celeryd (/usr/bin/celeryd --purge)
[2011-10-21 00:01:14,927: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[af937660-14eb-4890-976a-bf201ed85e31] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,765: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[0261a92d-795a-4f99-9f60-ae9347dc00e1] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,776: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[37203109-1af5-4c72-9542-b8878959e6e6] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,786: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[f4d4471d-1bdf-42a0-879e-fc88c07e27f5] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,795: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[7e2714ba-adf3-41ee-ba15-f189d790d578] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,804: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[5933ca7a-0521-49b2-acd9-bbfe9a1bc2ce] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,814: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[62f52cef-69b6-4dd3-bf7a-ab6cb777d9dc] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,823: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[150e98fb-173c-4227-8522-b66f2d754f7b] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,834: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[649b5c7a-da3c-46eb-a67f-5149e34c75d6] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,844: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[69fabaeb-4631-4307-b3ba-8e30e0f830cd] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,854: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[860b5949-e86a-4143-8893-f61bd02cd560] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None
[2011-10-21 00:01:15,863: ERROR/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[087ac4a7-5e24-4057-ace6-655008f5bddc] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.
None

Tags: celery devop
tags: added: celery
tags: added: devop
Changed in openquake:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Muharem Hrnjadovic (al-maisan)
milestone: none → 0.4.5
Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

This could be related: http://groups.google.com/group/celery-users/browse_thread/thread/c86a5142e4488253/8924b5d0a787a688?lnk=gst&q=hard#8924b5d0a787a688

We may have to refresh the python-celery package since rev. 2.3.1 seems to have introduced a regression..

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

More generally, our hazard curve calculation tasks may be too heavy and may be expiring. We may need to set the CELERY_TASK_RESULT_EXPIRES parameter to a high value (24 hours?).
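If result expiry turns out to be the culprit, the change would be a single line in the celery configuration module. A minimal sketch, assuming the 24-hour value floated above (the setting name is celery's; the value is the guess from this comment, not a measured requirement):

```python
# celeryconfig.py fragment (sketch): retain task results for a full day so
# long-running hazard jobs can still collect them.  The 24 h figure is the
# value proposed in this comment, not a verified requirement.
from datetime import timedelta

CELERY_TASK_RESULT_EXPIRES = timedelta(hours=24)
```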

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

Hmm .. one of the 2.3.2 fixes is:

"Fixes case where the worker could become unresponsive because of tasks exceeding the hard time limit."

(http://docs.celeryproject.org/en/latest/changelog.html#fixes)

Maybe we should repackage python-celery rev. 2.3.3 and see whether this problem goes away..

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

Hmm .. from celery/apps/worker.py, line 318

def install_worker_restart_handler(worker):

    def restart_worker_sig_handler(signum, frame):
        """Signal handler restarting the current python program."""
        print("Restarting celeryd (%s)" % (" ".join(sys.argv), ))
        worker.stop(in_sighandler=True)
        os.execv(sys.executable, [sys.executable] + sys.argv)

    platforms.signals["SIGHUP"] = restart_worker_sig_handler

celeryd must have restarted because it received a SIGHUP .. where would that come from?
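One way to confirm the SIGHUP theory would be to run a short diagnostic on a test worker that records hangups instead of restarting. This is a hypothetical probe, not part of celery; `hup_events` and `record_sighup` are invented names for illustration:

```python
import os
import signal
import time

hup_events = []  # timestamps of SIGHUPs observed by this probe

def record_sighup(signum, frame):
    """Note when a SIGHUP arrives instead of restarting, so we can see
    whether (and when) something external is hanging up the process."""
    hup_events.append(time.time())

# NB: on a real worker this would replace celeryd's own restart handler,
# which is exactly the point of a short diagnostic run.
signal.signal(signal.SIGHUP, record_sighup)
```

Correlating the recorded timestamps with cron entries, terminal disconnects, or process-manager activity on the node might reveal the sender.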

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

Just came to realise that a set of celeryd workers has been running on gemsun02 too. That may have caused some of the trouble:

gemsun02 ~ $ ps ax | grep celeryd | grep -v grep
 9558 ? Sl 2:14 /usr/bin/python /usr/bin/celeryd --purge
25358 ? Sl 2:27 /usr/bin/python /usr/bin/celeryd --purge
25504 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25505 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25506 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25507 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25508 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25509 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25510 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25511 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25512 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25513 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25514 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25515 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25516 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25517 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25518 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25519 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25520 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25521 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25522 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25523 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25524 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25525 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25526 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25527 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25528 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25529 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25530 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25531 ? S 0:01 /usr/bin/python /usr/bin/celeryd --purge
25532 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25533 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25534 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge
25535 ? S 0:00 /usr/bin/python /usr/bin/celeryd --purge

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

I am reducing the load on the worker machines (running only 16 celeryd worker processes) in order to see whether that makes a difference.

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

Until the root cause of the celeryd restarts is found we can work around them by configuring celery/openquake as follows:

    - Turn on CELERY_ACKS_LATE: late ack means the task messages will be acknowledged after the task has been executed (not just before, which is the default behavior)
    - Turn off message prefetching (CELERYD_PREFETCH_MULTIPLIER := 1)
    - Make the actual tasks small by using a relatively high HAZARD_TASKS value
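As a sketch, the celery-side settings above would land in the configuration module like this (the first two names are celery's; HAZARD_TASKS is OpenQuake-specific and the value shown is purely illustrative):

```python
# celeryconfig.py fragment (sketch) for the workaround described above.
CELERY_ACKS_LATE = True          # acknowledge a task message only after the
                                 # task has run, so a lost worker's tasks
                                 # are redelivered instead of silently lost
CELERYD_PREFETCH_MULTIPLIER = 1  # reserve one message per worker process
                                 # at a time instead of prefetching a batch

# OpenQuake-side knob: split the job into many small tasks.  The value 256
# is illustrative only, not a recommendation from this bug.
HAZARD_TASKS = 256
```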

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

I have not observed this any more, hence marking this bug as "incomplete".

Changed in openquake:
status: In Progress → Incomplete
Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

This problem cropped up on gemsun01 and gemsun03 again.

Changed in openquake:
status: Incomplete → Confirmed
milestone: 0.4.5 → 0.4.6
Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

.. with celery-2.4.2

John Tarter (toh2)
Changed in openquake:
milestone: 0.4.6 → 0.5.0
Changed in openquake:
importance: High → Medium
milestone: 0.5.0 → 0.5.1
Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

I have not seen this occurring any more.

Changed in openquake:
status: Confirmed → Invalid
Changed in openquake:
status: Invalid → Confirmed
Revision history for this message
Paul Henshaw (paul-sl-henshaw) wrote :

This happened again yesterday on hope while running a model for Graeme. Re-opening bug; further investigation required.

Revision history for this message
matley (matley) wrote :
Changed in openquake:
status: Confirmed → Won't Fix