celery hung after long git repository scan

Bug #1783315 reported by Colin Watson on 2018-07-24
This bug affects 4 people
Affects: Launchpad itself
Importance: Critical
Assigned to: Colin Watson

Bug Description

launchpad@ackee's celery hung today. Log excerpts:

[2018-07-24 06:39:12,597: INFO/Worker-2] Running <GitRefScanJob for ~documentfoundation/df-libreoffice/+git/core> (ID 43870243) in status Waiting
[2018-07-24 06:39:12,603: DEBUG3/Worker-2] commit <storm.zope.zstorm.StoreDataManager object at 0x7fc9c57e9650>
[2018-07-24 06:39:12,604: DEBUG3/Worker-2] commit
[2018-07-24 06:39:12,604: DEBUG3/Worker-2] new transaction
[2018-07-24 06:39:12,628: INFO/Worker-2] Starting new HTTP connection (1): git.launchpad.net
[2018-07-24 06:39:18,995: DEBUG/Worker-2] "GET /repo/2115/refs HTTP/1.1" 200 14755548
[2018-07-24 06:40:07,538: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'39fbc8dc-3604-3738-a347-e6f482abf3c9', u'exchange': u'reply.celery.pidbox'} ticket:e8beeca3-cdbe-4daf-8ca2-7855189b4f01]
[2018-07-24 06:40:44,064: INFO/Worker-2] Requesting commit details for [u'f5b8531acc1e944c2e5a2bbee831e562fba89655']
[2018-07-24 06:42:28,485: DEBUG/MainProcess] basic.qos: prefetch_count->7
[2018-07-24 06:43:02,514: DEBUG/MainProcess] basic.qos: prefetch_count->6
[2018-07-24 06:44:12,577: WARNING/MainProcess] Soft time limit (300.0s) exceeded for lp.services.job.celeryjob.CeleryRunJobIgnoreResult[GitRefScanJob_43870243_1c10b906-f8e9-4edc-8b75-921f86234ee9]
[2018-07-24 06:44:58,615: DEBUG/MainProcess] basic.qos: prefetch_count->5
[2018-07-24 06:45:06,699: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'b3a4a1d2-bca7-3d14-ab06-90b4dbe252ba', u'exchange': u'reply.celery.pidbox'} ticket:93eea71c-7d3c-458d-9d5e-474783a05251]
[2018-07-24 06:46:12,681: DEBUG/MainProcess] basic.qos: prefetch_count->4
[2018-07-24 06:48:00,775: DEBUG/MainProcess] basic.qos: prefetch_count->3
[2018-07-24 06:50:06,999: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'73c37ffd-59aa-3fa6-a97c-09a49497ada3', u'exchange': u'reply.celery.pidbox'} ticket:9f356f5d-b925-48ae-912e-6f8402b2f383]
[2018-07-24 06:55:06,498: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'9dd52797-6431-3d48-bd0c-2a0d39f8fe18', u'exchange': u'reply.celery.pidbox'} ticket:e0dcec74-33e4-4b5d-84d8-1b4a42fa2bf0]

... and then nothing else for several hours. The libreoffice repository has been fixed, but it's not obvious why the timeout apparently stalled celery.

Colin Watson (cjwatson) wrote :

This happened again the very next time there was a soft timeout on a repository scan, so there's obviously something repeatably wrong here.

Colin Watson (cjwatson) on 2018-07-25
tags: added: git lp-code regression
Changed in launchpad:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Colin Watson (cjwatson)
Colin Watson (cjwatson) wrote :

I can't be absolutely sure when this regressed, since such soft timeouts aren't very common; there've only been three since the last rollout. One thing I noticed was that a soft timeout on a PlainPackageCopyJob on 2018-07-16 didn't cause celery to hang, but soft timeouts on GitRefScanJob instances did. My current suspicion is that it has something to do with urlfetch's thread handling.
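To make the suspected interaction concrete: Celery's soft time limit is delivered to the pool child as a signal, so SoftTimeLimitExceeded can only be raised in that process's main thread. The following is a minimal hypothetical sketch (not Launchpad's actual urlfetch; do_request is a placeholder for the blocking HTTP call) of how a thread-based fetch wrapper could leave a child permanently busy:

    import threading

    from celery.exceptions import SoftTimeLimitExceeded

    def fetch_in_helper_thread(do_request):
        # Run a blocking request in a helper thread (placeholder wrapper).
        result = {}

        def run():
            result["body"] = do_request()

        helper = threading.Thread(target=run)
        helper.start()
        try:
            # The soft-limit signal is handled in the main thread, and on
            # Python 2 it may not even interrupt a bare join().
            helper.join()
        except SoftTimeLimitExceeded:
            # Even if the exception does land here, the helper thread is
            # still blocked inside do_request(), so this child never goes
            # idle again.
            raise
        return result.get("body")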

Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Joel Sing (jsing) wrote :

This (or similar) seems to have recurred:

[2018-07-30 21:59:37,919: WARNING/MainProcess] Soft time limit (300.0s) exceeded for lp.services.job.celeryjob.CeleryRunJobIgnoreResult[GitRefScanJob_43982237_f031800a-c77b-463d-831f-d822a9740d7b]

No work appears to have been processed following that point (until celeryd was bounced).

William Grant (wgrant) on 2018-08-01
tags: added: qa-ok
removed: qa-needstesting
Colin Watson (cjwatson) on 2018-09-01
Changed in launchpad:
status: Fix Committed → Triaged
tags: removed: qa-ok
Colin Watson (cjwatson) wrote :

I can reproduce a somewhat similar hang locally by sending SIGSTOP to a single worker process and then firing off a succession of jobs. It seems that the file descriptor that celery uses to communicate with the worker is still writable, so celery sends a task to it (due to prefetching) and then gets stuck. This is essentially the situation described in https://celery.readthedocs.io/en/latest/userguide/optimizing.html#prefork-pool-prefetch-settings, and indeed running the worker with -Ofair does help.
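As a concrete sketch of that mitigation (the app module name is a placeholder, not necessarily how Launchpad launches celeryd), the worker can be started with the fair scheduling strategy so that tasks are only handed to pool children that ask for work:

    import subprocess

    # Reproduction, roughly as described above: SIGSTOP one pool child,
    # then enqueue a few jobs; with default prefetching the parent still
    # writes a task to the stopped child's pipe and then waits on it.
    # -Ofair avoids handing tasks to children that haven't asked for work.
    subprocess.check_call([
        "celery", "worker",
        "--app", "myapp.celeryconfig",   # placeholder module
        "--concurrency", "3",
        "-Ofair",
    ])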

However, that still leaves a task stuck in a reserved state on a worker that's effectively a zombie, and celery never seems to notice this; over time we'll just end up with all three worker processes stuck and then we'll be back to a similar situation, just less frequently. So we really need to notice that the worker process is stuck and replace it; I haven't quite worked out yet how to persuade celery to do that.
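One candidate (a sketch using Celery's stock settings, not a confirmed Launchpad change) is to pair the 300s soft limit with a hard time limit: when the hard limit is exceeded, the parent kills the stuck child outright and starts a replacement instead of waiting for it to come back:

    # Celery 3.x-style settings; the 330s hard limit is illustrative.
    CELERYD_TASK_SOFT_TIME_LIMIT = 300  # raises SoftTimeLimitExceeded in the task
    CELERYD_TASK_TIME_LIMIT = 330       # parent SIGKILLs the child and replaces it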

All this is independent of why the worker is hanging in the first place, but at least it explains why the entire pool is getting stuck.
