celery hung after long git repository scan

Bug #1783315 reported by Colin Watson on 2018-07-24
This bug affects 4 people
Affects: Launchpad itself
Importance: Critical
Assigned to: Colin Watson

Bug Description

launchpad@ackee's celery hung today. Log excerpts:

[2018-07-24 06:39:12,597: INFO/Worker-2] Running <GitRefScanJob for ~documentfoundation/df-libreoffice/+git/core> (ID 43870243) in status Waiting
[2018-07-24 06:39:12,603: DEBUG3/Worker-2] commit <storm.zope.zstorm.StoreDataManager object at 0x7fc9c57e9650>
[2018-07-24 06:39:12,604: DEBUG3/Worker-2] commit
[2018-07-24 06:39:12,604: DEBUG3/Worker-2] new transaction
[2018-07-24 06:39:12,628: INFO/Worker-2] Starting new HTTP connection (1): git.launchpad.net
[2018-07-24 06:39:18,995: DEBUG/Worker-2] "GET /repo/2115/refs HTTP/1.1" 200 14755548
[2018-07-24 06:40:07,538: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'39fbc8dc-3604-3738-a347-e6f482abf3c9', u'exchange': u'reply.celery.pidbox'} ticket:e8beeca3-cdbe-4daf-8ca2-7855189b4f01]
[2018-07-24 06:40:44,064: INFO/Worker-2] Requesting commit details for [u'f5b8531acc1e944c2e5a2bbee831e562fba89655']
[2018-07-24 06:42:28,485: DEBUG/MainProcess] basic.qos: prefetch_count->7
[2018-07-24 06:43:02,514: DEBUG/MainProcess] basic.qos: prefetch_count->6
[2018-07-24 06:44:12,577: WARNING/MainProcess] Soft time limit (300.0s) exceeded for lp.services.job.celeryjob.CeleryRunJobIgnoreResult[GitRefScanJob_43870243_1c10b906-f8e9-4edc-8b75-921f86234ee9]
[2018-07-24 06:44:58,615: DEBUG/MainProcess] basic.qos: prefetch_count->5
[2018-07-24 06:45:06,699: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'b3a4a1d2-bca7-3d14-ab06-90b4dbe252ba', u'exchange': u'reply.celery.pidbox'} ticket:93eea71c-7d3c-458d-9d5e-474783a05251]
[2018-07-24 06:46:12,681: DEBUG/MainProcess] basic.qos: prefetch_count->4
[2018-07-24 06:48:00,775: DEBUG/MainProcess] basic.qos: prefetch_count->3
[2018-07-24 06:50:06,999: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'73c37ffd-59aa-3fa6-a97c-09a49497ada3', u'exchange': u'reply.celery.pidbox'} ticket:9f356f5d-b925-48ae-912e-6f8402b2f383]
[2018-07-24 06:55:06,498: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'9dd52797-6431-3d48-bd0c-2a0d39f8fe18', u'exchange': u'reply.celery.pidbox'} ticket:e0dcec74-33e4-4b5d-84d8-1b4a42fa2bf0]

... and then nothing else for several hours. The libreoffice repository has been fixed, but it's not obvious why the timeout apparently stalled celery.

Colin Watson (cjwatson) wrote :

This happened again the very next time there was a soft timeout on a repository scan, so there's obviously something repeatably wrong here.

Colin Watson (cjwatson) on 2018-07-25
tags: added: git lp-code regression
Changed in launchpad:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Colin Watson (cjwatson)
Colin Watson (cjwatson) wrote :

I can't be absolutely sure when this regressed, since such soft timeouts aren't very common; there've only been three since the last rollout. One thing I noticed was that a soft timeout on a PlainPackageCopyJob on 2018-07-16 didn't cause celery to hang, but soft timeouts on GitRefScanJob instances did. My current suspicion is that it has something to do with urlfetch's thread handling.
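To make the suspected interaction concrete: Celery's soft time limit is delivered to the pool child as a signal, so SoftTimeLimitExceeded can only be raised in that process's main thread. The following is a minimal hypothetical sketch (not Launchpad's actual urlfetch; do_request is a placeholder for the blocking HTTP call) of how a thread-based fetch wrapper could leave a child permanently busy:

    import threading

    from celery.exceptions import SoftTimeLimitExceeded

    def fetch_in_helper_thread(do_request):
        # Run a blocking request in a helper thread (placeholder wrapper).
        result = {}

        def run():
            result["body"] = do_request()

        helper = threading.Thread(target=run)
        helper.start()
        try:
            # The soft-limit signal is handled in the main thread, and on
            # Python 2 it may not even interrupt a bare join().
            helper.join()
        except SoftTimeLimitExceeded:
            # Even if the exception does land here, the helper thread is
            # still blocked inside do_request(), so this child never goes
            # idle again.
            raise
        return result.get("body")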

Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Joel Sing (jsing) wrote :

This (or similar) seems to have recurred:

[2018-07-30 21:59:37,919: WARNING/MainProcess] Soft time limit (300.0s) exceeded for lp.services.job.celeryjob.CeleryRunJobIgnoreResult[GitRefScanJob_43982237_f031800a-c77b-463d-831f-d822a9740d7b]

No work appears to have been processed following that point (until celeryd was bounced).

William Grant (wgrant) on 2018-08-01
tags: added: qa-ok
removed: qa-needstesting
Colin Watson (cjwatson) on 2018-09-01
Changed in launchpad:
status: Fix Committed → Triaged
tags: removed: qa-ok
Colin Watson (cjwatson) wrote :

I can reproduce a somewhat similar hang locally by sending SIGSTOP to a single worker process and then firing off a succession of jobs. It seems that the file descriptor that celery uses to communicate with the worker is still writable, so celery sends a task to it (due to prefetching) and then gets stuck. This is essentially the situation described in https://celery.readthedocs.io/en/latest/userguide/optimizing.html#prefork-pool-prefetch-settings, and indeed running the worker with -Ofair does help.
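As a concrete sketch of that mitigation (the app module name is a placeholder, not necessarily how Launchpad launches celeryd), the worker can be started with the fair scheduling strategy so that tasks are only handed to pool children that ask for work:

    import subprocess

    # Reproduction, roughly as described above: SIGSTOP one pool child,
    # then enqueue a few jobs; with default prefetching the parent still
    # writes a task to the stopped child's pipe and then waits on it.
    # -Ofair avoids handing tasks to children that haven't asked for work.
    subprocess.check_call([
        "celery", "worker",
        "--app", "myapp.celeryconfig",   # placeholder module
        "--concurrency", "3",
        "-Ofair",
    ])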

However, that still leaves a task stuck in a reserved state on a worker that's effectively a zombie, and celery never seems to notice this; over time we'll just end up with all three worker processes stuck and then we'll be back to a similar situation, just less frequently. So we really need to notice that the worker process is stuck and replace it; I haven't quite worked out yet how to persuade celery to do that.
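One candidate (a sketch using Celery's stock settings, not a confirmed Launchpad change) is to pair the 300s soft limit with a hard time limit: when the hard limit is exceeded, the parent kills the stuck child outright and starts a replacement instead of waiting for it to come back:

    # Celery 3.x-style settings; the 330s hard limit is illustrative.
    CELERYD_TASK_SOFT_TIME_LIMIT = 300  # raises SoftTimeLimitExceeded in the task
    CELERYD_TASK_TIME_LIMIT = 330       # parent SIGKILLs the child and replaces it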

All this is independent of why the worker is hanging in the first place, but at least it explains why the entire pool is getting stuck.
