celery workers sometimes end up cursed and produce OOPSes for all SnapStoreUploadJobs

Bug #1792920 reported by Colin Watson
This bug affects 4 people
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

We've had two incidents today where a celery worker got into a state where every SnapStoreUploadJob it ran failed with an SSLError (e.g. OOPS-5659f581f26c56b85511fa459f81a91d). Other jobs run by the same worker seem to be fine while it's in this state. I suspect bad connection pooling may be involved.

Tags: oops
Colin Watson (cjwatson) wrote :

This is due to a single worker getting ENOMEM in response to all mmap syscalls even though there is no clear reason why that should be the case (plenty of memory and not an unreasonable number of current maps). We don't yet know why this is happening.
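
For anyone wanting to repeat the "number of current maps" check on a Linux worker, a minimal sketch follows. It assumes the usual procfs layout (`/proc/<pid>/maps`, one line per mapping) and compares the count against `vm.max_map_count`, since exceeding that limit is one documented reason for mmap to return ENOMEM. The function names and the `proc_root` parameter are made up for illustration; this is not the tooling used during the incident.

```python
import os

# Assumption: standard Linux sysctl path for the per-process mapping limit.
DEFAULT_LIMIT_PATH = "/proc/sys/vm/max_map_count"


def count_maps(maps_path):
    """Count non-empty lines in a /proc/<pid>/maps-style file (one per mapping)."""
    with open(maps_path) as maps_file:
        return sum(1 for line in maps_file if line.strip())


def read_map_limit(limit_path=DEFAULT_LIMIT_PATH):
    """Read the maximum number of mappings a process may hold."""
    with open(limit_path) as limit_file:
        return int(limit_file.read().strip())


def map_usage(pid, proc_root="/proc", limit_path=DEFAULT_LIMIT_PATH):
    """Return (current_map_count, limit) for the given process ID."""
    maps_path = os.path.join(proc_root, str(pid), "maps")
    return count_maps(maps_path), read_map_limit(limit_path)
```

Calling `map_usage(worker_pid)` as a user allowed to read that process's `maps` file gives the two numbers to compare. A count well below the limit, as described here, rules out that one cause but not others, such as an address-space rlimit.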

Colin Watson (cjwatson) wrote :

I'm hoping that https://code.launchpad.net/~cjwatson/launchpad/optimise-git-ref-scan/+merge/359171 may improve the situation here. We're also considering no longer scanning refs/changes/*, since it's very large for some repositories and not super-useful to have in the LP database (as opposed to in git).
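
To illustrate what skipping refs/changes/* would amount to, here is a rough sketch of filtering ref paths by prefix before scanning. The function name `refs_to_scan` and its `skipped_prefixes` parameter are hypothetical; this is not the code from the linked merge proposal or from Launchpad's ref scanner.

```python
def refs_to_scan(ref_paths, skipped_prefixes=("refs/changes/",)):
    """Return ref paths that do not start with any skipped prefix.

    Hypothetical helper: the default skips everything under refs/changes/
    and keeps the input order for all other refs.
    """
    prefixes = tuple(skipped_prefixes)
    return [path for path in ref_paths if not path.startswith(prefixes)]
```

With this shape, the refs stay available in git itself and are simply never handed to the database-side scan.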

Colin Watson (cjwatson) wrote :

This hasn't been a problem for some time. I don't know whether it was the fixes mentioned in my previous comment or something else, but I'll take it.

Changed in launchpad:
status: Triaged → Fix Released