codebrowse hangs in production
Bug #928327 reported by Deryck Hodge
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
loggerhead | Triaged | Critical | Unassigned |
loggerhead-breezy | Incomplete | Undecided | Unassigned |
Bug Description
We have LP incidents reported where codebrowse hangs. I was able to get a couple of core dumps to look at:
https:/
https:/
And here are two better backtraces:
https:/
https:/
summary:
- codebrowse hangs due to exception/oops handling
+ codebrowse hangs in production
description: updated
description: updated
Changed in loggerhead-breezy:
status: New → Incomplete
Here's lifeless summarizing in IRC:
<lifeless> one core has damaged (I suspect killed but not joined()) threads including a missing mainloop. The missing mainloop would on its own make it appear dead to haproxy.
<lifeless> It is in gc in another thread; one possible theory is it got too big memory wise and what we are looking at is damaged fallout from some attempt to recover it
<lifeless> the other core appears entirely healthy except for the oddness that stuff is stuck in send(); but that is normal if the OS buffer is full, which will happen if the internets are not brilliantly happy (because buffering affects the entire chain)
<lifeless> so we need to know for the first one, as much as we can about how it got to that state - were any sysadmin interventions applied first? (if so, the core doesn't represent the failure, it represents the failure + mangling)
<lifeless> for the second, we need to know the symptoms that were being reported
I'll attach a complete transcript from IRC for those curious and/or working on this.
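
For anyone who wants to see the send() behaviour lifeless describes for the second core, here is a minimal sketch in Python. It is not loggerhead code: the function name `bytes_until_send_would_block` is hypothetical, and a local `socket.socketpair()` stands in for the connection between codebrowse and a client that has stopped reading. The writer is made non-blocking so that the point where a blocking send() would stall shows up as a `BlockingIOError` instead of an actual hang.

```python
import socket


def bytes_until_send_would_block(chunk_size=4096, limit=64 * 1024 * 1024):
    """Write to a socket whose peer never reads.

    Returns the number of bytes the OS accepted before its buffer filled.
    Hypothetical helper for illustration only; not part of loggerhead.
    """
    # A connected pair of local sockets; "reader" never calls recv(),
    # which stands in for a slow or stalled consumer further down the chain.
    writer, reader = socket.socketpair()
    try:
        # Non-blocking mode: a full buffer raises instead of hanging.
        writer.setblocking(False)
        chunk = b"x" * chunk_size
        total = 0
        while total < limit:  # safety cap so the loop always terminates
            try:
                total += writer.send(chunk)
            except BlockingIOError:
                # This is the point where a blocking send() would sit
                # and wait, as seen in the backtrace.
                return total
        return total
    finally:
        writer.close()
        reader.close()


if __name__ == "__main__":
    accepted = bytes_until_send_would_block()
    print(f"OS accepted {accepted} bytes before send() would have blocked")
```

The exact byte count depends on the platform's socket buffer sizes. The point is only that a thread sitting in send() is expected whenever the peer is not draining data, so on its own it does not indicate a fault in the process.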