Investigate lockups and conflict errors

Bug #382457 reported by Paul Everitt
Affects: KARL3
Status: Fix Released
Importance: High
Assigned to: Shane Hathaway
Milestone: (none listed)

Bug Description

As noted below (and sent to osi-dev this morning), we're having stability problems on karl.soros.org. Symptoms:

1) The points below, including a relatively high CPU load on one of the httpd processes.

2) Eventually the backlog clears up.

3) Chris Rossi was seeing a bunch of conflict errors.

We need someone who can:

1) Keep an eye on the system and quickly spot when it goes into a tailspin. (Perhaps automate something that looks at the server status URL.)

2) Jump in and do forensics before restarting.

3) See if lockups and conflict errors are related.
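
A minimal sketch of the "automate something that looks at the server status URL" idea in item 1. It assumes the status URL is Apache's mod_status page in its machine-readable `?auto` form, which reports `BusyWorkers` and `IdleWorkers` as `Key: value` lines. The function names and the `min_idle` threshold are hypothetical, chosen only for illustration.

```python
import urllib.request


def parse_status(text):
    """Parse 'Key: value' lines from mod_status ?auto output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_backlogged(fields, min_idle=2):
    """Return True when fewer than min_idle workers are idle.

    min_idle is an arbitrary illustrative threshold, not a tuned value.
    A missing IdleWorkers field is treated as zero idle workers.
    """
    return int(fields.get("IdleWorkers", "0")) < min_idle


def check(url, timeout=10):
    """Fetch the status page once and report whether it looks backlogged."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8", "replace")
    return is_backlogged(parse_status(text))
```

Run from cron or a monitoring agent, `check()` could trigger an alert so that someone can do forensics before the restart.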

As an additional note, OSI wired in the "business center" links late in the process (src/karl.external_link_ticket). They use a handshake-over-HTTP protocol that (slim chance) might not be behaving correctly.

Paul Everitt (paul-agendaless) wrote :

Shane, this is probably the one that needs concentrated attention.

Changed in karl3:
assignee: nobody → Shane Hathaway (shane-hathawaymix)
Paul Everitt (paul-agendaless) wrote :

As further info, I was just editing in bin/debug. I removed a File object and committed a transaction and got:

ConflictError: database conflict error (oid 0x028a, class zope.index.text.okapiindex.OkapiIndex)

Shane Hathaway (shane-hathawaymix) wrote :

It appears to me that repoze.retry is in fact doing its job. All of the conflict errors in the log are preceded by the line "repoze.retry retrying, count = 1", indicating that the request is being retried automatically without showing the user any error. Unfortunately, repoze.retry can't do anything about conflict errors in a Python session, but that doesn't matter to users.

I suggest repoze.retry should format the log entries as single-line warnings instead of dumping a whole traceback, since they really are just warnings. They indicate that the database (ZEO) is slow, that's all.
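
To illustrate the suggestion, here is a rough sketch of a retry loop that logs one warning line per retry instead of a traceback. It is not repoze.retry's actual code: `ConflictError` is a local stand-in class, `call_with_retry` is a hypothetical helper, and a real implementation would also abort the transaction before retrying.

```python
import logging

logger = logging.getLogger("retry_sketch")


class ConflictError(Exception):
    """Stand-in for ZODB's ConflictError so the sketch has no dependencies."""


def call_with_retry(func, tries=3):
    """Call func(), making at most `tries` attempts on ConflictError."""
    for attempt in range(1, tries + 1):
        try:
            return func()
        except ConflictError as exc:
            if attempt == tries:
                raise  # out of attempts: let the caller see the error
            # One line per retry instead of a full traceback.
            logger.warning("conflict, retrying, count = %d: %s", attempt, exc)
```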

I doubt the conflicts are related to any lockups.

Shane Hathaway (shane-hathawaymix) wrote :

I have added some new pages for forensics:

  https://karl.soros.org/zodbinfo.html
  https://karl.soros.org/people/zodbinfo.html

These show the current state of ZODB connections. When Karl starts to use a lot of CPU and gets backlogged, we should take a snapshot of those two pages before restarting.
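
Taking the snapshot could be scripted along these lines. This is only a sketch: it assumes the two pages can be fetched without logging in, and the helper names and file naming scheme are made up for illustration.

```python
import time
import urllib.request

# The two forensic pages named above.
PAGES = [
    "https://karl.soros.org/zodbinfo.html",
    "https://karl.soros.org/people/zodbinfo.html",
]


def snapshot_name(url, when):
    """Build a file name from the URL path and a time.struct_time."""
    path = url.split("://", 1)[-1].split("/", 1)[-1]
    stamp = time.strftime("%Y%m%dT%H%M%S", when)
    return "%s-%s" % (stamp, path.replace("/", "_"))


def save_snapshots(urls=PAGES, timeout=10):
    """Fetch each page and write it to a timestamped file in the current directory."""
    now = time.gmtime()
    names = []
    for url in urls:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
        name = snapshot_name(url, now)
        with open(name, "wb") as out:
            out.write(body)
        names.append(name)
    return names
```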

We should also monitor how much RAM the processes are consuming at all times. The most likely cause of lockups is thrashing, which is caused by using up too much RAM. We can reduce the RAM consumption just by reducing the ZODB cache sizes.
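
For the RAM monitoring, a minimal sketch that reads a process's resident set size. It assumes the server runs Linux, where `/proc/<pid>/status` contains a `VmRSS` line in kB; the function names are hypothetical.

```python
def parse_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def rss_kb(pid):
    """Read the resident set size of a process (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return parse_rss_kb(f.read())
```

Logging `rss_kb()` for each httpd process at regular intervals would show whether memory grows toward the point where the machine starts to swap.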

Paul Everitt (paul-agendaless) wrote : Re: [Bug 382457] Re: Investigate lockups and conflict errors

Good analysis, thanks!

Feel free to take any steps you want to take. Log the action in a
comment on this issue, I'll see it, and if I disagree I'll let you know.

E.g.:

1) Edit the ini file and decrease the cache sizes, then check in the
changes, then restart gracefully.

2) Make Six Feet Up get the memory information into their monitoring,
and test a trap that reports problems.

....or whatever.

--Paul

On Jun 1, 2009, at 4:07 PM, Shane Hathaway wrote:

> I have added some new pages for forensics:
>
> https://karl.soros.org/zodbinfo.html
> https://karl.soros.org/people/zodbinfo.html
>
> These show the current state of ZODB connections. When Karl starts to
> use a lot of CPU and gets backlogged, we should take a snapshot of
> those two pages before restarting.
>
> We should also monitor how much RAM the processes are consuming at all
> times. The most likely cause of lockups is thrashing, which is caused
> by using up too much RAM. We can reduce the RAM consumption just by
> reducing the ZODB cache sizes.

Paul Everitt (paul-agendaless) wrote :

I'm going to mark this as resolved. We'll re-open if needed, but I don't think we have an actionable next step.

Changed in karl3:
status: New → Fix Released