very slow requests sometimes reported as soft timeouts

Bug #623047 reported by Robert Collins
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Triaged
Low
Unassigned

Bug Description

https://lp-oops.canonical.com/oops.py/?oopsid=1695L1219

SQL time: 1871 ms
Non-sql time: 43355 ms
Total time: 45226 ms
Statement Count: 490

should have been reported as a timeout but
  SoftRequestTimeout: <security proxied zope.browserpage.metaconfigure.SimpleViewClass from /srv/launchpad.net/production/launchpad-rev-9649/lib/lp/bugs/browser/../templates/bug-create-question.pt instance at 0x2b0227e22a10>

it was reported as a soft timeout - so something didn't catch and cancel it appropriately.

Revision history for this message
Gary Poster (gary) wrote :

Our timeout code is implemented as a hook-point on SQL calls. Since the vast majority of the time in this request was non-sql, it's not surprising that this (and other similar calls, like the wadl generation, bug 607691) was not caught by the timeout code. You can see the break at the end of the SQL log:

486. 5408 4ms launchpad-main-master UPDATE Bug SET heat=calculate_bug_heat(622129), heat_last_updated=CURRENT_TIMESTAMP AT TIME ZONE 'UTC' WHERE Bug.id = %s
487. 45213 2ms session UPDATE SessionData SET last_accessed = CURRENT_TIMESTAMP WHERE client_id = %s AND last_accessed < CURRENT_TIMESTAMP - '540 seconds'::interval

Between 5408 milliseconds and 45213 milliseconds, the database was not accessed. It is mildly curious that there are a few database accesses after 45213, but they are all in the session database: I would not be surprised to learn that the session database is handled differently. In any case, it was not the cause of the problem.

My expectation is that making hard timeouts for non-SQL situations like this will be hard. Python doesn't let you send signals across threads, and apparently for good reason. The profiler hook is much too heavy of a hammer, AFAIK. We don't have an obvious thing that every process will check in on frequently *other* than the database. The component architecture and...something in the template machinery might help a bit.

This is probably a dupe. I should probably try to find it. In any case, for now, because there's not a clear plan of attack, I'm marking this low.

Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → Low
Revision history for this message
Robert Collins (lifeless) wrote :

See also https://bugs.edge.launchpad.net/launchpad-foundations/+bug/627940 which while focused on a different thing will make it possible to check timeouts at other points.

That said, we'll break private bug attachments and other things completely if we fix this universally without the per-pageid timeout control.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.