Timeout when converting bug into question (BugTask:+create-question)

Bug #438116 reported by Adam B Butler on 2009-09-28
72
This bug affects 7 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Critical
Curtis Hovey

Bug Description

*** pg8.4 *** OOPS-1759ED1387

Note that while *this* one is a soft timeout, we've seen hard timeouts too, and the fact that the session is being updated correctly sftrongly suffests that the appserver thread state is being preserved for the whole time,which is undesirable.

https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1713N810

Branch: launchpad-rev-9760
Revno: 9760
SQL time: 3065 ms
Non-sql time: 29668 ms
Total time: 32733 ms
Statement Count: 676

https://bugs.launchpad.net/ubuntu/+source/vdpau-video/+bug/633809/+create-question

This send 115 emails, and by subtracting the time between the last two events and dividing, its ~ 200ms per email, +- say 30.

emails are buffered in memory and then spooled once the DB commits (so that we don't mail twice in the event of automatic retry.

672. 6042 1ms SQL-launchpad-main-master SELECT DistributionSourcePackage.bug_count, DistributionSourcePackage.bug_reported_acknowledgement, DistributionSourcePackage.bug_reporting_guidelines, DistributionSourcePackage.distribution, DistributionSourcePackage.id, DistributionSourcePackage.is_upstream_link_allowed, DistributionSourcePackage.max_bug_heat, DistributionSourcePackage.po_message_count, DistributionSourcePackage.sourcepackagename, DistributionSourcePackage.total_bug_heat FROM DistributionSourcePackage WHERE DistributionSourcePackage.sourcepackagename = %s AND DistributionSourcePackage.distribution = %s
673. 32722 1ms SQL-session UPDATE SessionData SET last_accessed = CURRENT_TIMESTAMP WHERE client_id = %s AND last_accessed < CURRENT_TIMESTAMP - '540 seconds'::interval

There are a few things that may need looking at here:
 - why is it taking 200ms to queue a mail
 - why isn't the offload-to-thread stuff working?

There are a couple of things highlighted by this too:
 - we don't see the commit in the oops details
 - we're not getting a hard timeout and we should

In terms of code, the mail sending code is
lp.bugs.browser.bugtask.BugTaskCreateQuestionView
which is bug.convertToQuestion
which does notify(ObjectModifiedEvent)

(Based on looking at where the sends occur relative to the question being made in the action trace.

Related branches

Jonathan Gossage (jgossage) wrote :

I have also experienced this problem attempting to turn https://bugs.edge.launchpad.net/ubuntu/+bug/254130/+create-question into a question. It timed out with Error ID: OOPS-1403ED797. I am also using the "bugs.edge.launchpad.net" server and I accessed the site with Firefox 3.5.4. I attempted to do this 4 times over a 3 hour period without success. The first 3 attempts included a comment and the last did not (to see whether it was part of the problem). This is also my first bug to question conversion attempt.

Is there any workaround so that the handling of the bug I was trying to convert can be completed?

Jonathan Gossage (jgossage) wrote :

I have found that trying to convert bug 254130 into a question times out whenever I attempt to do it. It looks as there may be something in the structure of the bug report that is causing problems.

Jonathan Gossage (jgossage) wrote :

This morning I asked another user on IRC #ubuntu-bugs to convert bug 254130 for me and it worked. So the problem seems to be specific to me, perhaps because I was subscribed to the bug.

Curtis Hovey (sinzui) wrote :

I suspect the solution is to send all the messages as a single message to the subscribers, or to send only the first and last messages.

affects: launchpad → malone
Changed in malone:
importance: Undecided → High
status: New → Triaged
tags: added: oops
Reinhard Tartler (siretart) wrote :

I experience this problem as well while trying to convert Bug #573805 to a question.

Error ID: OOPS-1591H619

Micah Gersten (micahg) wrote :

More recent oops on converting to question: OOPS-1676EB594
https://bugs.edge.launchpad.net/ubuntu/+source/firefox-3.0/+bug/512479/+create-question

Robert Collins (lifeless) wrote :

https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1676EB594

SQL time: 4731 ms
Non-sql time: 9587 ms
Total time: 14318 ms
Statement Count: 586

the high sql statement count is a concern - there is significant overhead in issueing lots of queries.

summary: Timeout Error (OOPS-1367EC560) When Converting Bug into Question
+ (BugTask:+create-question)
Ursula Junque (ursinha) on 2010-08-16
tags: added: timeout
removed: oops
summary: - Timeout Error (OOPS-1367EC560) When Converting Bug into Question
- (BugTask:+create-question)
+ Timeout when converting bug into question (BugTask:+create-question)
description: updated
Robert Collins (lifeless) wrote :

Appears to be something going wrong/odd in deferred-email-sending.

affects: malone → launchpad-foundations
description: updated
Gary Poster (gary) wrote :

The email delivery is explicitly set to *not* be deferred. In out production configs, there is a mail-configure-normal.zcml file with this content:

    <!--
    <mail:queuedDelivery
        permission="zope.SendMail" mailer="smtp"
        queuePath="/srv/launchpad.net/mail_queue/production1"
        />
    -->

    <mail:directDelivery permission="zope.SendMail" mailer="smtp" />

That change is from 2007 ("Stuart Bishop <email address hidden>: Switch to direct delivery until queued email delivery queue directory issues are sorted").

The "queued email delivery queue directory issues" are not clear to me yet. I could not find a related bug for Launchpad or zope.sendmail. I'll see if Stuart can provide me with some background.

I'm also inclined to make a separate bug about exploring queued email delivery for Foundations, and reassigning this end-user bug to Malone, but I won't do it without someone agreeing that it it makes sense.

I don't mind if you reassign, but it would be to answers, not to bugs;
and there isn't any great effort going into answers at the moment ;(.

Stuart Bishop (stub) wrote :

We switched to direct delivery as our local SMTP server was fast enough, and it avoided a problem we saw with queued. We were creating emails that where undeliverable, such as a bad To: address - a real world SMTP server has many more checks than the simple ones in Launchpad and the Zope infrastructure. The SMTP server refused to attempt delivery, and they stayed in the queue indefinitely until someone manually cleared themout. The dud messages were retried each and every time the delivery thread kicked in - the queued delivery was not smart enough to remove emails with permanent failures.

Robert Collins (lifeless) wrote :

We're seeing about 260ms for delivery, If I remember the stats
correctly, which is very slow :(. (100 messages in 26 seconds).

tags: added: pg83
tags: removed: pg83
description: updated
Gary Poster (gary) wrote :

Benji brought up the fact that a local MTA was a superior solution to this problem in his experience (as opposed to writing/fixing Python mail queue software). I'm exploring whether this is already in place, or has already been tested and rejected, or is problematic for some other reason.

Gary Poster (gary) wrote :

Chex reported that Forster is the shared MTA for our production instances. I therefore filed RT 42164 to ask whether a local MTA or some other commodity solution to improving latency would be reasonable and acceptable to the LOSAs.

Tom Haddon (mthaddon) wrote :

Forster is not the shared MTA for our production instances - it's the mailman server for Launchpad. Each production app server runs it's own MTA.

Robert Collins (lifeless) wrote :
Download full text (3.5 KiB)

Some great discussion about this from our ops folk:

06:31 < elmo> lifeless: I do categorically think what I said in the RT, namely that sending mail in-request is crazy town
06:32 < elmo> lifeless: and while i'm OK with us doing some level of investigation about the current MTA performance, I don't think 230ms is all that bad
06:32 < elmo> lifeless: and fundamentally if you guys want to continue to send 100 mails and not time out, I think it's got to be made async from the request
06:33 < lifeless> elmo: so, yes, agreed.
06:33 < gary> We do have a planned story for that, on the basis of what elmo wrote on the RT--basically, back to the original plan, in fact.
06:33 < lifeless> elmo: working on all angles at once.
06:34 < lifeless> elmo: OTOH I don't want to reinvent postfix as a module in LP.
06:34 < lifeless> elmo: putting mails-to-send into the DB and actually spooling them elsewhere is equally nuts. We should definitely make them async.
06:34 < gary> yeah, that was basically the original argument.
06:35 < lifeless> putting a 'please notify about this <action>' statement into a work queue makes a lot of sense to me
06:36 < elmo> so I don't know enough about postfix, but with exim, there is certainly a way of using sendmail and saying "just queue this, do not do anything that might slow me down (e.g. any network stuff)"
06:36 < elmo> OTOH, even using that seems risky if you're going to have a 100+ people being mailed and a 1-2s timeout
06:36 < lifeless> elmo: what are we running on the appservers ?
06:36 < elmo> postfix
06:37 < lifeless> k
06:38 < gary> sending the emails is ideally transactionally aware. I don't know if we do that now.
06:38 < gary> I think we do.
06:38 < gary> peripheral to this discussion; only pertinent if we have to write something fresh
06:39 < lamont> if you're mailing with /usr/sbin/sendmail from postfix, it literally just drops a file in a directory and returns
06:39 < lamont> of course, CPU gets used elsewhere after that
06:39 < lifeless> lamont: we're using SMTP
06:40 < lifeless> hmm, perhaps we should be using the submission port - it might have different rules.
06:40 * lifeless wags
06:42 < lamont> like the good little MTA that it is, postfix puts the CPU intensive stuff squarely on port 25, so that spammers get slowed down too, not just him overloaded.
06:42 < lifeless> lamont: does it treat port 587 differently?
06:42 < lamont> submission port tends to get the same daemon, but with lighter rules in some ways
06:42 < lamont> and, of course, many things are tunable wrt what it checks in which instance of the smtpd process
06:43 < lifeless> lamont: if I filed an rt saying 'please make this as light as possible for lp on lp appservers', would that be a difficult task? or is 'very light' a well known thing ?
06:44 < lamont> do you connect to 127.0.0.1, or to the host's 91.189 iP?
06:44 < lifeless> localhost
06:44 < lifeless> (literally, thats whats configured)
06:44 < lamont> it's generally going to be ugly any way you come at it
06:44 < lifeless> so it will depend on /etc/hosts
06:45 < lamont> life would be simpler if you just used /usr/sbin/sendmail - even the interface is simpler
06:45 < lamont> why re-inven...

Read more...

Curtis Hovey (sinzui) wrote :

Answers does not have deferred email sending. I think the problem here is that we are entering into the question workflow with many messages where it was designed to do exactly one at a time.

Hmm, this really makes the conversion impossible for large projects with many subscribers, because all new bugs come with their giant list of followers, thus triggering the timeout 100% of the time.

Is there any workaround for this until a solution is implemented?

Robert Collins (lifeless) wrote :

On Thu, Nov 11, 2010 at 12:12 PM, Curtis Hovey
<email address hidden> wrote:
> Answers does not have deferred email sending.

The standard zope machinery does deferred sending - mails are not sent
when requested, they are sent when the transaction ends. (This is
still in-appserver code though).

 I think the problem here
> is that we are entering into the question workflow with many messages
> where it was designed to do exactly one at a time.

Do you mean 'we are sending >1 message to many recipients'? That would
add insult to injury to the problem.

@Olivier - sorry, no workaround that we're aware of, but solving this
is very important to us.

Curtis Hovey (sinzui) on 2010-11-14
affects: launchpad-foundations → launchpad-answers
Curtis Hovey (sinzui) wrote :

I retargeted this question to answers because I believe we are looking at a silly implementation issue in answers, not a code issue in bugs or foundations.

The method to create the question is creating new messages for each message on the bug. This could be 20 messages. Does an answer contact want to read all 20 messages? Can the person make sense of this since the issue first messages imply a bug? The process could copy just the first and last messages, or create a single message with the entire conversation. Do we really need to create separate messages--if I were writing this today (I wrote this code), I would try to make BugMessage and QuestionMessage share the same IMessage.

Each message is generating a list of subscribers. Surely the subscribers are the same for all messages.

Deryck Hodge (deryck) wrote :

I've added back a bug task so we don't loose this bug for malone hackers. Is anyone working on this further? If not, I'd like to devote a bugs team guy, since it's blocking Ubuntu from one of their primary bug triage work flows.

Changed in malone:
status: New → Triaged
importance: Undecided → High
Curtis Hovey (sinzui) wrote :

Edwin landed schema changes recently (in production now) that will let us queue the task to building and sending emails. I still think (from a usability prespective and as an answer contact), that needed to send onyl one message to each contact, sending a bug conversation in several messages is confusing because it is not clear until the last one that the bug is really a question. As an answer contact, I really want to know what the question is first. I can read the previous conversation only if I want to clarify a point.

Robert Collins (lifeless) wrote :

> us queue the task to building and sending emails. I still think (from a
> usability prespective and as an answer contact), that needed to send

Also from a performance perspective: just because things are queued,
they still can't do longs transactions or take lots of time ;)

Changed in launchpad:
importance: High → Critical

Not that I want to put pressure on anyone, but this is really annoying and getting worse as time passes, because projects keep growing, with more subscribers to all bugs, etc. ;-)
BTW, shouldn't this bug be confirmed at some point, or is 'Triaged' further in the process than 'Confirmed', in the 'launchpad' project?

Francis J. Lacoste (flacoste) wrote :

Olivier, yes, 'Triaged' means that not only is it Confirmed, but a developer looked at it and a priority has been assigned.

Changed in launchpad:
status: Triaged → Fix Released
status: Fix Released → Triaged
Curtis Hovey (sinzui) wrote :

There are two principle issues described in this bug: sending multiple emails to users, and sending them in proc. The later issue is also the cause of bug 608037, bug 618390, and bug 618385. Fixing this bug will only be about address the unpredictable number of messages. We will solve the other bugs by creating a new mechanism to queue question changes so that messages can be sent a few minutes later.

Changed in launchpad:
assignee: nobody → Curtis Hovey (sinzui)
status: Triaged → In Progress
Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r12903 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/12903) by a commit, but not testable.

Changed in launchpad:
milestone: none → 11.05
tags: added: qa-untestable
Changed in launchpad:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) on 2011-04-22
Changed in launchpad:
status: Fix Committed → In Progress
tags: removed: qa-untestable
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) on 2011-04-23
tags: added: qa-ok
removed: qa-needstesting
Changed in launchpad:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers