Including binaries when copying pkgs with lots of binaries oopses

Bug #447138 reported by Michael Nelson on 2009-10-09
This bug affects 1 person
Affects: Launchpad itself
Importance: Medium
Assigned to: Michael Nelson

Bug Description

OOPS-1378C1323

Nicolas experienced the above oops while trying to copy openoffice.org with binaries included, and mentioned that someone in the security team also experienced this recently.

It seems that we create and (re)fetch each BinaryPackagePublishingHistory record individually - I'm hoping we can instead batch this to reduce both the number of queries being executed (~150 repetitions of many single queries in the above oops) and the time spent instantiating and populating the Storm object representations.

Changed in soyuz:
status: New → Triaged
importance: Undecided → Medium
tags: added: oops
tags: added: tech-debt
tags: added: oem-services
Changed in soyuz:
assignee: nobody → Michael Nelson (michael.nelson)
status: Triaged → In Progress
milestone: none → 3.1.10
Michael Nelson (michael.nelson) wrote :

In terms of a pre-implementation discussion, here's what I'm planning:

1. Look at adding a PublishingSet.copyBinariesTo(binaries, series, pocket, archive) that ensures certain queries are not evaluated for each binary. For example, the biggest timesaver according to the oops is from the PPA component override to main - we look up the 'main' component before creating each binary.

2. The next worst offender is the repeated inserts - can we batch these with Storm? Would there be much of a benefit? - Hrm, no, it seems storm.store.Store only has a singular add method. Not much to do here afaics.

3. The third worst offender is a select on SecureBinaryPackagePublishingHistory - and I'm not sure where this is coming from. At first I thought it was an artifact of the SBPPH insert, but there's another query listed for that. This one only selects 3 fields on the SBPPH.

4. The fourth is the query to get the corresponding BPPH - at the end of PublishingSet.newBinaryPublication() we have BinaryPackagePublishingHistory.get(pub.id). This could also be a single query to get all the corresponding BPPHs in one hit if we have PS.copyBinariesTo().

5. Refactor SPPH.getBuiltBinaries() getting rid of the code that currently iterates the results to get the BPPH for unique BPRs. We could instead return a result set of unique (BPPH, BPR) tuples directly from the database as the callsites all use the related BPRs anyway. (I'm assuming this is contributing to the non-sql time).
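A minimal sketch of the idea behind (1) and (4), not the Launchpad implementation: it uses an in-memory SQLite database with a made-up two-table schema, and every name in it (component, publication, copy_one_at_a_time, copy_batched, CountingConnection) is hypothetical. It only illustrates hoisting the 'main' component lookup out of the per-binary loop and replacing the per-binary re-fetch with a single query.

```python
import sqlite3


def make_db():
    """Create a throwaway database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE component (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE publication (
            id INTEGER PRIMARY KEY, binary_name TEXT, component INTEGER);
        INSERT INTO component (name) VALUES ('main'), ('universe');
    """)
    return conn


class CountingConnection:
    """Wrap a connection and count the statements sent through it."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def execute(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params)


LOOKUP = "SELECT id FROM component WHERE name = 'main'"
INSERT = "INSERT INTO publication (binary_name, component) VALUES (?, ?)"


def copy_one_at_a_time(db, binaries):
    """Three statements per binary: lookup, insert, re-fetch."""
    pubs = []
    for name in binaries:
        (main_id,) = db.execute(LOOKUP).fetchone()
        cur = db.execute(INSERT, (name, main_id))
        row = db.execute(
            "SELECT id, binary_name FROM publication WHERE id = ?",
            (cur.lastrowid,)).fetchone()
        pubs.append(row)
    return pubs


def copy_batched(db, binaries):
    """One lookup, one insert per binary, one fetch for all new rows."""
    (main_id,) = db.execute(LOOKUP).fetchone()
    new_ids = []
    for name in binaries:
        cur = db.execute(INSERT, (name, main_id))
        new_ids.append(cur.lastrowid)
    if not new_ids:
        return []
    placeholders = ", ".join("?" * len(new_ids))
    return db.execute(
        "SELECT id, binary_name FROM publication "
        "WHERE id IN (%s) ORDER BY id" % placeholders,
        new_ids).fetchall()
```

For 150 binaries the first function issues 450 statements and the second 152. The inserts stay one per row, matching the observation in (2) that there is no obvious way to batch them.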

Changed in soyuz:
status: In Progress → Triaged
Kees Cook (kees) wrote :

I no longer see OOPSes; it's just failing with really large packages (linux-source-2.6.15 on dapper, linux on hardy).

Failed to sync: HTTP Error 503: Service Unavailable
{'status': '503', 'content-length': '7319', 'via': '1.1 wildcard.edge.launchpad.net', 'x-powered-by': 'Zope (www.zope.org), Python (www.python.org)', 'server': 'zope.server.http (HTTP)', 'retry-after': '900', 'connection': 'close', 'date': 'Wed, 21 Oct 2009 21:03:37 GMT', 'content-type': 'text/html;charset=utf-8'}

Failed to sync: HTTP Error 502: Bad Gateway
{'status': '502', 'content-length': '921', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.8 (Ubuntu) mod_ssl/2.2.8 OpenSSL/0.9.8g', 'last-modified': 'Wed, 23 Sep 2009 21:37:39 GMT', 'etag': '"24eaac-399-4744586254ec0"', 'date': 'Wed, 21 Oct 2009 20:57:53 GMT', 'content-type': 'text/html'}

retrying and retrying....

Kees Cook (kees) wrote :

(fails around 19 seconds, I assume this is the LP max-query-time-timeout)

Julian Edwards (julian-edwards) wrote :

Kees, if it fails without an oops it indicates that something's not right with an appserver. I think we had some trouble yesterday with them, is this still happening?

If you get this sort of thing again your best action is to find a LOSA.

Michael Nelson (michael.nelson) wrote :

The attached branch deals with (1) and (4) above. We can do a separate branch dealing with (5) too if/when needed.
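For reference, the refactoring described in (5) could look roughly like the sketch below. It is an illustration under assumptions, not Launchpad code: the bpph and bpr tables, their columns, and both function names are hypothetical, and the rule "keep the lowest-numbered BPPH for each BPR" is assumed for the example. The first function deduplicates in Python while iterating; the second lets the database return the unique (BPPH, BPR) rows directly.

```python
import sqlite3


def make_db():
    """Hypothetical schema: several publications can share one release."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bpr (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE bpph (
            id INTEGER PRIMARY KEY, bpr INTEGER REFERENCES bpr(id), arch TEXT);
        INSERT INTO bpr VALUES (1, 'foo-common'), (2, 'foo-bin');
        INSERT INTO bpph VALUES
            (10, 1, 'i386'), (11, 1, 'amd64'), (12, 2, 'i386');
    """)
    return conn


def built_binaries_in_python(conn):
    """Fetch every row, then skip releases that were already seen."""
    seen = set()
    result = []
    rows = conn.execute(
        "SELECT bpph.id, bpr.id, bpr.name "
        "FROM bpph JOIN bpr ON bpr.id = bpph.bpr ORDER BY bpph.id")
    for bpph_id, bpr_id, name in rows:
        if bpr_id in seen:
            continue
        seen.add(bpr_id)
        result.append((bpph_id, bpr_id, name))
    return result


def built_binaries_in_sql(conn):
    """Let the database pick one publication per release."""
    return conn.execute(
        "SELECT MIN(bpph.id), bpr.id, bpr.name "
        "FROM bpph JOIN bpr ON bpr.id = bpph.bpr "
        "GROUP BY bpr.id, bpr.name "
        "ORDER BY MIN(bpph.id)").fetchall()
```

Both functions return the same rows for the sample data; the second avoids transferring and instantiating the duplicates, which is where the non-SQL time in (5) is assumed to go.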

Changed in soyuz:
status: Triaged → Fix Committed
Michael Nelson (michael.nelson) wrote :

Tested on dogfood before landing:

Setup:
 * disabled redirect after post so that query stats on page are from the
   actual package copy + page generation.

Via UI copied openoffice.org - 1:3.1.0-3ubuntu2~jaunty1 from
   https://dogfood.launchpad.net/~openoffice-pkgs/+archive/ppa/+copy-packages
    At least 2073 queries issued in 11.09 seconds

Patched and re-copied (after deleting original copy):
    At least 1172 queries issued in 9.23 seconds
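
To put the two measurements side by side, the short calculation below works out the relative reduction. It uses only the figures quoted above; since the page reports "At least N queries", the results are approximate. The helper name percent_reduction is made up for this example.

```python
def percent_reduction(before, after):
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


# Figures from the dogfood test above.
query_drop = percent_reduction(2073, 1172)   # about 43.5 percent
time_drop = percent_reduction(11.09, 9.23)   # about 16.8 percent

print("queries: %.1f%% fewer, time: %.1f%% less" % (query_drop, time_drop))
```

The query count falls by a much larger fraction than the elapsed time, which suggests the removed queries were individually cheap and that a good share of the remaining time is spent elsewhere.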

Changed in soyuz:
status: Fix Committed → Fix Released
Kees Cook (kees) wrote :

This is not fixed for me...

 Publishing linux-source-2.6.15 2.6.15-55.81 to dapper-security ...
Failed to sync (took 18s): HTTP Error 503: Service Unavailable
{'status': '503', 'content-length': '7410', 'via': '1.1 wildcard.edge.launchpad.net', 'x-powered-by': 'Zope (www.zope.org), Python (www.python.org)', 'server': 'zope.server.http (HTTP)', 'retry-after': '900', 'connection': 'close', 'date': 'Fri, 04 Dec 2009 17:45:16 GMT', 'content-type': 'text/html;charset=utf-8'}
Locating linux ...
 Publishing linux 2.6.24-26.64 to hardy-security ...
Failed to sync (took 20s): HTTP Error 503: Service Unavailable
{'status': '503', 'content-length': '7409', 'via': '1.1 wildcard.edge.launchpad.net', 'x-powered-by': 'Zope (www.zope.org), Python (www.python.org)', 'server': 'zope.server.http (HTTP)', 'retry-after': '900', 'connection': 'close', 'date': 'Fri, 04 Dec 2009 17:45:52 GMT', 'content-type': 'text/html;charset=utf-8'}

Kees Cook (kees) wrote :

Should I open a new bug report?

Hi Kees, yes please - linking back to this one. We didn't implement all the possible improvements with this fix, but the details of those other improvements are listed above.

Nicolas Valcarcel (nvalcarcel) wrote :

I hit it again:
Publishing openoffice.org 1:2.4.1-1ubuntu2.2 to hardy-security ...
Failed to sync (took 5s): HTTP Error 400: Bad Request
{'status': '400', 'content-length': '88', 'via': '1.1 wildcard.launchpad.net', 'x-powered-by': 'Zope (www.zope.org), Python (www.python.org)', 'server': 'zope.server.http (HTTP)', 'connection': 'close', 'date': 'Tue, 02 Feb 2010 21:08:01 GMT', 'content-type': 'text/plain'}

Kees: did you report this new bug?

Nicolas Valcarcel (nvalcarcel) wrote :

It seems that it is a different issue: the workaround isn't working anymore. I used to change include_binaries=True to include_binaries=False in syncSource and was able to get the package synced, but that no longer works.

William Grant (wgrant) wrote :

A 400 is not a timeout -- this is a rather different issue. If you can catch the exception and check the 'content' attribute, you will see a description of the issue.
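
A small sketch of that suggestion, under stated assumptions: the HTTPError class and the sync_source function below are local stand-ins written for this example so that it runs on its own; they are not the client library's code. The only things taken from this thread are that the error carries 'response' and 'content' attributes and the message text reported further down.

```python
class HTTPError(Exception):
    """Stand-in error with the two attributes discussed in this thread."""

    def __init__(self, response, content):
        super().__init__(response, content)
        self.response = response  # headers, e.g. {'status': '400'}
        self.content = content    # body with the server's explanation


def sync_source(**kwargs):
    """Hypothetical stand-in for the failing copy call."""
    raise HTTPError(
        {'status': '400'},
        b"openoffice.org 1:2.4.1-1ubuntu2.2 in hardy "
        b"(binaries conflicting with the existing ones)")


def describe_failure(call, **kwargs):
    """Run call; on HTTPError return 'status: body', else None."""
    try:
        call(**kwargs)
    except HTTPError as error:
        body = error.content
        if isinstance(body, bytes):
            body = body.decode("utf-8", "replace")
        return "%s: %s" % (error.response.get("status"), body)
    return None


print(describe_failure(sync_source, include_binaries=True))
```

Printing the body alongside the status distinguishes a request the server rejected (400) from a timeout (502/503), which the headers alone do not explain.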

Nicolas Valcarcel (nvalcarcel) wrote :

What I just posted is the HTTPError.response. Should I print HTTPError.content?

Nicolas Valcarcel (nvalcarcel) wrote :

Just printed it and got:
openoffice.org 1:2.4.1-1ubuntu2.2 in hardy (binaries conflicting with the existing ones)
But I don't have openoffice.org in the p3a I'm copying to.

The package must have existed there once and then been deleted. You can't resurrect the same file names like that, as clients may have downloaded the old files and will be confused by the new ones.

Kees Cook (kees) wrote :

I've opened bug 526645 for the timeouts the security team is still seeing.
