Queue accepts occasionally time out due to huge non-SQL time in createMissingBuilds

Bug #1062638 reported by William Grant
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
In Progress
Critical
Colin Watson

Bug Description

Accepting a source upload (using either DistroSeries:+queue or PackageUpload:EntryResource:acceptFromQueue) can time out with several non-SQL seconds near the determineArchitecturesToBuild call in SourcePackagePublishingHistory.createMissingBuilds. This has only been seen on the Ubuntu upload of libreoffice 1:3.6.2~rc2-0ubuntu1, from which we have three OOPSes before a retry finally succeeded:

OOPS-4b29d695266f2054311ae5f37329e638
OOPS-bc3d500f772ea9e03cb98de14e919500
OOPS-eddb1503e8421232ad7713cdada76c44

All show the same thing. A several second gap between

  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/queue.py", line 635, in acceptFromQueue
    self._acceptNonSyncFromQueue(logger, dry_run)
  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/queue.py", line 619, in _acceptNonSyncFromQueue
    builds = pub_source.createMissingBuilds()
  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/publishing.py", line 632, in createMissingBuilds
    architectures_available)
  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/publishing.py", line 619, in _getAllowedArchitectures
    self.archive.enabled_restricted_families]
  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/archive.py", line 2029, in _getEnabledRestrictedFamilies
    family for (family, archive_arch) in families

and

  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/queue.py", line 635, in acceptFromQueue
    self._acceptNonSyncFromQueue(logger, dry_run)
  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/queue.py", line 619, in _acceptNonSyncFromQueue
    builds = pub_source.createMissingBuilds()
  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/publishing.py", line 640, in createMissingBuilds
    arch, logger=logger)
  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/publishing.py", line 653, in _createMissingBuildForArchitecture
    arch, self.archive)
  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/sourcepackagerelease.py", line 434, in getBuildByArch
    distroarchseries.distroseries, archive)
  File "/srv/launchpad.net/production/launchpad-rev-16061/lib/lp/soyuz/model/sourcepackagerelease.py", line 416, in findBuildsByArchitecture
    results = list(query.config(distinct=True))

This happens on appservers, so there's no P-a-s involved. determineArchitecturesToBuild is pretty cheap without P-a-s, so I have to wonder if the the dpkg-architecture subprocess call is giving us trouble (perhaps letting a big xmlrpc-private request take a lock?).

Related branches

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 1062638] [NEW] Queue accepts occasionally time out due to huge non-SQL time in createMissingBuilds

Add a subprocess hook to the python-timeline package? Be a good way to
capture info.

Revision history for this message
Colin Watson (cjwatson) wrote :

Seen again today: OOPS-f4add482b262238f71098f71ccb60cf4

Revision history for this message
Colin Watson (cjwatson) wrote :

This was also with libreoffice, incidentally: version 1:4.1.2~rc3-0ubuntu1. While this does look related to the dpkg-architecture call, I can't see why that would be correlated with the package being accepted, unless perhaps our memory use scales badly with the size of the package being accepted and thus we end up in temporary swapdeath?

Revision history for this message
Colin Watson (cjwatson) wrote :

And in this case it succeeded with only one retry, FWIW.

William Grant (wgrant)
Changed in launchpad:
status: Triaged → Fix Released
Revision history for this message
William Grant (wgrant) wrote :

Seen today with numactl: OOPS-41b1649e773ba687dfde15af44698a0b

Changed in launchpad:
status: Fix Released → Triaged
Revision history for this message
Andy Whitcroft (apw) wrote :

Seen today with makedumpfile: OOPS-66137fc88664e056d55f90b0f334ada8 and OOPS-c5fc1fb18279dd94441302867879725b

Accepted correctly on 3rd attempt.

Revision history for this message
Colin Watson (cjwatson) wrote :

OOPS-ea4157b48705edeeaf8eb9c3334db9b5 seen today: this is a particularly pathological case with lots of architectures.

Revision history for this message
Colin Watson (cjwatson) wrote :

I think the best answer to all this would be to use the code from https://bugs.debian.org/771058 via a python-debian backport, once it's merged.

Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
Changed in launchpad:
assignee: nobody → Colin Watson (cjwatson)
tags: added: qa-needstesting
Changed in launchpad:
status: Triaged → In Progress
Colin Watson (cjwatson)
tags: added: qa-ok
removed: qa-needstesting
Monika (monikagmail.com)
Changed in launchpad:
status: In Progress → Confirmed
Monika (monikagmail.com)
Changed in launchpad:
status: Confirmed → Fix Released
Colin Watson (cjwatson)
Changed in launchpad:
status: Fix Released → In Progress
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.