Retracing is way too slow

Reported by Evan Dandrea on 2012-04-04
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Evan Dandrea
apport (Ubuntu)

Bug Description

We're currently only able to process about three core files per minute with each retracer. We can fire up more retracers and drop some core files at random when we're approaching a high load, both of which mitigate this problem to varying degrees. However, we should also address the problem that our current retracing code is taking far too long to process each core file:

[4:40pm] jjo: ev: basically afaics we are losing the producer|consumer rate battle
[4:40pm] mthaddon: ev: basically the rabbitmq queues are fairly consistently increasing in size, but the server itself is very lightly loaded - should we be firing up more retracers, or is there some way of making existing retracers do more work?
[4:40pm] jjo: ev: while finfolk doesnt get squeezed for load
[4:41pm] ev: the retracers process in serial, so bringing up more of them would be advisable
[4:41pm] jjo: ev: FTR I had fired up to ~7 retracers in parallel , which hover'd loadavg~=8 -> got ~3 oopsen per min
[4:41pm] jjo: ev: but still I couldn't manage to get more than ~3/min

Evan Dandrea (ev) on 2012-04-04
Changed in whoopsie-daisy (Ubuntu):
assignee: nobody → Evan Dandrea (ev)
importance: Undecided → High
status: New → Confirmed
Evan Dandrea (ev) wrote :
Download full text (4.4 KiB)

jjo@finfolk:~$ while printf "[%(%c)T] %s\n" -1 "rabbitmq: $(sudo rabbitmqctl list_queues|egrep retrace|xargs );n_retracers: $(pgrep -fl process_core.py|wc -l);loadavg: $(cat /proc/loadavg)";do sleep 60 || break;done
[Wed 04 Apr 2012 16:53:24 UTC] rabbitmq: retrace_amd64 1121 retrace_i386 2408;n_retracers: 1;loadavg: 1.12 0.99 0.92 2/253 26741
[Wed 04 Apr 2012 16:54:24 UTC] rabbitmq: retrace_amd64 1123 retrace_i386 2408;n_retracers: 1;loadavg: 1.06 0.99 0.92 2/239 28154
[Wed 04 Apr 2012 16:55:24 UTC] rabbitmq: retrace_amd64 1125 retrace_i386 2408;n_retracers: 1;loadavg: 0.91 0.96 0.92 2/243 29786
[Wed 04 Apr 2012 16:56:25 UTC] rabbitmq: retrace_amd64 1125 retrace_i386 2409;n_retracers: 1;loadavg: 0.94 0.96 0.92 2/249 31243
[Wed 04 Apr 2012 16:57:25 UTC] rabbitmq: retrace_amd64 1124 retrace_i386 2409;n_retracers: 1;loadavg: 0.82 0.92 0.91 2/241 847
[Wed 04 Apr 2012 16:58:25 UTC] rabbitmq: retrace_amd64 1124 retrace_i386 2409;n_retracers: 1;loadavg: 0.69 0.87 0.89 2/245 3728
[Wed 04 Apr 2012 16:59:25 UTC] rabbitmq: retrace_amd64 1123 retrace_i386 2411;n_retracers: 1;loadavg: 0.79 0.86 0.89 2/246 6178
[Wed 04 Apr 2012 17:00:25 UTC] rabbitmq: retrace_amd64 1125 retrace_i386 2411;n_retracers: 1;loadavg: 0.87 0.87 0.89 2/247 8308
[Wed 04 Apr 2012 17:01:25 UTC] rabbitmq: retrace_amd64 1125 retrace_i386 2413;n_retracers: 1;loadavg: 0.88 0.87 0.88 1/251 10864
[Wed 04 Apr 2012 17:02:25 UTC] rabbitmq: retrace_amd64 1127 retrace_i386 2413;n_retracers: 1;loadavg: 0.67 0.81 0.86 2/246 12749
[Wed 04 Apr 2012 17:03:26 UTC] rabbitmq: retrace_amd64 1127 retrace_i386 2415;n_retracers: 1;loadavg: 0.67 0.80 0.85 2/249 13979
[Wed 04 Apr 2012 17:04:26 UTC] rabbitmq: retrace_amd64 1129 retrace_i386 2418;n_retracers: 1;loadavg: 0.64 0.77 0.85 2/247 15419
[Wed 04 Apr 2012 17:05:26 UTC] rabbitmq: retrace_amd64 1130 retrace_i386 2421;n_retracers: 1;loadavg: 0.91 0.83 0.86 2/245 15668
[Wed 04 Apr 2012 17:06:26 UTC] rabbitmq: retrace_amd64 1134 retrace_i386 2422;n_retracers: 1;loadavg: 0.81 0.81 0.86 2/245 18508
[Wed 04 Apr 2012 17:07:30 UTC] rabbitmq: retrace_amd64 1133 retrace_i386 2422;n_retracers: 1;loadavg: 1.09 0.87 0.87 2/243 20935
[Wed 04 Apr 2012 17:08:30 UTC] rabbitmq: retrace_amd64 1135 retrace_i386 2425;n_retracers: 1;loadavg: 1.31 0.95 0.90 3/247 23503
[Wed 04 Apr 2012 17:09:30 UTC] rabbitmq: retrace_amd64 1136 retrace_i386 2427;n_retracers: 1;loadavg: 1.19 0.99 0.91 3/248 24136
[Wed 04 Apr 2012 17:10:30 UTC] rabbitmq: retrace_amd64 1139 retrace_i386 2428;n_retracers: 1;loadavg: 0.96 0.97 0.91 2/247 25994
[Wed 04 Apr 2012 17:11:31 UTC] rabbitmq: retrace_amd64 1141 retrace_i386 2428;n_retracers: 1;loadavg: 1.05 1.00 0.93 2/247 26802
[Wed 04 Apr 2012 17:12:31 UTC] rabbitmq: retrace_amd64 1142 retrace_i386 2430;n_retracers: 1;loadavg: 0.82 0.94 0.92 2/246 29829
[Wed 04 Apr 2012 17:13:31 UTC] rabbitmq: retrace_amd64 1143 retrace_i386 2433;n_retracers: 1;loadavg: 0.93 0.95 0.92 2/251 32591
[Wed 04 Apr 2012 17:14:31 UTC] rabbitmq: retrace_amd64 1143 retrace_i386 2433;n_retracers: 1;loadavg: 0.90 0.94 0.92 2/247 2283
[Wed 04 Apr 2012 17:15:31 UTC] rabbitmq: retrace_amd64 1144 retrace_i386 2434;n_retracers: 1;loadavg: 1.18 1.02 0.95 4/254 5365
[Wed 04 Apr 2012 17:16:3...


Evan Dandrea (ev) wrote :

James Troup has said that he does not want to drop core dumps from the queue as a solution to this.

Tom Haddon (mthaddon) on 2012-04-10
tags: added: canonical-webops-eng
Evan Dandrea (ev) wrote :

Part of the problem was that we were including the CoreDump field when writing back the retraced report from apport-retrace, then removing it from the dictionary before we wrote back to Cassandra.

We now call apport-retrace with the -c flag, which drops the CoreDump field before writing out the file.

Evan Dandrea (ev) wrote :

I followed up with James this afternoon. He doesn't want to drop core dumps from the queue at random as the first solution to this, but he's okay with a proper RED-like solution if we ultimately need one.

He also suggested that a cache of extracted debs would be a good idea. I had tried this over the weekend and ultimately avoided it because I was concerned about packages with Replaces fields. However, he pointed out that's largely an academic exercise. Library packages wont be using replaces on the paths to the libraries or their debugging symbols.

Martin Pitt (pitti) wrote :

Merged Ev's caching fix into lp:apport r2272, thanks!

Changed in apport (Ubuntu):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package apport - 2.0.1-0ubuntu1

apport (2.0.1-0ubuntu1) precise; urgency=low

  * New upstream bug fix release. Changes since our previous snapshot:
    - problem_report.py, write_mime(): Fix regression from version 1.95: Add a
      value as attachment if it is bigger than 1000 bytes, not if it is bigger
      than 100. (LP: #977882)
    - packaging-apt-dpkg.py: Avoid constructing and updating the apt.Cache()
      objects multiple times, to speed up retracing. Thanks Evan Dandrea.
      (LP: #973494)
 -- Martin Pitt <email address hidden> Tue, 10 Apr 2012 15:38:11 +0200

Changed in apport (Ubuntu):
status: Fix Committed → Fix Released
Evan Dandrea (ev) wrote :

I've proposed the following merge to apport which brings retracing time down to 8.5 seconds in my sample:

Evan Dandrea (ev) wrote :

We'll need to ensure that the sandboxes created with --sandbox-dir are per-release:

[2:56pm] pitti: one scenario where this would fail is:
[2:56pm] pitti: - retrace a bug on package foo version 1
[2:56pm] pitti: - retrace the next bug with foo version 2
[2:56pm] pitti: (unpacks v2 over v1)
[2:56pm] pitti: - retrace the next bug with foo version 1
[2:56pm] pitti: -> does not change any file, as both versions are considered unpacked
[2:56pm] pitti: so you'd gdb against v2 when you want v1
[2:56pm] pitti: now, we do not actually support his yet
[2:57pm] pitti: as apport always takes the latest available version of a package
[2:57pm] pitti: but you will at least need to take care to have a per-release sandbox

Tom Haddon (mthaddon) wrote :

Even after doubling the number of retracers on finfolk (2x2 -> 2x4) we're still not keeping up with the queue: https://pastebin.canonical.com/65246/

Evan Dandrea (ev) wrote :

One way in which we can improve this is by keeping very large caches. Currently, the caches are purged every so often as we run out of disk space. We can reduce the amount of disk space we need by moving to a copy-on-write approach using aufs.

Changed in whoopsie-daisy (Ubuntu):
status: Confirmed → Invalid
no longer affects: whoopsie-daisy (Ubuntu)
Changed in daisy:
assignee: nobody → Evan Dandrea (ev)
importance: Undecided → High
status: New → Confirmed
Evan Dandrea (ev) wrote :

Is this still a problem? Last I spoke to Tom, I vaguely recall him saying he did not think we needed to add an aufs layer to the cache to speed things up (by letting us keep the caches longer) and keep space down.

Evan Dandrea (ev) wrote :

I'm marking this as fixed. While we have some work to do on keeping disk usage down on the retracers (https://wiki.ubuntu.com/ErrorTracker/ReducingRetracerDiskUsage), they're fast enough to process through large queues.

Changed in daisy:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers