celery branch scanners' memory usage keeps growing

Bug #1017754 reported by Haw Loeung
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Hi,

As per https://pastebin.canonical.com/68835/, it seems the celery branch scanner workers' memory usage continues to grow. The LP incident logs show that it was restarted once on the 22nd. lifeless suggests that this is a regression, as the previous branch scanners had memory caps in place.

https://pastebin.canonical.com/68836/ shows the current limits of one of the celery worker processes. Note that 'Max resident set' is unlimited.

Could you please look into this?

Thanks,

Haw
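(Editorial note: the pastebin isn't reproduced here, but the check it shows amounts to reading the worker's /proc/<pid>/limits. A minimal sketch, assuming a Linux host; the PID is supplied on the command line and is purely illustrative:)

#!/usr/bin/env python
# Sketch: inspect a process's memory-related limits by parsing /proc/<pid>/limits.
# Lines look like: "Max resident set   unlimited   unlimited   bytes"
import sys

def print_memory_limits(pid):
    with open('/proc/%d/limits' % pid) as limits_file:
        for line in limits_file:
            if line.startswith('Max resident set') or line.startswith('Max address space'):
                print(line.rstrip())

if __name__ == '__main__':
    print_memory_limits(int(sys.argv[1]))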

Haw Loeung (hloeung)
tags: added: canonical-losa-lp
Revision history for this message
Haw Loeung (hloeung) wrote :

11:07 <hloeung> right, and where is it done for the existing scan_branches.py?
11:07 <hloeung> I've tried grepping for 'ulimit' in the whole source tree
11:09 <wgrant> hloeung: Hahaha
11:09 <wgrant> It's in a wrapper
11:09 <wgrant> I'm pretty sure LP doesn't do it
11:09 <wgrant> Ah no
11:09 <wgrant> There we are
11:10 <wgrant> JobRunnerProcess.runJobCommand
11:10 <wgrant> if self.job_source.memory_limit is not None:
11:10 <wgrant> soft_limit, hard_limit = getrlimit(RLIMIT_AS)
11:10 <wgrant> if soft_limit != self.job_source.memory_limit:
11:10 <wgrant> limits = (self.job_source.memory_limit, hard_limit)
11:10 <wgrant> setrlimit(RLIMIT_AS, limits)
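(Editorial note: the logic wgrant quotes from JobRunnerProcess.runJobCommand boils down to the following pattern. This is a minimal standalone sketch, not LP's actual code; the 1 GiB figure is illustrative, not what LP configures:)

from resource import RLIMIT_AS, getrlimit, setrlimit

def cap_address_space(memory_limit):
    # Apply an address-space cap in the style of JobRunnerProcess.runJobCommand:
    # set the soft RLIMIT_AS to memory_limit (bytes) while keeping the existing
    # hard limit, and do nothing if no limit is configured or it already matches.
    if memory_limit is not None:
        soft_limit, hard_limit = getrlimit(RLIMIT_AS)
        if soft_limit != memory_limit:
            setrlimit(RLIMIT_AS, (memory_limit, hard_limit))

# Illustrative usage: cap the current process at 1 GiB of address space.
cap_address_space(1024 * 1024 * 1024)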

Changed in launchpad:
status: New → Triaged
importance: Undecided → Critical
Curtis Hovey (sinzui)
tags: added: celeryd
Revision history for this message
Haw Loeung (hloeung) wrote :

Still happening.

Before:

hloeung@ackee:~$ top
top - 08:04:32 up 112 days, 2:58, 1 user, load average: 2.15, 2.45, 2.41
Tasks: 240 total, 2 running, 238 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.8%us, 0.3%sy, 5.6%ni, 87.7%id, 0.0%wa, 0.0%hi, 0.5%si, 0.0%st
Mem: 6112648k total, 4666020k used, 1446628k free, 16620k buffers
Swap: 2964472k total, 1391676k used, 1572796k free, 225752k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12526 bzrsyncd 20 0 1693m 1.2g 3384 S 0 21.3 20:12.90 [celeryd@ackee:
12528 bzrsyncd 20 0 1754m 862m 3420 S 0 14.5 17:24.91 [celeryd@ackee:
15598 launchpa 20 0 991m 641m 3300 R 39 10.7 17:03.14 python2.6
12527 bzrsyncd 20 0 813m 358m 3424 S 0 6.0 17:03.34 [celeryd@ackee:
25404 launchpa 36 16 626m 278m 9588 S 46 4.7 6:12.32 python2.6
28488 launchpa 20 0 608m 189m 9632 S 0 3.2 0:12.74 python2.6
13642 launchpa 20 0 608m 182m 3308 S 0 3.1 0:13.05 python2.6
13130 rabbitmq 20 0 471m 174m 1296 S 0 2.9 248:42.16 beam.smp
12484 bzrsyncd 20 0 418m 15m 2092 S 0 0.3 0:48.97 [celeryd@ackee:
 1544 launchpa 20 0 162m 10m 1116 S 0 0.2 94:26.06 txlongpoll: acc
15876 bzrsyncd 20 0 317m 9.9m 1944 S 0 0.2 0:03.09 [celerybeat] --
21252 launchpa 20 0 646m 7476 2040 S 0 0.1 13:46.76 python2.6
12503 bzrsyncd 20 0 346m 6164 1940 S 0 0.1 0:34.49 [celeryd@ackee:
30661 hloeung 20 0 29164 4748 2172 S 0 0.1 0:00.22 bash

After restarting bzrsyncd celeryd:

top - 08:08:38 up 112 days, 3:02, 1 user, load average: 2.26, 2.47, 2.43
Tasks: 228 total, 1 running, 227 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.1%us, 0.1%sy, 5.5%ni, 87.7%id, 0.2%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 6112648k total, 2719780k used, 3392868k free, 21220k buffers
Swap: 2964472k total, 491552k used, 2472920k free, 233112k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15598 launchpa 20 0 991m 641m 3300 S 47 10.7 18:46.45 python2.6
25404 launchpa 36 16 626m 277m 9588 S 44 4.6 7:44.93 python2.6
31006 bzrsyncd 20 0 502m 200m 5852 S 0 3.4 0:07.68 [celeryd@ackee:
31008 bzrsyncd 20 0 501m 200m 5696 S 0 3.4 0:07.96 [celeryd@ackee:
31007 bzrsyncd 20 0 499m 197m 5792 S 0 3.3 0:06.45 [celeryd@ackee:
30715 laun...


Revision history for this message
Colin Watson (cjwatson) wrote :

I'm not sure exactly when this was fixed, or whether it has just become much less noticeable now that the scripts unit has more memory, but https://grafana.admin.canonical.com/d/000000044/telegraf-host?orgId=1&var-juju_controller=All&var-juju_model=All&var-service=launchpad-scripts&var-juju_unit=All&var-host=All&var-mountpoint=All&from=now-30d&to=now&viewPanel=4 looks flat enough to suggest that this isn't a problem in practice any more. I'm therefore going to close this bug.

Changed in launchpad:
status: Triaged → Fix Released