celeryd 100% cpu when large dhcpd lease file

Bug #1305102 reported by Blake Rouse on 2014-04-09
24
This bug affects 5 people
Affects Status Importance Assigned to Milestone
MAAS
Status tracked in Trunk
1.5
Critical
Gavin Panella
Trunk
Critical
Gavin Panella
maas (Ubuntu)
Undecided
Unassigned
Trusty
Undecided
Unassigned

Bug Description

maas-cluster-celery service eats CPU time, as one or more celeryd processes consumes 100% of the cpu. This looks to be due to the dhcpd.lease file being very large, as the time it takes to complete the upload_dhcp_leases file is excessive.

[2014-04-09 14:05:28,405: INFO/MainProcess] Task provisioningserver.tasks.upload_dhcp_leases[f8470913-fff7-49a6-a651-0e2b45901cbb] succeeded in 147.969564829s: None

ubuntu@maas-trusty:~$ ls -lh /var/lib/maas/dhcp
total 18M
-rw-r--r-- 1 root root 17M Apr 9 14:06 dhcpd.leases
-rw-r--r-- 1 root root 366K Apr 3 19:39 dhcpd.leases~

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
34283 maas 20 0 1646232 1.464g 1896 R 99.9 18.9 12:15.58 celeryd
34284 maas 20 0 1970464 1.773g 1896 R 99.9 22.9 14:49.22 celeryd
34259 maas 20 0 140572 17980 1912 R 99.5 0.2 3:32.11 celeryd
36722 maas 20 0 1592204 1.414g 2444 R 99.5 18.2 4:25.09 celeryd
37773 maas 20 0 591952 482116 2380 R 99.2 5.9 0:43.97 celeryd

This was noticed in an environment that is running 4 parallel deployments with 3 to 4 LXC containers per deployment in a constant loop, 24/7.

[Impact]
 * More efficient lease parser to work on larger deployments -- previously would be CPU bound in these cases and sometimes causing timeouts

[Testcase]
 * Install maas; ensure new lease parser works by detecting nodes ip addresses correctly for the maas managed dhcp provider

[Regression Potential]
 * Large at first blush, but this has had lots of testing; both from unittests as well as maasCI and Openstack Interoperability Lab

description: updated
Blake Rouse (blake-rouse) wrote :

Number of unique entries in the dhcpd.lease file.

ubuntu@maas-trusty:~$ grep hardware /var/lib/maas/dhcp/dhcpd.leases | awk '{print $3}' | sort | uniq | wc -l
3931

Gavin Panella (allenap) wrote :

This is related to bug 1153077, and is probably a dupe of it.

Blake Rouse (blake-rouse) wrote :

I think it is some what related, but in this case the queue was actually not filling up, it was doing it just fast enough to not cause that issue. But each celeryd process was consuming larges amounts of cpu time.

Raphaël Badin (rvb) wrote :

I think Gavin is right: they are related. It's just that since bug 1153077, we set an expiration date of the tasks so they don't really pile up if they can't be processed in time. But the task in question is fired every minute; if it takes ~147.9s to complete, it will take hold of all celery's workers pretty quickly.

Raphaël Badin (rvb) wrote :

I'm testing something completely unrelated and, in /var/log/syslog, I have this:

Apr 25 11:49:15 3d8eb172-cc4f-11e3-a506-9c4e363b1c94 dhcpd: Can't create new lease file: Permission denied

The message seems to indicate that the dhcp server is failing to "rotate" its lease file. That might explain why we end of with huge lease files.

Julian Edwards (julian-edwards) wrote :

Can you check the containing directory's permissions? Remember that in the precise → raring transition we went from root-owned to dhcpd-owned directories as the dhcpd now drops permissions as needed.

Raphaël Badin (rvb) wrote :

rvb@maas:~$ ls -la /var/lib/maas/dhcp/
total 16
drwxr-xr-x 2 root root 4096 Apr 30 10:01 .
drwxr-xr-x 5 maas maas 4096 Apr 29 18:38 ..
-rw-r--r-- 1 root root 1228 Apr 30 14:50 dhcpd.leases
-rw-r--r-- 1 root root 817 Apr 30 09:58 dhcpd.leases~

rvb@maas:~$ ls -la /var/lib/maas/
total 52
drwxr-xr-x 5 maas maas 4096 Apr 29 18:38 .
drwxr-xr-x 46 root root 4096 Apr 29 18:32 ..
drwxr-xr-x 5 maas maas 4096 Apr 30 11:31 boot-resources
-rw-r--r-- 1 maas maas 16384 Apr 30 16:08 celerybeat-cluster-schedule
-rw-r--r-- 1 maas maas 12288 Apr 30 10:01 celerybeat-region-schedule
drwxr-xr-x 2 root root 4096 Apr 30 10:01 dhcp
-rw-r--r-- 1 root root 4 Apr 29 18:38 dhcpd-interfaces
drwxr-xr-x 2 maas maas 4096 Apr 29 18:34 media

Raphaël Badin (rvb) wrote :

Btw, if you have a look at the dhcpd.leases file posted on a related bug (https://bugs.launchpad.net/maas/+bug/1153077/+attachment/3689992/+files/dhcpd.leases.gz), you'll see that, although the file weights 28M, there are only 48 different IP<->MAC mappings in it. i.e. a *lot* of repetitions.

Gavin Panella (allenap) on 2014-04-30
Changed in maas:
status: Triaged → In Progress
assignee: nobody → Gavin Panella (allenap)

On 01/05/14 02:13, Raphaël Badin wrote:
> Btw, if you have a look at the dhcpd.leases file posted on a related bug
> (https://bugs.launchpad.net/maas/+bug/1153077/+attachment/3689992/+files/dhcpd.leases.gz),
> you'll see that, although the file weights 28M, there are only 48
> different IP<->MAC mappings in it. i.e. a *lot* of repetitions.
>

Apparently DHCPD re-writes the leases file every hour and that's a
compiled-in default.

I guess they didn't expect external code to be scanning it!

My DHCPD on trusty seems to be rotating lease files fine, there is no permission error anywhere.

My directory permissions are the same as rvb's above.

ubuntu@maas:/var/lib/maas$ ls -la
total 60
drwxr-xr-x 7 maas maas 4096 Mar 28 14:30 .
drwxr-xr-x 61 root root 4096 Jan 31 12:22 ..
drwxr-xr-x 14 root root 4096 May 7 16:03 boot-resources
-rw-r--r-- 1 maas maas 16384 May 8 11:29 celerybeat-cluster-schedule
-rw-r--r-- 1 maas maas 12288 May 8 10:20 celerybeat-region-schedule
drwxr-xr-x 2 root root 4096 May 8 10:19 dhcp
-rw-r--r-- 1 root root 4 Feb 13 13:17 dhcpd-interfaces
drwxr-xr-x 7 maas maas 4096 Feb 13 18:16 ephemeral
drwxr-xr-x 3 maas maas 4096 Jan 31 12:31 media
drwxr-xr-x 4 maas maas 4096 Mar 28 14:31 tftp
ubuntu@maas:/var/lib/maas$ ls -la dhcp
total 124
drwxr-xr-x 2 root root 4096 May 8 10:19 .
drwxr-xr-x 7 maas maas 4096 Mar 28 14:30 ..
-rw-r--r-- 1 root root 3316 May 8 10:19 dhcpd.leases

Greg Lutostanski (lutostag) wrote :

branch which fixes this and other 1.5.2 release bugs for SRU is linked to bug #1325038

description: updated
Changed in maas (Ubuntu):
status: New → Fix Released

Hello Blake, or anyone else affected,

Accepted maas into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/maas/1.5.2+bzr2282-0ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in maas (Ubuntu Trusty):
status: New → Fix Committed
tags: added: verification-needed
Jason Hobbs (jason-hobbs) wrote :

I tested an upgrade from 1.5.1 - upgraded, started an existing node, recommissioned an existing node, deleted the node, then enlisted/commissioned/installed to it without issue.

tags: added: verification-done
removed: verification-needed
Stéphane Graber (stgraber) wrote :

Hello Blake, or anyone else affected,

Accepted maas into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/maas/1.5.2+bzr2282-0ubuntu0.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: removed: verification-done
tags: added: verification-needed
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package maas - 1.5.2+bzr2282-0ubuntu0.2

---------------
maas (1.5.2+bzr2282-0ubuntu0.2) trusty-proposed; urgency=medium

  * debian/control:
   - Add missing dependency in maas-cluster-controller for grub-common
     (LP: #1328231)
   - Move dependency from maas-cluster-controller to maas-provisioningserver
     for python-seamicroclient (LP: #1332532)

maas (1.5.2+bzr2282-0ubuntu0.1) trusty-proposed; urgency=medium

  * New upstream release:
    - Remove workaround for fixed Django bug 1311433 (LP: #1311433)
    - Ensure that validation errors are returned when adding a node over
      the API and its cluster controller is not contactable. (LP: #1305061)
    - Hardware enablement support for PowerKVM (LP: #1325038)
    - Shorten the time taken for a cluster to initially connect to the region
      via RPC to around 2 seconds (LP: #1317682)
    - Faster DHCP leases parser (LP: #1305102)
    - Documentation fixed explaining how to enable an ephemeral backdoor
      (LP: #1321696)
    - Use probe-and-enlist-hardware to enlist all virtual machine inside
      a libvirt machine, allow password qemu+ssh connections.
      (LP: #1315155, LP: #1315157)
    - Rename ppc64el boot loader to PowerKVM (LP: #1315154)
    - Fix NodeForm's is_valid() method so that it uses Django's way of setting
      errors on forms instead of putting text in self.errors['architecture']
      (LP: #1301465)
    - Change BootMethods to return their own IReader per-request, update method
      names to reflect new usage. (LP: #1315154)
    - Return early and stop the DHCP server when the list of managed interfaces
      of the nodegroup is empty. (LP: #1324944)
    - Fix invalid attribute references in the VirshSSH class. Added more test
      for the VirshSSH class. (LP: #1324966)
  * debian/control:
    - Add missing dependency in maas-cluster-controller for python-pexpect
      (LP: #1322151)
 -- Greg Lutostanski <email address hidden> Fri, 20 Jun 2014 10:10:47 -0500

Changed in maas (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for maas has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers