celeryd 100% cpu when large dhcpd lease file
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
| MAAS |
Critical
|
Gavin Panella | ||
| 1.5 |
Critical
|
Gavin Panella | ||
| Trunk |
Critical
|
Gavin Panella | ||
| maas (Ubuntu) |
Undecided
|
Unassigned | ||
| Trusty |
Undecided
|
Unassigned |
Bug Description
maas-cluster-celery service eats CPU time, as one or more celeryd processes consumes 100% of the cpu. This looks to be due to the dhcpd.lease file being very large, as the time it takes to complete the upload_dhcp_leases file is excessive.
[2014-04-09 14:05:28,405: INFO/MainProcess] Task provisioningser
ubuntu@
total 18M
-rw-r--r-- 1 root root 17M Apr 9 14:06 dhcpd.leases
-rw-r--r-- 1 root root 366K Apr 3 19:39 dhcpd.leases~
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
34283 maas 20 0 1646232 1.464g 1896 R 99.9 18.9 12:15.58 celeryd
34284 maas 20 0 1970464 1.773g 1896 R 99.9 22.9 14:49.22 celeryd
34259 maas 20 0 140572 17980 1912 R 99.5 0.2 3:32.11 celeryd
36722 maas 20 0 1592204 1.414g 2444 R 99.5 18.2 4:25.09 celeryd
37773 maas 20 0 591952 482116 2380 R 99.2 5.9 0:43.97 celeryd
This was noticed in an environment that is running 4 parallel deployments with 3 to 4 LXC containers per deployment in a constant loop, 24/7.
[Impact]
* More efficient lease parser to work on larger deployments -- previously would be CPU bound in these cases and sometimes causing timeouts
[Testcase]
* Install maas; ensure new lease parser works by detecting nodes ip addresses correctly for the maas managed dhcp provider
[Regression Potential]
* Large at first blush, but this has had lots of testing; both from unittests as well as maasCI and Openstack Interoperability Lab
Related branches
- Raphaël Badin (community): Approve on 2014-05-06
-
Diff: 773 lines (+407/-212)2 files modifiedsrc/provisioningserver/dhcp/leases_parser_fast.py (+87/-0)
src/provisioningserver/dhcp/tests/test_leases_parser.py (+320/-212)
- Gavin Panella (community): Approve on 2014-05-19
-
Diff: 12 lines (+1/-1)1 file modifiedsrc/provisioningserver/dhcp/leases.py (+1/-1)
- Gavin Panella (community): Approve on 2014-05-19
-
Diff: 773 lines (+407/-212)2 files modifiedsrc/provisioningserver/dhcp/leases_parser_fast.py (+87/-0)
src/provisioningserver/dhcp/tests/test_leases_parser.py (+320/-212)
- Gavin Panella (community): Approve on 2014-05-19
-
Diff: 12 lines (+1/-1)1 file modifiedsrc/provisioningserver/dhcp/leases.py (+1/-1)
description: | updated |
Blake Rouse (blake-rouse) wrote : | #1 |
Gavin Panella (allenap) wrote : | #2 |
This is related to bug 1153077, and is probably a dupe of it.
Blake Rouse (blake-rouse) wrote : | #3 |
I think it is some what related, but in this case the queue was actually not filling up, it was doing it just fast enough to not cause that issue. But each celeryd process was consuming larges amounts of cpu time.
Raphaël Badin (rvb) wrote : | #4 |
I think Gavin is right: they are related. It's just that since bug 1153077, we set an expiration date of the tasks so they don't really pile up if they can't be processed in time. But the task in question is fired every minute; if it takes ~147.9s to complete, it will take hold of all celery's workers pretty quickly.
Raphaël Badin (rvb) wrote : | #5 |
I'm testing something completely unrelated and, in /var/log/syslog, I have this:
Apr 25 11:49:15 3d8eb172-
The message seems to indicate that the dhcp server is failing to "rotate" its lease file. That might explain why we end of with huge lease files.
Julian Edwards (julian-edwards) wrote : | #6 |
Can you check the containing directory's permissions? Remember that in the precise → raring transition we went from root-owned to dhcpd-owned directories as the dhcpd now drops permissions as needed.
Raphaël Badin (rvb) wrote : | #7 |
rvb@maas:~$ ls -la /var/lib/maas/dhcp/
total 16
drwxr-xr-x 2 root root 4096 Apr 30 10:01 .
drwxr-xr-x 5 maas maas 4096 Apr 29 18:38 ..
-rw-r--r-- 1 root root 1228 Apr 30 14:50 dhcpd.leases
-rw-r--r-- 1 root root 817 Apr 30 09:58 dhcpd.leases~
rvb@maas:~$ ls -la /var/lib/maas/
total 52
drwxr-xr-x 5 maas maas 4096 Apr 29 18:38 .
drwxr-xr-x 46 root root 4096 Apr 29 18:32 ..
drwxr-xr-x 5 maas maas 4096 Apr 30 11:31 boot-resources
-rw-r--r-- 1 maas maas 16384 Apr 30 16:08 celerybeat-
-rw-r--r-- 1 maas maas 12288 Apr 30 10:01 celerybeat-
drwxr-xr-x 2 root root 4096 Apr 30 10:01 dhcp
-rw-r--r-- 1 root root 4 Apr 29 18:38 dhcpd-interfaces
drwxr-xr-x 2 maas maas 4096 Apr 29 18:34 media
Raphaël Badin (rvb) wrote : | #8 |
Btw, if you have a look at the dhcpd.leases file posted on a related bug (https:/
Changed in maas: | |
status: | Triaged → In Progress |
assignee: | nobody → Gavin Panella (allenap) |
Julian Edwards (julian-edwards) wrote : Re: [Bug 1305102] Re: celeryd 100% cpu when large dhcpd lease file | #9 |
On 01/05/14 02:13, Raphaël Badin wrote:
> Btw, if you have a look at the dhcpd.leases file posted on a related bug
> (https:/
> you'll see that, although the file weights 28M, there are only 48
> different IP<->MAC mappings in it. i.e. a *lot* of repetitions.
>
Apparently DHCPD re-writes the leases file every hour and that's a
compiled-in default.
I guess they didn't expect external code to be scanning it!
Julian Edwards (julian-edwards) wrote : | #10 |
My DHCPD on trusty seems to be rotating lease files fine, there is no permission error anywhere.
My directory permissions are the same as rvb's above.
ubuntu@
total 60
drwxr-xr-x 7 maas maas 4096 Mar 28 14:30 .
drwxr-xr-x 61 root root 4096 Jan 31 12:22 ..
drwxr-xr-x 14 root root 4096 May 7 16:03 boot-resources
-rw-r--r-- 1 maas maas 16384 May 8 11:29 celerybeat-
-rw-r--r-- 1 maas maas 12288 May 8 10:20 celerybeat-
drwxr-xr-x 2 root root 4096 May 8 10:19 dhcp
-rw-r--r-- 1 root root 4 Feb 13 13:17 dhcpd-interfaces
drwxr-xr-x 7 maas maas 4096 Feb 13 18:16 ephemeral
drwxr-xr-x 3 maas maas 4096 Jan 31 12:31 media
drwxr-xr-x 4 maas maas 4096 Mar 28 14:31 tftp
ubuntu@
total 124
drwxr-xr-x 2 root root 4096 May 8 10:19 .
drwxr-xr-x 7 maas maas 4096 Mar 28 14:30 ..
-rw-r--r-- 1 root root 3316 May 8 10:19 dhcpd.leases
Greg Lutostanski (lutostag) wrote : | #11 |
branch which fixes this and other 1.5.2 release bugs for SRU is linked to bug #1325038
description: | updated |
Changed in maas (Ubuntu): | |
status: | New → Fix Released |
Hello Blake, or anyone else affected,
Accepted maas into trusty-proposed. The package will build now and be available at http://
Please help us by testing this new package. See https:/
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-
Further information regarding the verification process can be found at https:/
Changed in maas (Ubuntu Trusty): | |
status: | New → Fix Committed |
tags: | added: verification-needed |
Jason Hobbs (jason-hobbs) wrote : | #13 |
I tested an upgrade from 1.5.1 - upgraded, started an existing node, recommissioned an existing node, deleted the node, then enlisted/
tags: |
added: verification-done removed: verification-needed |
Stéphane Graber (stgraber) wrote : | #14 |
Hello Blake, or anyone else affected,
Accepted maas into trusty-proposed. The package will build now and be available at http://
Please help us by testing this new package. See https:/
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-
Further information regarding the verification process can be found at https:/
tags: | removed: verification-done |
tags: | added: verification-needed |
tags: |
added: verification-done removed: verification-needed |
Launchpad Janitor (janitor) wrote : | #15 |
This bug was fixed in the package maas - 1.5.2+bzr2282-
---------------
maas (1.5.2+
* debian/control:
- Add missing dependency in maas-cluster-
(LP: #1328231)
- Move dependency from maas-cluster-
for python-
maas (1.5.2+
* New upstream release:
- Remove workaround for fixed Django bug 1311433 (LP: #1311433)
- Ensure that validation errors are returned when adding a node over
the API and its cluster controller is not contactable. (LP: #1305061)
- Hardware enablement support for PowerKVM (LP: #1325038)
- Shorten the time taken for a cluster to initially connect to the region
via RPC to around 2 seconds (LP: #1317682)
- Faster DHCP leases parser (LP: #1305102)
- Documentation fixed explaining how to enable an ephemeral backdoor
(LP: #1321696)
- Use probe-and-
a libvirt machine, allow password qemu+ssh connections.
(LP: #1315155, LP: #1315157)
- Rename ppc64el boot loader to PowerKVM (LP: #1315154)
- Fix NodeForm's is_valid() method so that it uses Django's way of setting
errors on forms instead of putting text in self.errors[
(LP: #1301465)
- Change BootMethods to return their own IReader per-request, update method
names to reflect new usage. (LP: #1315154)
- Return early and stop the DHCP server when the list of managed interfaces
of the nodegroup is empty. (LP: #1324944)
- Fix invalid attribute references in the VirshSSH class. Added more test
for the VirshSSH class. (LP: #1324966)
* debian/control:
- Add missing dependency in maas-cluster-
(LP: #1322151)
-- Greg Lutostanski <email address hidden> Fri, 20 Jun 2014 10:10:47 -0500
Changed in maas (Ubuntu Trusty): | |
status: | Fix Committed → Fix Released |
The verification of the Stable Release Update for maas has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
Number of unique entries in the dhcpd.lease file.
ubuntu@ maas-trusty: ~$ grep hardware /var/lib/ maas/dhcp/ dhcpd.leases | awk '{print $3}' | sort | uniq | wc -l
3931