Node controller apache2 dependency satisfied by an incompatible mpm
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
eucalyptus (Ubuntu) | Fix Released | Medium | Dustin Kirkland | |
Bug Description
This is the log from the node controller. It clearly sees that it is supposed to be spawning an instance, and at least starts grabbing the images, but almost immediately just rm -rf's them.
It transpires that this is due to the images being incomplete in Walrus (the disk ran out of space while they were being uploaded). This suggests that some internal verification is taking place and rejecting the images. The NC should really log this fact and, if possible, send a message back to the CC so it knows the instance setup failed (at the moment the instance shows as pending until some timeout is reached, after which it is shown as terminated).
[Log excerpt not preserved: 10 lines from process 006011 followed by 11 lines from process 006037, all timestamped Thu Mar 5 14:49:10 2009.]
Chris Jones (cmsj) wrote : | #1 |
description: | updated |
There is no error message about the failed transfer because, according to both log excerpts above, the NC crashed in the middle of the transfer. (The line that starts with `NC is looking for configuration...` is the first one that NC prints when it starts.) Since any incoming request from the CC causes NC to restart, the crashes are easy to miss.
So, this bug is not really about the lack of error messages (we do try to log everything that goes wrong) but about a crash. If this crash still happens with the recent version of the code, I would like to know more about the circumstances so that I can recreate it. I tried a few runs with NC running out of disk space during different stages of instance initialization, with both compressed and non-compressed Walrus downloads, but in all cases the failure to write was handled properly.
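Since a restart is easy to miss by eye, the start-up marker described above can be searched for mechanically. The following is a minimal sketch, assuming the `[timestamp][pid] [LEVEL ] message` line layout seen in the NC log excerpts in this report; the function name `find_nc_starts` and the sample lines are illustrative only and are not part of Eucalyptus.

```python
import re

# Assumed layout of one NC log line: [timestamp][pid] [LEVEL ] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<pid>\d+)\]\s*\[(?P<level>\w+)\s*\]\s*(?P<msg>.*)$"
)

# First message the NC prints when it starts, per the comment above.
START_MARKER = "NC is looking for configuration"


def find_nc_starts(lines):
    """Return (timestamp, pid) for the first start-up marker of each process."""
    starts = []
    seen_pids = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not in the assumed layout
        pid = match.group("pid")
        if match.group("msg").startswith(START_MARKER) and pid not in seen_pids:
            seen_pids.add(pid)
            starts.append((match.group("ts"), pid))
    return starts


# Hypothetical three-line sample in the same layout as the excerpts.
sample = [
    "[Fri Mar 6 14:52:08 2009][004767] [EUCADEBUG ] walrus_request(): writing GET output",
    "[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] NC is looking for configuration",
    "[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] NC is looking for configuration",
]

for timestamp, pid in find_nc_starts(sample):
    print(f"NC start-up by pid {pid} at {timestamp}")
```

A start-up marker from a new pid in the middle of an instance launch is the pattern to look for; it does not by itself say whether the previous process crashed or a second one was spawned alongside it.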
I'm less sure now that this relates to disk space. Due to unrelated hardware issues I've moved the NC to the same machine as the CC/CLC and this is still happening. I've nuked all of the images I had, switched from xen to kvm and uploaded the ttylinux image and the machine has >50GB spare, so this should have worked correctly (and the file in /var/lib/
Still, when I try to run an instance, the image is immediately removed and the NC appears to restart.
What more debugging information can I provide so we can figure out what's going on?
Daniel Nurmi (nurmi) wrote : | #4 |
I have discovered the reason behind this behavior. The node controller (an axis2 web service that is loaded and run by an apache2 thread) can only run as a single thread. This means that apache2 must be configured to only start one child (or thread) to run the NC. While we set up the NC's httpd.conf to limit the number of threads that apache will spawn to 1, there is one apache2 flavor that doesn't honor this configuration - apache2-itk.
So, the problem is not that a single NC is crashing, but that multiple NCs are starting and grabbing semaphores, creating a deadlock situation.
The solution is to install the 'non itk' version of apache2. We encountered this problem once, and the simple solution was 'shut down NC, apt-get install apache2, start up NC'.
Is there a way to make sure that apache2-itk does not satisfy the node controller's dependency on 'apache2' for a more permanent fix? Alternatively, is there a way to limit the number of threads that apache2-itk will spawn (we have thus far been unable to find a way to do this)? Either should solve the problem long-term, but we have not fully tested apache2-itk and so would posit that the former is preferable.
Chris Jones (cmsj) wrote : | #5 |
Daniel: That does appear to be the case here, yes, itk was installed. I didn't explicitly choose to install that, so apt must have settled on it when attempting to satisfy the apache dependency.
Soren: Presumably we could avoid this by either having eucalyptus-nc explicitly conflict with apache2-mpm-itk, or explicitly depend on apache2-
summary: |
- Node controller sometimes crashes
+ Node controller apache2 dependency satisfied by an incompatible mpm
Chris Jones (cmsj) wrote : | #6 |
(a consultation with a packaging maven suggests that the explicit conflict and explicit dependency are both valid and indeed using both is not incorrect)
Changed in eucalyptus (Ubuntu): | |
status: | New → Triaged |
importance: | Undecided → Low |
assignee: | nobody → Colin Watson (cjwatson) |
status: | Triaged → Fix Committed |
importance: | Low → Medium |
Dustin Kirkland (kirkland) wrote : | #7 |
We can leave the depends on apache2 and add a conflicts on apache2-mpm-itk.
Committing a fix.
:-Dustin
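To make the proposed packaging change concrete, here is a minimal sketch that checks whether a control stanza declares the conflict. The two stanzas are simplified illustrations based only on the comment above (a Depends on apache2, plus a new Conflicts on apache2-mpm-itk); the real debian/control has more fields, and the helper names are hypothetical.

```python
def parse_stanza(text):
    """Parse one Debian control stanza into a {field: value} dict."""
    fields = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw[0] in " \t" and current is not None:
            # Continuation line: append to the previous field's value.
            fields[current] += " " + raw.strip()
        elif ":" in raw:
            name, value = raw.split(":", 1)
            current = name.strip()
            fields[current] = value.strip()
    return fields


def relation_names(value):
    """Extract bare package names from a relationship field value."""
    names = []
    for clause in value.split(","):
        for alternative in clause.split("|"):
            # Drop any version constraint such as "(>= 2.2)".
            name = alternative.split("(")[0].strip()
            if name:
                names.append(name)
    return names


def conflicts_with(stanza_text, package):
    """True if the stanza's Conflicts field names the given package."""
    fields = parse_stanza(stanza_text)
    return package in relation_names(fields.get("Conflicts", ""))


# Simplified, illustrative stanzas (not the actual package metadata).
BEFORE = """\
Package: eucalyptus-nc
Depends: apache2
"""

AFTER = """\
Package: eucalyptus-nc
Depends: apache2
Conflicts: apache2-mpm-itk
"""

print(conflicts_with(BEFORE, "apache2-mpm-itk"))
print(conflicts_with(AFTER, "apache2-mpm-itk"))
```

With the conflict declared, the package manager cannot leave apache2-mpm-itk installed alongside the node controller, so the apache2 dependency has to be met some other way.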
Changed in eucalyptus (Ubuntu): | |
status: | Fix Committed → Triaged |
assignee: | Colin Watson (cjwatson) → Dustin Kirkland (kirkland) |
status: | Triaged → Fix Committed |
Launchpad Janitor (janitor) wrote : | #8 |
This bug was fixed in the package eucalyptus - 1.6~bzr854-0ubuntu6
---------------
eucalyptus (1.6~bzr854-
[ Thierry Carrez ]
* Apply upstream rev867 and rev876 diffs to fix SC registration through
Web UI, LP: #436313
* tools/eucalyptu
on a merge, LP: #435766
[ Dustin Kirkland ]
* debian/
displays the administration URL in the MOTD at position 80, LP: #436199
* debian/
installing the -nc does not have VT, LP: #426830
* debian/rules: don't fail package installation due to init script
failures, LP: #430075, #418473
* tools/euca_conf.in: vastly improve the output of
'euca_conf --register-nodes', which was missing some pertinent
information, LP: #424457
* clc/modules/
if we're going to use the local host to send email, use 'localhost' as
the hostname, rather than the externally resolvable hostname which
breaks in the default ubuntu postfix configuration, LP: #412676
* debian/control:
- have eucalyptus-common depend on openssh-server and openssh-client,
as these should really be installed on most any Eucalyptus system,
LP: #411656
- have eucalyptus-common recommend unzip, since Eucalyptus uses zip
files for credentials, which may be needed on various systems,
LP: #436876
- recommend libpam that provides pam_motd, LP: #436199
- conflict with apache2-mpm-itk, LP: #338344
* debian/
VNET_DHCPUSER appropriately for default Ubuntu on initial install,
LP: #364938
[ Colin Watson ]
* debian/control:
- Make eucalyptus-nc explicitly depend on apache2-mpm-worker |
apache2-
to only start a single child (LP: #338344).
-- Dustin Kirkland <email address hidden> Fri, 25 Sep 2009 18:01:29 -0700
Changed in eucalyptus (Ubuntu): | |
status: | Fix Committed → Fix Released |
After applying some manual changes suggested by grze, and switching to testing with KVM, I still have the same problem:
[Fri Mar 6 14:52:08 2009][004767] [EUCAINFO ] doRunInstance() invoked (id=i-36BA0714 cores=1 disk=1 memory=128
[Fri Mar 6 14:52:08 2009][004767] [EUCAINFO ] image=emi-94A3140C at http://curium.cloud:8773/services/Walrus/hardy-i386-rootfs/hardy-i386-rootfs.manifest.xml
[Fri Mar 6 14:52:08 2009][004767] [EUCAINFO ] krnel=eki-C8501449 at http://curium.cloud:8773/services/Walrus/hardy-i386-kernel/vmlinuz-2.6.24-23-xen.manifest.xml
[Fri Mar 6 14:52:08 2009][004767] [EUCAINFO ] rmdsk=eri-09BA1527 at http://curium.cloud:8773/services/Walrus/hardy-i386-initrd/initrd.img-2.6.24-23-xen.manifest.xml
[Fri Mar 6 14:52:08 2009][004767] [EUCAINFO ] vlan=10 priMAC=d0:0d:36:BA:07:14 pubMAC=d0:0d:36:BA:07:14
[Fri Mar 6 14:52:08 2009][004767] [EUCAINFO ] network started for instance i-36BA0714
[Fri Mar 6 14:52:08 2009][004767] [EUCAINFO ] retrieving images for instance i-36BA0714...
[Fri Mar 6 14:52:08 2009][004767] [EUCAINFO ] walrus_request(): downloading /opt/instances/admin/i-36BA0714/disk-digest
[Fri Mar 6 14:52:08 2009][004767] [EUCAINFO ] from http://curium.cloud:8773/services/Walrus/hardy-i386-rootfs/hardy-i386-rootfs.manifest.xml
[Fri Mar 6 14:52:08 2009][004767] [EUCADEBUG ] walrus_request(): writing GET output to /opt/instances/admin/i-36BA0714/disk-digest
[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] NC is looking for configuration in //etc/eucalyptus/eucalyptus.conf
[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] NC is looking for configuration in //etc/eucalyptus/eucalyptus.conf
[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] sem_alloc(): cleaning up old semaphore eucalyptus-nc-xen-semaphore
[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] sem_alloc(): cleaning up old semaphore eucalyptus-nc-inst-semaphore
[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] sem_alloc(): cleaning up old semaphore eucalyptus-storage-semaphore
[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] SC is looking for configuration in //etc/eucalyptus/eucalyptus.conf
[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] euca_init_cert(): using file //var/lib/eucalyptus/keys/node-cert.pem
[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] euca_init_cert(): using file //var/lib/eucalyptus/keys/node-pk.pem
[Fri Mar 6 14:52:10 2009][004772] [EUCAINFO ] looking for existing KVM domains
[Fri Mar 6 14:52:11 2009][004772] [EUCAINFO ] no currently running Xen domains to adopt
[Fri Mar 6 14:52:11 2009][004772] [EUCAINFO ] checking the integrity of instances directory (/opt/instances)
[Fri Mar 6 14:52:11 2009][004772] [EUCAINFO ] vrun(): [rm -rf /opt/instances/admin/i-36BA0714]