Node controller apache2 dependency satisfied by an incompatible mpm

Bug #338344 reported by Chris Jones on 2009-03-05
Affects: eucalyptus (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Dustin Kirkland

Bug Description

This is the log from the node controller. It clearly sees that it is supposed to be spawning an instance, and at least starts grabbing the images, but almost immediately just rm -rf's them.

It transpires that this is due to the images being incomplete in Walrus (the disk ran out of space while I was uploading them). This suggests that some internal verification is taking place and rejecting the images. The NC should really log this fact, and if possible send a message back to the CC so it knows the instance setup failed (at the moment the instance shows as pending until some timeout is reached, and then it is shown as terminated).

[Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] doRunInstance() invoked (id=i-3ECC080E cores=1 disk=1 memory=128
[Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] image=emi-94A3140C at http://curium.cloud:8773/services/Walrus/hardy-i386-rootfs/hardy-i386-rootfs.manifest.xml
[Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] krnel=eki-C8501449 at http://curium.cloud:8773/services/Walrus/hardy-i386-kernel/vmlinuz-2.6.24-23-xen.manifest.xml
[Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] rmdsk=eri-09BA1527 at http://curium.cloud:8773/services/Walrus/hardy-i386-initrd/initrd.img-2.6.24-23-xen.manifest.xml
[Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] vlan=10 priMAC=d0:0d:3E:CC:08:0E pubMAC=d0:0d:3E:CC:08:0E
[Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] network started for instance i-3ECC080E
[Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] retrieving images for instance i-3ECC080E...
[Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] walrus_request(): downloading /opt/instances/admin/i-3ECC080E/root-digest
[Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] from http://curium.cloud:8773/services/Walrus/hardy-i386-rootfs/hardy-i386-rootfs.manifest.xml
[Thu Mar 5 14:49:10 2009][006011][EUCADEBUG ] walrus_request(): writing GET output to /opt/instances/admin/i-3ECC080E/root-digest
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] NC is looking for configuration in //etc/eucalyptus/eucalyptus.conf
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] NC is looking for configuration in //etc/eucalyptus/eucalyptus.conf
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] sem_alloc(): cleaning up old semaphore eucalyptus-nc-xen-semaphore
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] sem_alloc(): cleaning up old semaphore eucalyptus-nc-inst-semaphore
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] sem_alloc(): cleaning up old semaphore eucalyptus-storage-semaphore
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] SC is looking for configuration in //etc/eucalyptus/eucalyptus.conf
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] euca_init_cert(): using file //var/lib/eucalyptus/keys/node-cert.pem
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] euca_init_cert(): using file //var/lib/eucalyptus/keys/node-pk.pem
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] looking for existing Xen domains
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] checking the integrity of instances directory (/opt/instances)
[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] vrun(): [rm -rf /opt/instances/admin/i-3ECC080E]

Chris Jones (cmsj) wrote :

After applying some manual changes suggested by grze, and switching to testing with KVM, I still have the same problem:

[Fri Mar 6 14:52:08 2009][004767][EUCAINFO ] doRunInstance() invoked (id=i-36BA0714 cores=1 disk=1 memory=128
[Fri Mar 6 14:52:08 2009][004767][EUCAINFO ] image=emi-94A3140C at http://curium.cloud:8773/services/Walrus/hardy-i386-rootfs/hardy-i386-rootfs.manifest.xml
[Fri Mar 6 14:52:08 2009][004767][EUCAINFO ] krnel=eki-C8501449 at http://curium.cloud:8773/services/Walrus/hardy-i386-kernel/vmlinuz-2.6.24-23-xen.manifest.xml
[Fri Mar 6 14:52:08 2009][004767][EUCAINFO ] rmdsk=eri-09BA1527 at http://curium.cloud:8773/services/Walrus/hardy-i386-initrd/initrd.img-2.6.24-23-xen.manifest.xml
[Fri Mar 6 14:52:08 2009][004767][EUCAINFO ] vlan=10 priMAC=d0:0d:36:BA:07:14 pubMAC=d0:0d:36:BA:07:14
[Fri Mar 6 14:52:08 2009][004767][EUCAINFO ] network started for instance i-36BA0714
[Fri Mar 6 14:52:08 2009][004767][EUCAINFO ] retrieving images for instance i-36BA0714...
[Fri Mar 6 14:52:08 2009][004767][EUCAINFO ] walrus_request(): downloading /opt/instances/admin/i-36BA0714/disk-digest
[Fri Mar 6 14:52:08 2009][004767][EUCAINFO ] from http://curium.cloud:8773/services/Walrus/hardy-i386-rootfs/hardy-i386-rootfs.manifest.xml
[Fri Mar 6 14:52:08 2009][004767][EUCADEBUG ] walrus_request(): writing GET output to /opt/instances/admin/i-36BA0714/disk-digest
[Fri Mar 6 14:52:10 2009][004772][EUCAINFO ] NC is looking for configuration in //etc/eucalyptus/eucalyptus.conf
[Fri Mar 6 14:52:10 2009][004772][EUCAINFO ] NC is looking for configuration in //etc/eucalyptus/eucalyptus.conf
[Fri Mar 6 14:52:10 2009][004772][EUCAINFO ] sem_alloc(): cleaning up old semaphore eucalyptus-nc-xen-semaphore
[Fri Mar 6 14:52:10 2009][004772][EUCAINFO ] sem_alloc(): cleaning up old semaphore eucalyptus-nc-inst-semaphore
[Fri Mar 6 14:52:10 2009][004772][EUCAINFO ] sem_alloc(): cleaning up old semaphore eucalyptus-storage-semaphore
[Fri Mar 6 14:52:10 2009][004772][EUCAINFO ] SC is looking for configuration in //etc/eucalyptus/eucalyptus.conf
[Fri Mar 6 14:52:10 2009][004772][EUCAINFO ] euca_init_cert(): using file //var/lib/eucalyptus/keys/node-cert.pem
[Fri Mar 6 14:52:10 2009][004772][EUCAINFO ] euca_init_cert(): using file //var/lib/eucalyptus/keys/node-pk.pem
[Fri Mar 6 14:52:10 2009][004772][EUCAINFO ] looking for existing KVM domains
[Fri Mar 6 14:52:11 2009][004772][EUCAINFO ] no currently running Xen domains to adopt
[Fri Mar 6 14:52:11 2009][004772][EUCAINFO ] checking the integrity of instances directory (/opt/instances)
[Fri Mar 6 14:52:11 2009][004772][EUCAINFO ] vrun(): [rm -rf /opt/instances/admin/i-36BA0714]

Chris Jones (cmsj) on 2009-03-12
description: updated

There is no error message about the failed transfer because, according to both log excerpts above, the NC crashed in the middle of the transfer. (The line that starts with `NC is looking for configuration...` is the first one that the NC prints when it starts.) Since any incoming request from the CC causes the NC to restart, the crashes are easy to miss.

So, this bug is not really about the lack of error messages (we do try to log everything that goes wrong) but about a crash. If this crash still happens with the recent version of the code, I would like to know more about the circumstances so that I can recreate it. I tried a few runs with NC running out of disk space during different stages of instance initialization, with both compressed and non-compressed Walrus downloads, but in all cases the failure to write was handled properly.
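As a rough aid for spotting this pattern in an NC log, the sketch below looks for the startup line mentioned above and reports which process IDs were already logging when a different process printed it. The line format is inferred from the excerpts in this report, and the function names are hypothetical; this is not part of Eucalyptus.

```python
import re

# Line shape inferred from the excerpts in this report, e.g.
# [Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] retrieving images ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<pid>\d+)\]\[(?P<level>\w+)\s*\]\s*(?P<msg>.*)$"
)
STARTUP_MARKER = "NC is looking for configuration"


def find_nc_startups(lines):
    """Return (timestamp, pid) for the first startup line printed by each PID."""
    seen = set()
    startups = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m or STARTUP_MARKER not in m.group("msg"):
            continue
        pid = m.group("pid")
        if pid not in seen:
            seen.add(pid)
            startups.append((m.group("ts"), pid))
    return startups


def pids_active_before_startup(lines):
    """Return PIDs that had already logged when another PID printed the startup line."""
    active = []    # PIDs in order of first appearance
    flagged = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        pid = m.group("pid")
        if STARTUP_MARKER in m.group("msg"):
            for other in active:
                if other != pid and other not in flagged:
                    flagged.append(other)
        if pid not in active:
            active.append(pid)
    return flagged


# Three lines copied from the first excerpt in this report.
SAMPLE = [
    "[Thu Mar 5 14:49:10 2009][006011][EUCAINFO ] retrieving images for instance i-3ECC080E...",
    "[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] NC is looking for configuration in //etc/eucalyptus/eucalyptus.conf",
    "[Thu Mar 5 14:49:10 2009][006037][EUCAINFO ] NC is looking for configuration in //etc/eucalyptus/eucalyptus.conf",
]

if __name__ == "__main__":
    print(find_nc_startups(SAMPLE))
    print(pids_active_before_startup(SAMPLE))
```

A flagged PID only shows that a new NC process started while another one was mid-request; the log alone does not say whether the earlier process crashed or is still running.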

I'm less sure now that this relates to disk space. Due to unrelated hardware issues I've moved the NC to the same machine as the CC/CLC and this is still happening. I've nuked all of the images I had, switched from xen to kvm and uploaded the ttylinux image and the machine has >50GB spare, so this should have worked correctly (and the file in /var/lib/eucalyptus/bukkits is the right size).
Still, when I try to run an instance, the image is immediately removed and the NC appears to restart.

What more debugging information can I provide so we can figure out what's going on?

Daniel Nurmi (nurmi) wrote :

I have discovered the reason behind this behavior. The node controller (an axis2 web service that is loaded and run by an apache2 thread) can only run as a single thread. This means that apache2 must be configured to only start one child (or thread) to run the NC. While we set up the NC's httpd.conf to limit the number of threads that apache will spawn to 1, there is one apache2 flavor that doesn't honor this configuration - apache2-itk.

So, the problem is not that a single NC is crashing, but that multiple NCs are starting and grabbing semaphores, creating a deadlock situation.

The solution is to install the 'non itk' version of apache2. We encountered this problem once, and the simple solution was 'shut down NC, apt-get install apache2, start up NC'.

Is there a way to make sure that apache2-itk does not satisfy the node controller's dependency on 'apache2', for a more permanent fix? Alternatively, is there a way to limit the number of threads that apache2-itk will spawn? (We have thus far been unable to find one.) Either should solve the problem long-term, but we have not fully tested apache2-itk, so we would posit that the former is preferable.
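To illustrate why a second NC process would explain the `rm -rf` seen in the logs, here is a toy model. It assumes, based only on the "checking the integrity of instances directory" lines in the excerpts, that each NC process removes instance directories it does not know about when it starts. The class and its methods are hypothetical, and the directory layout is flattened for brevity; the real NC logic is not shown in this report.

```python
import os
import shutil
import tempfile


class ToyNC:
    """Hypothetical stand-in for one NC process; not Eucalyptus code."""

    def __init__(self, instances_dir):
        self.instances_dir = instances_dir
        self.known = set()       # instances this process has been asked to run
        self.check_integrity()   # assumed to run at startup, as the logs suggest

    def check_integrity(self):
        # Remove every instance directory this process does not know about.
        for name in os.listdir(self.instances_dir):
            if name not in self.known:
                shutil.rmtree(os.path.join(self.instances_dir, name))

    def run_instance(self, instance_id):
        # Record the instance and create its working directory.
        self.known.add(instance_id)
        path = os.path.join(self.instances_dir, instance_id)
        os.makedirs(path)
        return path


def demo():
    """Return whether the instance directory exists before and after a second NC starts."""
    root = tempfile.mkdtemp()
    try:
        first = ToyNC(root)
        path = first.run_instance("i-3ECC080E")
        existed_before = os.path.isdir(path)
        ToyNC(root)  # second process starts with no knowledge of the instance
        return existed_before, os.path.isdir(path)
    finally:
        shutil.rmtree(root, ignore_errors=True)


if __name__ == "__main__":
    print(demo())
```

In this model the first process never crashes; its working directory simply disappears underneath it, which matches the symptom of images being removed almost immediately after the download begins.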

Chris Jones (cmsj) wrote :

Daniel: That does appear to be the case here, yes, itk was installed. I didn't explicitly choose to install that, so apt must have settled on it when attempting to satisfy the apache dependency.

Soren: Presumably we could avoid this by having eucalyptus-nc either explicitly conflict with apache2-mpm-itk or explicitly depend on apache2-mpm-prefork | apache2-mpm-worker? (I wonder if the latter might be the best option, since those are the two that eucalyptus explicitly configures to run only a single child.)
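The two packaging options can be compared with a small model. This is a deliberately simplified sketch, not apt's resolver: it assumes, for illustration only, that each of the three MPM packages named in this thread can satisfy a plain dependency on `apache2`. The function and variable names are hypothetical.

```python
# Toy model of Depends/Conflicts checking; NOT apt's resolver.
# Assumption for illustration: each package below can satisfy a plain
# dependency on "apache2".
SATISFIES_APACHE2 = {
    "apache2-mpm-prefork",
    "apache2-mpm-worker",
    "apache2-mpm-itk",
}


def acceptable(installed_mpm, depends_any, conflicts):
    """True if installed_mpm satisfies one Depends alternative and no Conflicts entry."""

    def provides(pkg, wanted):
        return wanted == pkg or (wanted == "apache2" and pkg in SATISFIES_APACHE2)

    if installed_mpm in conflicts:
        return False
    return any(provides(installed_mpm, alt) for alt in depends_any)


# The three policies discussed in this thread, as (Depends alternatives, Conflicts).
ORIGINAL = (["apache2"], [])
WITH_CONFLICT = (["apache2"], ["apache2-mpm-itk"])
EXPLICIT_DEPENDS = (["apache2-mpm-prefork", "apache2-mpm-worker"], [])

if __name__ == "__main__":
    for label, (deps, conflicts) in [
        ("original", ORIGINAL),
        ("with conflict", WITH_CONFLICT),
        ("explicit depends", EXPLICIT_DEPENDS),
    ]:
        for mpm in sorted(SATISFIES_APACHE2):
            print(label, mpm, acceptable(mpm, deps, conflicts))
```

Under these assumptions, only the original policy accepts apache2-mpm-itk. The conflict excludes itk by name, while the explicit dependency also excludes any other MPM that is not listed.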

summary: - Node controller sometimes crashes
+ Node controller apache2 dependency satisfied by an incompatible mpm

Chris Jones (cmsj) wrote :

(A consultation with a packaging maven suggests that the explicit conflict and the explicit dependency are both valid, and that using both is not incorrect.)

Colin Watson (cjwatson) on 2009-09-25
Changed in eucalyptus (Ubuntu):
status: New → Triaged
importance: Undecided → Low
assignee: nobody → Colin Watson (cjwatson)
status: Triaged → Fix Committed
importance: Low → Medium

Dustin Kirkland (kirkland) wrote :

We can leave the depends on apache2 and add a conflicts on apache2-mpm-itk.

Committing a fix.

:-Dustin

Changed in eucalyptus (Ubuntu):
status: Fix Committed → Triaged
assignee: Colin Watson (cjwatson) → Dustin Kirkland (kirkland)
status: Triaged → Fix Committed

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eucalyptus - 1.6~bzr854-0ubuntu6

---------------
eucalyptus (1.6~bzr854-0ubuntu6) karmic; urgency=low

  [ Thierry Carrez ]
  * Apply upstream rev867 and rev876 diffs to fix SC registration through
    Web UI, LP: #436313
  * tools/eucalyptus-java-ws.in: Reapply fix on boot messages that was lost
    on a merge, LP: #435766

  [ Dustin Kirkland ]
  * debian/80-eucalyptus-url: add an update-motd script that
    displays the administration URL in the MOTD at position 80, LP: #436199
  * debian/eucalyptus-nc.preinst: echo a warning message if a system
    installing the -nc does not have VT, LP: #426830
  * debian/rules: don't fail package installation due to init script
    failures, LP: #430075, #418473
  * tools/euca_conf.in: vastly improve the output of
    'euca_conf --register-nodes', which was missing some pertinent
    information, LP: #424457
  * clc/modules/www/src/main/java/edu/ucsb/eucalyptus/admin/server/ServletUtils.java:
    if we're going to use the local host to send email, use 'localhost' as
    the hostname, rather than the externally resolvable hostname which
    breaks in the default ubuntu postfix configuration, LP: #412676
  * debian/control:
    - have eucalyptus-common depend on openssh-server and openssh-client,
      as these should really be installed on most any Eucalyptus system,
      LP: #411656
    - have eucalyptus-common recommend unzip, since Eucalyptus uses zip
      files for credentials, which may be needed on various systems,
      LP: #436876
    - recommend libpam that provides pam_motd, LP: #436199
    - conflict with apache2-mpm-itk, LP: #338344
  * debian/eucalyptus-common.postinst: configure VNET_DHCPDAEMON and
    VNET_DHCPUSER appropriately for default Ubuntu on initial install,
    LP: #364938

  [ Colin Watson ]
  * debian/control:
    - Make eucalyptus-nc explicitly depend on apache2-mpm-worker |
      apache2-mpm-prefork, since the NC requires that Apache be configured
      to only start a single child (LP: #338344).

 -- Dustin Kirkland <email address hidden> Fri, 25 Sep 2009 18:01:29 -0700

Changed in eucalyptus (Ubuntu):
status: Fix Committed → Fix Released