instance stays in pending for > 1 hour, then to terminated

Bug #532682 reported by Scott Moser on 2010-03-05
This bug affects 3 people
Affects: Eucalyptus | Status: Fix Released | Importance: Undecided | Assigned to: Unassigned
Affects: eucalyptus (Ubuntu) | Importance: High | Assigned to: Dustin Kirkland
  Nominated for Hardy by Solayappan Adaikkalavan
Affects: eucalyptus (Ubuntu Lucid) | Importance: High | Assigned to: Dustin Kirkland

Bug Description

On a fresh install of lucid eucalyptus, I did the following:

A. download lucid 20100303 (http://uec-images.ubuntu.com/lucid/)
B. publish it
   uec-publish-tarball lucid-server-uec-amd64.tar.gz lucid-20100303
C. extract
   tar -Sxvzf lucid-server-uec-amd64.tar.gz
D. mount image loopback, install a new upstart
   sudo mount -o loop lucid-server-uec-amd64.img /mnt/
   cp upstart-* /mnt/tmp
   sudo chroot /mnt
   % LANG=C
   % dpkg -i /tmp/upstart_*
   % rm /tmp/upstart_*
   % exit
   sudo umount /mnt
E. register new image with kernel from the "dist" version
   $ uec-publish-image x86_64 --kernel eki-68881AF2 \
      lucid-server-uec-amd64.img lucid-20100303-d-upstart
   emi-24C619C3 lucid-20100303-d-upstart/lucid-server-uec-amd64.img.manifest.xml

F. run instance of new image
   euca-run-instances --key mykey --instance-type c1.medium emi-24C619C3

After 1 hour 15 minutes, the instance was still pending.
After another 30 minutes or so, it terminated.

Note: I doubt that steps 'C' and 'D' above are required to reproduce this, but I have not tested that.

Thierry Carrez (ttx) wrote :

Looking at the logs:

NC fails to get the DecryptedImage from Walrus:

walrus_request(): writing GET/Get DecryptedImage output to /var/lib/eucalyptus/instances//admin/i-3DB40784/disk
walrus_request(): wrote 30 bytes in 1 writes
walrus_request(): server responded with HTTP code 408 (timeout)
                  download retry 10 of 10 will commence in 4 seconds

The instance stays pending while the NC retries the download 10 times. After a couple of hours it stops trying and terminates the request.

On the Walrus side:

ERROR WalrusImageManager | WalrusImageManager.java:927 Tired of waiting to cache image: lucid-20100303-d-upstart/lucid-server-uec-amd64.img.manifest.xml giving up
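
To see how often the NC hit this timeout, the 408 responses in its log can be counted. The following is only a minimal sketch: the helper name `count_walrus_timeouts` and the log path in the usage comment are assumptions for illustration; only the matched message text comes from the log excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: count Walrus download attempts that ended in HTTP 408.
# Reads NC log text on stdin. The matched string is taken from the excerpt.
count_walrus_timeouts() {
    # grep -c prints 0 and exits 1 when nothing matches; treat that as success.
    grep -c 'server responded with HTTP code 408' || [ "$?" -eq 1 ]
}

# Usage (the log path is an assumption; adjust it to your installation):
#   count_walrus_timeouts < /var/log/eucalyptus/nc.log
```

A count that keeps growing while the instance is still pending would suggest the same retry loop described here.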

Scott Moser (smoser) wrote :

I'm attaching steps that I did after the lucid-image steps above with a karmic image.

Note that the karmic image has a ramdisk. This didn't seem to change anything; the karmic image also fails to boot.

Changed in eucalyptus (Ubuntu):
importance: Undecided → High
status: New → Confirmed
Scott Moser (smoser) on 2010-03-05
description: updated
Scott Moser (smoser) wrote :

I followed the attached file (karmic steps...) on a fresh install of Eucalyptus and it recreates the problem. That seems to me to rule out the "ramdiskless" scapegoat.

$ euca-describe-images
IMAGE eki-68691AF0 karmic-20100127/karmic-server-uec-amd64-vmlinuz-virtual.manifest.xml admin available public x86_64 kernel
IMAGE emi-193B15E5 karmic-20100127/karmic-server-uec-amd64.img.manifest.xml admin available public x86_64 machine eki-68691AF0 eri-45D21A75
IMAGE emi-248C19B8 karmic-20100127-d-upstart/karmic-server-uec-amd64.img.manifest.xml admin available public x86_64 machine eki-68691AF0 eri-45D21A75
IMAGE eri-45D21A75 karmic-20100127/karmic-server-uec-amd64-initrd-virtual.manifest.xml admin available public x86_64 ramdisk

$ euca-describe-instances
RESERVATION r-557808CA admin default
INSTANCE i-41400767 emi-248C19B8 10.1.1.100 172.19.1.2 pending mykey 0 c1.medium 2010-03-05T16:28:57.807Z cluster1 eki-68691AF0 eri-45D21A75

$ date --utc
Fri Mar 5 16:35:47 UTC 2010

It's been "pending" for 7 minutes at the moment.
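
The state and pending time can also be read from the `euca-describe-instances` output instead of by eye. This is a rough sketch under stated assumptions: the function names `instance_state` and `pending_seconds` are made up for illustration, the field positions are taken from the INSTANCE line above (they may differ when an instance has no addresses yet), and `date -d` assumes GNU date.

```shell
#!/bin/sh
# Print the state of one instance from `euca-describe-instances` output on stdin.
# Assumes the layout of the INSTANCE line shown above, where state is field 6.
instance_state() {
    awk -v id="$1" '$1 == "INSTANCE" && $2 == id { print $6 }'
}

# Seconds between a launch timestamp such as 2010-03-05T16:28:57.807Z ($1)
# and a later UTC time such as "2010-03-05 16:35:47" ($2).
pending_seconds() {
    # Turn the ISO timestamp into "YYYY-MM-DD hh:mm:ss" for date -d.
    launch=$(printf '%s' "$1" | sed -e 's/T/ /' -e 's/\.[0-9]*Z$//' -e 's/Z$//')
    start=$(date -u -d "$launch" +%s)
    now=$(date -u -d "$2" +%s)
    echo $((now - start))
}

# Usage:
#   euca-describe-instances | instance_state i-41400767
#   pending_seconds 2010-03-05T16:28:57.807Z "2010-03-05 16:35:47"
```

For the two timestamps above, the difference is 410 seconds, which is just under the 7 minutes mentioned.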

Scott Moser (smoser) wrote :

Dustin and Daniel worked on this a bit more on Friday and seem to have at least found a workaround. The database was getting stuck. Per Dustin, it at least appears that this is not a uec-publish-[image,tarball] problem, as I had thought it might be.

Dustin Kirkland  (kirkland) wrote :

Dan-

This is the issue you and I looked at on Friday.

I believe we decided that Eucalyptus was blowing up somewhere during the image decryption, and then the cloud/database worked itself into a nasty state.

We restarted eucalyptus several times, and eventually we were able to register the very same images that were failing.

I'm assigning you this bug, as you said this was something you needed to discuss with Chris.

This bug is marked High and Confirmed because it's blocking Scott's work on creating/verifying Lucid UEC images.

Changed in eucalyptus (Ubuntu Lucid):
assignee: nobody → Daniel Nurmi (nurmi)
Scott Moser (smoser) wrote :

I'm not sure why I'm so lucky, but I hit this more often than not... I'm almost completely unable to register an image that will boot.

It seems critical to me, as it blocks all usage of the installed UEC.

Neil Soman (neilsoman) wrote :

------------------------------------------------------------
revno: 1208
fixes bug(s): https://launchpad.net/bugs/532682
committer: Neil
branch nick: 1.6.2
timestamp: Mon 2010-03-08 14:39:06 -0800
message:
  fixes LP: #532682
------------------------------------------------------------

Changed in eucalyptus:
status: New → Fix Committed
Changed in eucalyptus (Ubuntu Lucid):
status: Confirmed → In Progress
assignee: Daniel Nurmi (nurmi) → Dustin Kirkland (kirkland)
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eucalyptus - 1.6.2-0ubuntu12

---------------
eucalyptus (1.6.2-0ubuntu12) lucid; urgency=low

  * Cherry-pick merge from upstream 1.6.2, now on revision 1208
    - LP: #532682 - fix long, pending instances that go straight to
      'terminated'
 -- Dustin Kirkland <email address hidden> Mon, 08 Mar 2010 17:34:24 -0600

Changed in eucalyptus (Ubuntu Lucid):
status: Fix Committed → Fix Released
nipun (nipunsehrawatns) wrote :

I had the exact same problem with image decryption at Walrus. The kernel and ramdisk are transferred fine to the NC. After reading this bug report, I upgraded from Karmic to Lucid Lynx yesterday, but the same problem exists.

I have been stuck on this for 5 days now. Please tell me if there is any workaround available, such as not encrypting the image while uploading it to Walrus.

Scott Moser (smoser) wrote :

@nipun, please open a new bug, as this one is fix-released. It really *was* fix released too; the problem that was solved was due to the manifest path including the string 'tar'.

Please do open a new bug and report your problem.
Ideally:
1. recreate your failure
2. use 'ubuntu-bug eucalyptus-common' on the cloud controller to report it and have it send the logs.
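
For anyone still on a build without this fix, a quick check of whether a manifest path contains 'tar' might look like the sketch below. The helper name `manifest_path_has_tar` is hypothetical, and a plain substring test is an assumption: this thread only says that the path including the string 'tar' was the problem, not how Eucalyptus matched it.

```shell
#!/bin/sh
# Hypothetical pre-publish check: exit 0 if the manifest path contains the
# substring "tar", exit 1 otherwise. Plain substring matching is an assumption.
manifest_path_has_tar() {
    case "$1" in
        *tar*) return 0 ;;
        *)     return 1 ;;
    esac
}

# The failing manifest from this report matches because of "upstart".
if manifest_path_has_tar "lucid-20100303-d-upstart/lucid-server-uec-amd64.img.manifest.xml"; then
    echo "manifest path contains 'tar'"
fi
```

Both failing images in this report were published under names ending in "-d-upstart", which contain 'tar'.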

Changed in eucalyptus:
status: Fix Committed → Fix Released