OpenStack Compute (Nova)

upgrade from diablo leaves existing images with kernel unbootable

Reported by Scott Moser on 2012-02-28
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Vish Ishaya

Bug Description

After an upgrade to essex (ubuntu precise), some images registered via the ec2 api that had a kernel and ramdisk were left broken.

This was because the images were uploaded referencing a specific aki-xxxxxxx and ari-xxxxxxx, but after the upgrade the ids of that kernel and ramdisk changed, rendering the images broken.

I made some notes, saying:

##aki-0000009e smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-kernel.manifest.xml
##ari-0000009f smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-build0-loader.manifest.xml
#
## New (amis re-numbered after essex upgrade)
#aki-0000001e smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-kernel.manifest.xml
#ari-0000001f smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-build0-loader.manifest.xml

Subsequently, I've twice uploaded new images that reference a kernel and ramdisk that were present prior to the upgrade, and I get strange results:

$ euca-describe-images aki-0000001e ari-0000001f
IMAGE aki-0000001e smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-kernel.manifest.xml available public x86_64 kernel instance-store
IMAGE ari-0000001f smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-build0-loader.manifest.xml available public x86_64 ramdisk instance-store

$ cloud-publish-tarball --kernel=aki-0000001e --ramdisk=ari-0000001f ubuntu-10.04-server-cloudimg-amd64.tar.gz smoser-cloud-images amd64
Tue Feb 28 17:55:14 UTC 2012: ====== extracting image ======
kernel : aki-0000001e
ramdisk: ari-0000001f
image : lucid-server-cloudimg-amd64.img
Tue Feb 28 17:55:23 UTC 2012: ====== bundle/upload image ======
Tue Feb 28 17:56:18 UTC 2012: ====== done ======
emi="ami-0000006f"; eri="ari-0000001f"; eki="aki-0000001e";

$ euca-describe-images ami-0000006f
IMAGE ami-0000006f smoser-cloud-images/ubuntu-lucid-10.04-amd64-server-20120221.manifest.xml available public x86_64 machine aki-00000061 ari-00000062

Note that although cloud-publish-tarball registered the image with aki-0000001e and ari-0000001f, the image reports using aki-00000061 and ari-00000062.

But running the image fails, and neither reported id can be described:
$ euca-run-instances ami-0000006f
ImageNotFound: Image 30 could not be found.

$ euca-describe-images aki-00000061
ImageNotFound: Image aki-00000061 could not be found.
$ euca-describe-images ari-00000062
ImageNotFound: Image ari-00000062 could not be found.
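
One observation that may help when reading these errors: the number in the ImageNotFound message looks like the decimal form of the hex suffix of the kernel id given at registration (0x1e is 30, and the image was registered with aki-0000001e). The sketch below converts between the two forms. The function names are hypothetical helpers, not part of euca2ools or nova, and the 8-digit hex layout is assumed from the ids shown in this report.

```shell
#!/bin/sh
# Hypothetical helpers for reading ids in this report; not part of any tool.

# Convert an EC2-style id such as aki-0000001e to its decimal integer form.
ec2_id_to_int() {
    hex=${1##*-}              # keep the text after the last "-"
    printf '%d\n' "0x$hex"    # printf accepts a 0x-prefixed hex constant
}

# Convert a decimal integer back to an EC2-style id with the given prefix.
int_to_ec2_id() {
    printf '%s-%08x\n' "$1" "$2"
}

ec2_id_to_int aki-0000001e    # prints 30
int_to_ec2_id aki 30          # prints aki-0000001e
```

If this reading is right, "Image 30 could not be found" refers to the integer id behind aki-0000001e rather than to the ami itself.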

To make things even stranger: although the above is what happens now, soon after publishing I verified that the images booted, using the same 'run-instances' command as above. From inside such an instance, I see:

$ for p in ami-id kernel-id ramdisk-id; do echo $p: $(wget http://169.254.169.254/2009-04-04/meta-data/$p -q -O -); done
ami-id: ami-00000000
kernel-id: ami-00000000
ramdisk-id: ami-00000000

The all-zero ids in the metadata may be a general essex bug, as I see them on other instances too.

Scott Moser (smoser) wrote :

More strangeness possibly related:
$ euca-run-instances ami-0000006f
ImageNotFound: Image 112 could not be found.
$ euca-modify-image-attribute ami-0000006f --launch-permission --remove all
IMAGE ami-0000006f
$ euca-run-instances ami-0000006f
ImageNotFound: Image 114 could not be found.

I.e., making the image private and then public changes the image id in the error response.

I also realize that the instances I ran above to verify the images were launched while the images were private. I then made the images public, and they failed to run.

Scott Moser (smoser) wrote :

This seems to be a general problem with essex and an image that has a kernel/ramdisk associated with it, and making that image public.

To demonstrate, on a local devstack, I did:

$ cloud-publish-tarball files/cirros-0.3.0-x86_64-uec.tar.gz sm-bucket
Wed Feb 29 09:28:28 EST 2012: ====== extracting image ======
kernel : cirros-0.3.0-x86_64-vmlinuz
ramdisk: cirros-0.3.0-x86_64-initrd
image : cirros-0.3.0-x86_64-blank.img
Wed Feb 29 09:28:28 EST 2012: ====== bundle/upload kernel ======
Wed Feb 29 09:28:30 EST 2012: ====== bundle/upload ramdisk ======
Wed Feb 29 09:28:33 EST 2012: ====== bundle/upload image ======
Wed Feb 29 09:28:37 EST 2012: ====== done ======
emi="ami-0000000a"; eri="ari-00000009"; eki="aki-00000008";

$ euca-describe-images ami-0000000a
IMAGE ami-0000000a sm-bucket/cirros-0.3.0-x86_64-blank.img.manifest.xml available private x86_64 machine aki-00000008 ari-00000009 instance-store

$ euca-run-instances --key mykey --instance-type m1.tiny ami-0000000a

The instance ran fine. Then, try to make it public.

$ euca-modify-image-attribute --launch-permission --add all aki-00000008
IMAGE aki-00000008
$ euca-modify-image-attribute --launch-permission --add all ari-00000009
IMAGE ari-00000009
$ euca-modify-image-attribute --launch-permission --add all ami-0000000a
IMAGE ami-0000000a

Making it public changes the aki and ari reported for the ami in describe-images output:

$ euca-describe-images | grep sm-bucket
IMAGE aki-00000008 sm-bucket/cirros-0.3.0-x86_64-vmlinuz.manifest.xml available public x86_64 kernel instance-store
IMAGE ari-00000009 sm-bucket/cirros-0.3.0-x86_64-initrd.manifest.xml available public x86_64 ramdisk instance-store
IMAGE ami-0000000a sm-bucket/cirros-0.3.0-x86_64-blank.img.manifest.xml available public x86_64 machine aki-0000000b ari-0000000c instance-store

Try to run it again.
$ euca-run-instances --key mykey --instance-type m1.tiny ami-0000000a
ImageNotFound: Image 8 could not be found.

To reduce the scope a bit, try using already-public kernel/ramdisk:
$ cloud-publish-tarball files/cirros-0.3.0-x86_64-uec.tar.gz --kernel aki-00000008 --ramdisk ari-00000009 sm-bucket2
Wed Feb 29 10:54:41 EST 2012: ====== extracting image ======
kernel : aki-00000008
ramdisk: ari-00000009
image : cirros-0.3.0-x86_64-blank.img
Wed Feb 29 10:54:42 EST 2012: ====== bundle/upload image ======
Wed Feb 29 10:54:45 EST 2012: ====== done ======
emi="ami-0000000d"; eri="ari-00000009"; eki="aki-00000008";

$ euca-run-instances --key mykey --instance-type m1.tiny ami-0000000d

The new instance runs fine as private.

$ euca-modify-image-attribute --launch-permission --add all ami-0000000d
$ euca-describe-images ami-0000000d
IMAGE ami-0000000d sm-bucket2/cirros-0.3.0-x86_64-blank.img.manifest.xml available public x86_64 machine aki-0000000b ari-0000000c instance-store

Note, again, the aki and ari changed.
And now...
$ euca-run-instances --key mykey --instance-type m1.tiny ami-0000000d
ImageNotFound: Image 8 could not be found.
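
To catch this without reading the output by eye, a small check can compare the aki/ari that describe-images reports against the eki/eri that cloud-publish-tarball printed. This is only a sketch: the function names are hypothetical, and it assumes the IMAGE line layout shown in the output above (image id in the second field, kernel and ramdisk ids somewhere after it).

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text in the format shown above.

# Read euca-describe-images output on stdin and print, for each IMAGE line,
# the image id followed by the kernel and ramdisk ids it reports.
image_kernel_ramdisk() {
    awk '$1 == "IMAGE" {
        aki = ""; ari = ""
        # Start at field 3 so the image id itself (field 2) is never matched.
        for (i = 3; i <= NF; i++) {
            if ($i ~ /^aki-[0-9a-f]+$/) aki = $i
            if ($i ~ /^ari-[0-9a-f]+$/) ari = $i
        }
        print $2, aki, ari
    }'
}

# Usage: euca-describe-images ami-... | check_association aki-... ari-...
check_association() {
    want_aki=$1
    want_ari=$2
    image_kernel_ramdisk | while read -r ami aki ari; do
        if [ "$aki" = "$want_aki" ] && [ "$ari" = "$want_ari" ]; then
            echo "$ami: ok"
        else
            echo "$ami: expected $want_aki $want_ari, got $aki $ari"
        fi
    done
}
```

Run before and after euca-modify-image-attribute, this would report "ok" for the private image and a mismatch once the image has been made public, given the output shown above.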

Changed in nova:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Vish Ishaya (vishvananda)
Changed in nova:
status: Triaged → In Progress
Scott Moser (smoser) wrote :

Just for reference, this is being fixed at
https://review.openstack.org/#change,4788

Changed in nova:
milestone: none → essex-rc1

Reviewed: https://review.openstack.org/4788
Committed: http://github.com/openstack/nova/commit/0d78045e72efe7313ca54e726dd403793eb30b52
Submitter: Jenkins
Branch: master

commit 0d78045e72efe7313ca54e726dd403793eb30b52
Author: Vishvananda Ishaya <email address hidden>
Date: Thu Mar 1 16:52:07 2012 -0800

    Fixes for ec2 images

     * Fixes s3 image service to convert back to uuids on update
     * Adds exception for attempt to update an unowned image
     * Adds error messages to ec2 for failure cases
     * Adds tests to verify changes
     * Fixes bug 942865

    Change-Id: I35331c635756f10c02b30dd43ab3fe0ad98bc28c

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2012-03-20
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-04-05
Changed in nova:
milestone: essex-rc1 → 2012.1