upgrade from diablo leaves existing images with kernel unbootable

Bug #942865 reported by Scott Moser on 2012-02-28
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  High        Vish Ishaya  2012.1

Bug Description

After an upgrade to essex (ubuntu precise), some images registered via the ec2 api that had a kernel and ramdisk were left broken.

This was because the images were uploaded referencing a given aki-xxxxxxx and ari-xxxxxxx but after upgrade, those aki and ari changed, rendering the images broken.

I made some notes, saying:

##aki-0000009e smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-kernel.manifest.xml
##ari-0000009f smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-build0-loader.manifest.xml
## New (amis re-numbered after essex upgrade)
#aki-0000001e smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-kernel.manifest.xml
#ari-0000001f smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-build0-loader.manifest.xml
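
The renumbering in these notes can be tabulated mechanically. A minimal sketch, assuming an "ID manifest" listing was saved before and after the upgrade; the `map_renumbered` helper and its input files are hypothetical and not part of euca2ools:

```shell
# map_renumbered OLD_LIST NEW_LIST  (hypothetical helper)
# Each input file holds "<ec2-id> <manifest-path>" per line.
# Prints "<old-id> <new-id> <manifest-path>" for every manifest whose ID changed.
map_renumbered() {
    awk 'NR == FNR { old[$2] = $1; next }
         ($2 in old) && old[$2] != $1 { print old[$2], $1, $2 }' "$1" "$2"
}
```

Fed the four entries above (without the leading # marks), this would pair aki-0000009e with aki-0000001e and ari-0000009f with ari-0000001f.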

Subsequently, I've twice uploaded new images that reference a kernel/ramdisk that were present prior to the upgrade, and I get strange results:

$ euca-describe-images aki-0000001e ari-0000001f
IMAGE aki-0000001e smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-kernel.manifest.xml available public x86_64 kernel
IMAGE ari-0000001f smoser-lucid-loader/lucid-amd64-linux-image-2.6.32-34-virtual-v-2.6.32-34.77~smloader0-build0-loader.manifest.xml available public x86_64 ramdisk instance-store

$ cloud-publish-tarball --kernel=aki-0000001e --ramdisk=ari-0000001f ubuntu-10.04-server-cloudimg-amd64.tar.gz smoser-cloud-images amd64
Tue Feb 28 17:55:14 UTC 2012: ====== extracting image ======
kernel : aki-0000001e
ramdisk: ari-0000001f
image : lucid-server-cloudimg-amd64.img
Tue Feb 28 17:55:23 UTC 2012: ====== bundle/upload image ======
Tue Feb 28 17:56:18 UTC 2012: ====== done ======
emi="ami-0000006f"; eri="ari-0000001f"; eki="aki-0000001e";

$ euca-describe-images ami-0000006f
IMAGE ami-0000006f smoser-cloud-images/ubuntu-lucid-10.04-amd64-server-20120221.manifest.xml available public x86_64 machine aki-00000061 ari-00000062

Note that although cloud-publish-tarball registered the image with aki-0000001e and ari-0000001f, the image reports using aki-00000061 and ari-00000062.

$ euca-run-instances ami-0000006f
ImageNotFound: Image 30 could not be found.

$ euca-describe-images aki-00000061
ImageNotFound: Image aki-00000061 could not be found.
$ euca-describe-images ari-00000062
ImageNotFound: Image ari-00000062 could not be found.
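
The number in the run-instances error looks like the decimal form of an EC2-style ID's hexadecimal suffix: 0x1e is 30, which lines up with "Image 30" and aki-0000001e. A small sketch for converting between the two, assuming that correspondence holds; the function names are made up for illustration:

```shell
# ec2_id_to_int ID  (hypothetical helper)
# Prints the decimal value of the hex suffix of an EC2-style ID.
ec2_id_to_int() {
    hex="${1##*-}"          # keep only the text after the last '-'
    echo "$((0x${hex}))"    # shell arithmetic accepts 0x-prefixed hex
}

# int_to_ec2_id PREFIX N  (hypothetical helper)
# The reverse direction: builds an ID such as aki-0000001e from "aki" and 30.
int_to_ec2_id() {
    printf '%s-%08x\n' "$1" "$2"
}
```

With these, `ec2_id_to_int aki-0000001e` prints 30 and `int_to_ec2_id aki 30` prints aki-0000001e.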

To make things even stranger: although the above fails now, soon after publishing I ran the identical 'run-instances' command and the images did boot. From inside the instance, I then saw:

$ for p in ami-id kernel-id ramdisk-id; do echo $p: $(wget$p -q -O -); done
ami-id: ami-00000000
kernel-id: ami-00000000
ramdisk-id: ami-00000000
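
A sketch of the same check as a reusable function, assuming the instance can reach the EC2-compatible metadata service at its conventional address, http://169.254.169.254/latest/meta-data/. That base URL is an assumption here, as are the helper names:

```shell
# Assumed base URL of the EC2-compatible metadata service.
MD_BASE="http://169.254.169.254/latest/meta-data"

# metadata_url KEY  (hypothetical helper): print the URL for one metadata key.
metadata_url() {
    echo "${MD_BASE}/$1"
}

# show_metadata KEY...  (hypothetical helper): print "key: value" for each key.
show_metadata() {
    for p in "$@"; do
        echo "$p: $(wget -q -O - "$(metadata_url "$p")")"
    done
}
```

From inside an instance, `show_metadata ami-id kernel-id ramdisk-id` would query the same three keys as the loop above.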

The last bit appears to be a general essex bug, as I see it on other instances too.

Scott Moser (smoser) wrote :

More strangeness possibly related:
$ euca-run-instances ami-0000006f
ImageNotFound: Image 112 could not be found.
$ euca-modify-image-attribute ami-0000006f --launch-permission --remove all
IMAGE ami-0000006f
$ euca-run-instances ami-0000006f
ImageNotFound: Image 114 could not be found.

I.e., making the image private and then public changes the image id in the error response.

I also realize that when I ran the instances above to verify, the image was still private. I then made the images public, and they failed to run.

Scott Moser (smoser) wrote :

This seems to be a general problem with essex and an image that has a kernel/ramdisk associated with it, and making that image public.

To demonstrate, on a local devstack, I did:

$ cloud-publish-tarball files/cirros-0.3.0-x86_64-uec.tar.gz sm-bucket
Wed Feb 29 09:28:28 EST 2012: ====== extracting image ======
kernel : cirros-0.3.0-x86_64-vmlinuz
ramdisk: cirros-0.3.0-x86_64-initrd
image : cirros-0.3.0-x86_64-blank.img
Wed Feb 29 09:28:28 EST 2012: ====== bundle/upload kernel ======
Wed Feb 29 09:28:30 EST 2012: ====== bundle/upload ramdisk ======
Wed Feb 29 09:28:33 EST 2012: ====== bundle/upload image ======
Wed Feb 29 09:28:37 EST 2012: ====== done ======
emi="ami-0000000a"; eri="ari-00000009"; eki="aki-00000008";

$ euca-describe-images ami-0000000a
IMAGE ami-0000000a sm-bucket/cirros-0.3.0-x86_64-blank.img.manifest.xml available private x86_64 machine aki-00000008 ari-00000009 instance-store

$ euca-run-instances --key mykey --instance-type m1.tiny ami-0000000a

The instance ran fine. Then, try to make it public:

$ euca-modify-image-attribute --launch-permission --add all aki-00000008
IMAGE aki-00000008
$ euca-modify-image-attribute --launch-permission --add all ari-00000009
IMAGE ari-00000009
$ euca-modify-image-attribute --launch-permission --add all ami-0000000a
IMAGE ami-0000000a

Making it public changes the aki and ari reported for the ami in the describe-images output:

$ euca-describe-images | grep sm-bucket
IMAGE aki-00000008 sm-bucket/cirros-0.3.0-x86_64-vmlinuz.manifest.xml available public x86_64 kernel instance-store
IMAGE ari-00000009 sm-bucket/cirros-0.3.0-x86_64-initrd.manifest.xml available public x86_64 ramdisk instance-store
IMAGE ami-0000000a sm-bucket/cirros-0.3.0-x86_64-blank.img.manifest.xml available public x86_64 machine aki-0000000b ari-0000000c instance-store

Try to run it again.
$ euca-run-instances --key mykey --instance-type m1.tiny ami-0000000a
ImageNotFound: Image 8 could not be found.
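
One way to spot dangling references like these is to pull the kernel and ramdisk IDs out of an IMAGE line and check whether each can still be described. A rough sketch, assuming the column layout seen in the euca-describe-images output in this report, and assuming euca-describe-images exits non-zero when an image is not found; the helper names are hypothetical:

```shell
# image_refs  (hypothetical helper)
# Reads IMAGE lines on stdin and prints the aki-/ari- IDs a machine image
# refers to, skipping the image's own ID in column 2.
image_refs() {
    awk '$1 == "IMAGE" {
             for (i = 3; i <= NF; i++)
                 if ($i ~ /^a[kr]i-[0-9a-f]+$/) print $i
         }'
}

# check_image_refs AMI  (hypothetical helper)
# Reports each referenced kernel/ramdisk that can no longer be described.
# Assumption: euca-describe-images fails (non-zero exit) on ImageNotFound.
check_image_refs() {
    euca-describe-images "$1" | image_refs | while read -r ref; do
        euca-describe-images "$ref" >/dev/null 2>&1 || echo "missing: $ref"
    done
}
```

Run as `check_image_refs ami-0000000a`, this would be expected to flag aki-0000000b and ari-0000000c if they cannot be found.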

To reduce the scope a bit, try using already-public kernel/ramdisk:
$ cloud-publish-tarball files/cirros-0.3.0-x86_64-uec.tar.gz --kernel aki-00000008 --ramdisk ari-00000009 sm-bucket2
Wed Feb 29 10:54:41 EST 2012: ====== extracting image ======
kernel : aki-00000008
ramdisk: ari-00000009
image : cirros-0.3.0-x86_64-blank.img
Wed Feb 29 10:54:42 EST 2012: ====== bundle/upload image ======
Wed Feb 29 10:54:45 EST 2012: ====== done ======
emi="ami-0000000d"; eri="ari-00000009"; eki="aki-00000008";

$ euca-run-instances --key mykey --instance-type m1.tiny ami-0000000d

The new instance runs fine as private.

$ euca-modify-image-attribute --launch-permission --add all ami-0000000d
$ euca-describe-images ami-0000000d
IMAGE ami-0000000d sm-bucket2/cirros-0.3.0-x86_64-blank.img.manifest.xml available public x86_64 machine aki-0000000b ari-0000000c instance-store

Note, again, the aki and ari changed.
And now...
$ euca-run-instances --key mykey --instance-type m1.tiny ami-0000000d
ImageNotFound: Image 8 could not be found.

Changed in nova:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Vish Ishaya (vishvananda)
Changed in nova:
status: Triaged → In Progress
Scott Moser (smoser) wrote :

Just for reference, this is being fixed at

Changed in nova:
milestone: none → essex-rc1

Reviewed: https://review.openstack.org/4788
Committed: http://github.com/openstack/nova/commit/0d78045e72efe7313ca54e726dd403793eb30b52
Submitter: Jenkins
Branch: master

commit 0d78045e72efe7313ca54e726dd403793eb30b52
Author: Vishvananda Ishaya <email address hidden>
Date: Thu Mar 1 16:52:07 2012 -0800

    Fixes for ec2 images

     * Fixes s3 image service to convert back to uuids on update
     * Adds exception for attempt to update an unowned image
     * Adds error messages to ec2 for failure cases
     * Adds tests to verify changes
     * Fixes bug 942865

    Change-Id: I35331c635756f10c02b30dd43ab3fe0ad98bc28c

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2012-03-20
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-04-05
Changed in nova:
milestone: essex-rc1 → 2012.1