KVM + Glance + OS API 1.1 -- Image boot/networking problems

Bug #752008 reported by Wayne A. Walls
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Invalid
Undecided
Unassigned

Bug Description

I'm having an odd issue where I glance upload an image, and I can then create an instance that pings/SSH, but whenever I try to create a second instance I get the following error:

##########
2011-04-05 17:15:49,070 INFO nova.virt.libvirt_conn [-] instance instance-00000039: Creating image
2011-04-05 17:15:49,142 DEBUG nova.utils [-] Attempting to grab semaphore "00000006" for method "call_if_not_exists"... from (pid=7537) inner /usr/lib/pymodules/python2.6/nova/utils.py:594
2011-04-05 17:15:49,143 DEBUG nova.utils [-] Running cmd (subprocess): qemu-img create -f qcow2 -o cluster_size=2M,backing_file=/var/lib/nova/instances/_base/00000006 /var/lib/nova/instances/instance-00000039/disk from (pid=7537) execute /usr/lib/pymodules/python2.6/nova/utils.py:150
2011-04-05 17:15:49,193 DEBUG nova.utils [-] Attempting to grab semaphore "local_20" for method "call_if_not_exists"... from (pid=7537) inner /usr/lib/pymodules/python2.6/nova/utils.py:594
2011-04-05 17:15:49,194 DEBUG nova.utils [-] Running cmd (subprocess): qemu-img create -f qcow2 -o cluster_size=2M,backing_file=/var/lib/nova/instances/_base/local_20 /var/lib/nova/instances/instance-00000039/disk.local from (pid=7537) execute /usr/lib/pymodules/python2.6/nova/utils.py:150
2011-04-05 17:15:49,245 INFO nova.virt.libvirt_conn [-] instance instance-00000039: injecting net into image 6
2011-04-05 17:15:49,256 DEBUG nova.utils [-] Running cmd (subprocess): sudo qemu-nbd -c /dev/nbd15 /var/lib/nova/instances/instance-00000039/disk from (pid=7537) execute /usr/lib/pymodules/python2.6/nova/utils.py:150
2011-04-05 17:15:50,287 DEBUG nova.utils [-] Running cmd (subprocess): sudo kpartx -a /dev/nbd15 from (pid=7537) execute /usr/lib/pymodules/python2.6/nova/utils.py:150
2011-04-05 17:15:50,302 DEBUG nova.utils [-] Running cmd (subprocess): sudo qemu-nbd -d /dev/nbd15 from (pid=7537) execute /usr/lib/pymodules/python2.6/nova/utils.py:150
2011-04-05 17:15:50,317 WARNING nova.virt.libvirt_conn [-] instance instance-00000039: ignoring error injecting data into image 6 (Mapped device was not found (we can only inject raw disk images): /dev/mapper/nbd15p1)
2011-04-05 17:15:52,252 DEBUG nova.virt.libvirt_conn [-] instance instance-00000039: is running from (pid=7537) spawn /usr/lib/pymodules/python2.6/nova/virt/libvirt_conn.py:570
2011-04-05 17:15:52,494 DEBUG nova.virt.libvirt_conn [-] instance instance-00000039: booted from (pid=7537) _wait_for_boot /usr/lib/pymodules/python2.6/nova/virt/libvirt_conn.py:581
##########

The second instances boots, and gets an IP, but I cannot ping or SSH to that instance. When I connect to the console via VNC, I get "Booting from Hard Disk... Boot failed: not a bootable disk -- No bootable device."

Now the first instance I had working is also hosed. From here on out, I cannot create any instances that 'boot', even though 'novatools' says they are up and have IP's:

root@colo07:/var/lib/glance/images# nova list
+----+------------------------+--------+-----------+---------------+
| ID | Name | Status | Public IP | Private IP |
+----+------------------------+--------+-----------+---------------+
| 46 | Tester01 | ACTIVE | | 184.106.53.9 |
| 47 | Tester02 | ACTIVE | | 184.106.53.10 |
| 48 | Tester03 | ACTIVE | | 184.106.53.11 |
| 49 | Tester04 | ACTIVE | | 184.106.53.12 |
| 55 | jason-test | ACTIVE | | 184.106.53.13 |
| 56 | jason-test-fedora-root | ACTIVE | | 184.106.53.14 |
| 57 | Tester05 | ACTIVE | | 184.106.53.15 |
+----+------------------------+--------+-----------+---------------+

One thing that I saw was " /dev/mapper/nbd15p1" being in more than one place. It appears this spot is not being released, and all new instances try to use this location? Not sure if that is useful, just an observation.

Any idea what could be causing this behavior?

Cheers

Revision history for this message
Brian Lamar (blamar) wrote :

Hey Wayne, there is a flag in Nova "max_nbd_devices" which, by default is set to 16. This is curiously the same as the nbd15 device mentioned.

Although I can't give you a fix because I don't know, well, a lot... :) but I can throw out some suggestions/questions:

1) How many VMs / disks do you have running right now? (it looks like only about 7?)

2) qemu-nbd, while in use, should put a pid file in /sys/block/nbdX/, so can you see if all of your /sys/block/nbdX/pid files exist? Something like `find /sys/block/nbd*/ -maxdepth 1 | grep pid` might work.

3) One work-around for you, while this is looked at is to try and increase the number of NBD's on your system. The only way I know of is to do a `modprobe -r nbd` and then `modprobe nbd nbds_max=32` (or some higher number). NOTE: I have absolutely no idea what this will do to all of your disks. You'd also have to increase the --max_nbd_devices flag to 32 in /etc/nova.conf.

Revision history for this message
Wayne A. Walls (wayne-walls) wrote :

Evening, Brian!

I'll give this a try in the AM, thanks for the quick reply! Sounds promising...

On your questions/suggestions

1) Anywhere from 1 to 20. Anytime I get past one booted though, it all goes to hell.
2) Will try
3) Not production, so I don't mind taking some risks :D

Cheers

Revision history for this message
Wayne A. Walls (wayne-walls) wrote :

Greetings!

I tried your suggestions, and am still unable to get new instances booted properly. Here is the latest from nova-compute.log:

##########
2011-04-06 10:21:54,408 DEBUG nova.utils [-] Running cmd (subprocess): sudo qemu-nbd -c /dev/nbd31 /var/lib/nova/instances/instance-0000003c/disk from (pid=1243) execute /usr/lib/pymodules/python2.6/nova/utils.py:150
2011-04-06 10:21:54,437 DEBUG nova.utils [-] Result was 1 from (pid=1243) execute /usr/lib/pymodules/python2.6/nova/utils.py:166
2011-04-06 10:21:54,438 WARNING nova.virt.libvirt_conn [-] instance instance-0000003c: ignoring error injecting data into image 6 (Unexpected error while running command.
Command: sudo qemu-nbd -c /dev/nbd31 /var/lib/nova/instances/instance-0000003c/disk
Exit code: 1
Stdout: ''
Stderr: "qemu-nbd: Could not access '/dev/nbd31': No such file or directory\n")
2011-04-06 10:21:56,152 DEBUG nova.virt.libvirt_conn [-] instance instance-0000003c: is running from (pid=1243) spawn /usr/lib/pymodules/python2.6/nova/virt/libvirt_conn.py:570
2011-04-06 10:21:56,367 DEBUG nova.virt.libvirt_conn [-] instance instance-0000003c: booted from (pid=1243) _wait_for_boot /usr/lib/pymodules/python2.6/nova/virt/libvirt_conn.py:581
##########

It's a little different error this time around, but still the same result.

Ideas?

Revision history for this message
Wayne A. Walls (wayne-walls) wrote :

From dmesg:

[ 1624.511795] nbd15: NBD_DISCONNECT
[ 1624.511885] nbd15: Receive control failed (result -32)
[ 1624.511987] nbd15: queue cleared
[ 1625.084450] type=1400 audit(1302104870.879:26): apparmor="DENIED" operation="open" parent=11552 profile="/usr/lib/libvirt/virt-aa-helper" name="/var/lib/nova/instances/_base/0000000d_sm" pid=12027 comm="virt-aa-helper" requested_mask="r" denied_mask="r" fsuid=0 ouid=104
[ 1625.168959] type=1400 audit(1302104870.959:27): apparmor="STATUS" operation="profile_load" name="libvirt-1ca693e4-4c88-029d-0a0f-03b68a225e69" pid=12028 comm="apparmor_parser"
[ 1625.833724] device vnet2 entered promiscuous mode
[ 1625.835629] br100: port 4(vnet2) entering forwarding state
[ 1625.835632] br100: port 4(vnet2) entering forwarding state
[ 1635.970810] vnet2: no IPv6 routers present

I was reading this post as well, https://answers.launchpad.net/nova/+question/141960, but it looks like that was merged long ago. Anyone aware of a fix?

Revision history for this message
Wayne A. Walls (wayne-walls) wrote :

So this appears to have been solved, but by going back to a different method to upload images with glance:

When I use this method to upload images, I get errors/no boot:
glance add name='WayneUEC' is_public=true < /var/lib/glance/images/lucid2/ubuntu1010-UEC-localuser-image.tar.gz

When I use these, everything works as expected:
glance-upload --disk-format=ari --container-format=ari --type=ramdisk maverick-server-uec-amd64-loader ramdisk
glance-upload --disk-format=aki --container-format=aki --type=kernel maverick-server-uec-amd64-vmlinuz-virtual kernel
glance-upload --disk-format=ami --container-format=ami --type=machine --ramdisk=16 --kernel=17 maverick-server-uec-amd64.img machine

When I check out /var/lib/nova/instances/_base after the upload, and check out the files with qemu-img info, I get some radically different bits of information. It appears that the first command is not understanding how to take out the ramdisk/kernel and ready it for an image.

I believe this should be filed as a glance bug, unless there is something nova is doing/changed how it reads qcow images? Shrug...

Cheers

Revision history for this message
Jay Pipes (jaypipes) wrote :

I'm a little unsure why you are manually doing these uploads to Glance? Could you explain why you are doing that? The XS plugin calls glance-upload and manages the image properties it needs to track.

Revision history for this message
Wayne A. Walls (wayne-walls) wrote :

I'm not using XS, this is on Ubuntu 10.10 w/ KVM.

Revision history for this message
Jay Pipes (jaypipes) wrote :

I still don't understand why any manual use of Glance is required.

Revision history for this message
Wayne A. Walls (wayne-walls) wrote :

Hmm, I guess I don't understand what you don't understand. Can you glance-a-fy me and explain what I should be doing?

Revision history for this message
Jay Pipes (jaypipes) wrote :

This:

glance add name='WayneUEC' is_public=true < /var/lib/glance/images/lucid2/ubuntu1010-UEC-localuser-image.tar.gz

Creates 1 image in Glance called WayneUEC and it is a public image.

This:

glance-upload --disk-format=ari --container-format=ari --type=ramdisk maverick-server-uec-amd64-loader ramdisk
glance-upload --disk-format=aki --container-format=aki --type=kernel maverick-server-uec-amd64-vmlinuz-virtual kernel
glance-upload --disk-format=ami --container-format=ami --type=machine --ramdisk=16 --kernel=17 maverick-server-uec-amd64.img machine

Creates 3 images in Glance.

It's entirely different... so I'm wondering what you are expecting the first call to glance add to result in?

-jay

Revision history for this message
Wayne A. Walls (wayne-walls) wrote :

Ah, I see...that makes more sense to me...thanks. So I was expecting "glance add name='WayneUEC' is_public=true < /var/lib/glance/images/lucid2/ubuntu1010-UEC-localuser-image.tar.gz" to be like "uec-publish-tarball image.tgz" and give me a ready to boot image.

Now that I know this is not the case, so do I have to run the 'glance add' command three times (like glance-upload) to get the desired outcome?

Is there any discussion around glance having an uec-publish-tarball-esqe upload function?

Cheers

Revision history for this message
Jay Pipes (jaypipes) wrote :

"Now that I know this is not the case, so do I have to run the 'glance add' command three times (like glance-upload) to get the desired outcome?"

No. Use glance-upload, since that's the tool that was designed to deal with Glance with the EC2/eucatools way of bundling images. :)

"Is there any discussion around glance having an uec-publish-tarball-esqe upload function?"

No, but that's not to say you can't propose it for Diablo. ;)

I'd prefer not to have Glance's tools and interfaces be beholden to the EC2 way of doing things, though... I think the existing euca-bundle-image/euca-upload-bundle is overly complicated due to it's insistence on assuming the image store is S3.
-jay

Revision history for this message
Wayne A. Walls (wayne-walls) wrote :

Appears that this bug has been discussed, and decided that this is the real issue --> https://bugs.launchpad.net/glance/+bug/754221

Thanks for everyone's input!

cheers

Changed in nova:
status: New → Invalid
Revision history for this message
Maximiliano Caceres (ussarg-maxkcr) wrote :

I have the same problem
but I do not understand
I'm using euca-bundle-image, euca-upload-image, euca-register and euca-run-instance from a computer in a network to publish and run my images. How do I do this using glance?

Revision history for this message
Jay Pipes (jaypipes) wrote :

Hi Maximiliano!

What API are you using for Nova? EC2 or OpenStack?

Thanks!
jay

Revision history for this message
Maximiliano Caceres (ussarg-maxkcr) wrote :

I'm using EC2

Revision history for this message
Jay Pipes (jaypipes) wrote :

If you are using EC2, you do not need to do anything special other than making sure in your nova.conf, you set this:

--image_service=nova.image.glance.GlanceImageService
--glance_host=<GLANCE_API_HOST_IP>
--glance_port=<GLANCE_API_HOST_PORT>

Just make sure GLANCE_API_HOST_IP and GLANCE_API_HOST_PORT above are replaced with the appropriate IP address and port where you are running the Glance API server.

I wrote a lengthier explanation of the things that are happening "behind the scenes" in this Question: https://answers.launchpad.net/glance/+question/154793. You may find some of that explanation useful.

Cheers!
jay

Revision history for this message
Maximiliano Caceres (ussarg-maxkcr) wrote :

thanks!
but I had a new doubt...
How do I upload and register my images? what is the best method? and what that file format?

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.