Glance + SSL - Image download errors

Bug #1340993 reported by Kris Lindgren on 2014-07-11
This bug affects 2 people
Affects: Glance
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hello,

I have the latest stable Havana (2013.2.3) OpenStack setup, and I occasionally see issues when compute nodes download new backing files for VMs. I will occasionally end up with VMs that are stuck spawning. Upon investigation I can see that the backing file under /var/nova/instances/_base/<sha1 of imageuuid>.part has been created but is only partially downloaded and hasn't been updated in some time (sometimes days). Side note: you are unable to delete a VM in this state successfully. It will always be stuck in deleting until you restart nova-compute on the compute node and perform the delete again.
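For anyone wanting to check a compute node for this symptom, here is a minimal sketch that lists `.part` files that have not been modified recently. The function name `find_stale_parts` and the one-hour default threshold are assumptions made for illustration; they are not part of nova.

```python
import os
import time


def find_stale_parts(base_dir, max_age_seconds=3600, now=None):
    """Return (path, size_bytes, age_seconds) for each .part file in
    base_dir whose last modification is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(base_dir)):
        if not name.endswith(".part"):
            continue
        path = os.path.join(base_dir, name)
        info = os.stat(path)
        age = now - info.st_mtime
        # A download that is still progressing keeps touching the file,
        # so a large age suggests the transfer has stalled.
        if age > max_age_seconds:
            stale.append((path, info.st_size, age))
    return stale
```

Calling `find_stale_parts("/var/nova/instances/_base")` on an affected compute node should return the partially downloaded backing files described above, along with how large they got before stalling.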

I have managed to create some scripts that replicate the issue in multiple ways. The image files that I have been testing with are 8.8 GB, 8.6 GB and a large 60 GB image (however, another image in the 8 GB range would also duplicate the issue).

The first script: https://gist.github.com/krislindgren/fc519aa03d350f42e9e6#file-multiboot-sh

It takes the image files that you give it and deploys one VM per image file to the compute node that you have specified. With SSL enabled, typically only 1 VM will ever boot successfully. Errors here range from failed image downloads (md5sum mismatches) to backing files that are only partially downloaded. To narrow down the issue, I switched over to using the glance client to do image downloads.
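Both failure modes (md5sum mismatch and partial download) can be detected by comparing a downloaded file against the checksum and size that Glance records for the image. The sketch below is one way to do that; `md5_of_file` and `verify_download` are hypothetical helper names, and the expected values are assumed to have been looked up separately (for example from the image's details).

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5, expected_size=None):
    """Return a list of problems found; an empty list means the file matches."""
    problems = []
    if expected_size is not None:
        actual_size = os.path.getsize(path)
        if actual_size != expected_size:
            # A short file points at a truncated (partial) download.
            problems.append(
                "size mismatch: got %d, expected %d" % (actual_size, expected_size)
            )
    actual_md5 = md5_of_file(path)
    if actual_md5 != expected_md5.lower():
        problems.append(
            "md5 mismatch: got %s, expected %s" % (actual_md5, expected_md5)
        )
    return problems
```

Checking the size first makes it easy to tell a truncated transfer apart from a file that is complete but corrupted.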

The second script: https://gist.github.com/krislindgren/fc519aa03d350f42e9e6#file-multi-img-download-sh

It takes the images specified on the command line and runs the glance image-download command for each one in a parallel bash subshell. This script removes nova from the mix. However, the errors seen here are the same as what I have seen with the first script.
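For readers who cannot reach the gist, the approach can be approximated as follows. This is a rough sketch and not the linked script: the helper names are made up for illustration, and it assumes the `glance` CLI is on the PATH with credentials already set in the environment.

```python
import subprocess


def glance_download_command(image_id, dest):
    """Build the glance CLI invocation that saves one image to dest."""
    return ["glance", "image-download", "--file", dest, image_id]


def run_parallel(commands):
    """Start every command at once, then wait for all of them.

    Returns the exit codes in the same order as the commands."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


def download_images(image_ids, dest_dir="."):
    """Download all given images concurrently; return {image_id: exit_code}."""
    commands = [
        glance_download_command(image_id, "%s/%s.img" % (dest_dir, image_id))
        for image_id in image_ids
    ]
    return dict(zip(image_ids, run_parallel(commands)))
```

Passing three image IDs to `download_images` starts three transfers at the same time, which is the concurrency level at which the problem was reproduced. A zero exit code alone does not prove the file is intact, so the results still need a checksum comparison.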

The third script: https://gist.github.com/krislindgren/fc519aa03d350f42e9e6#file-multi-img-download-newclient-sh

It uses https://gist.github.com/krislindgren/fc519aa03d350f42e9e6#file-client-py instead of the glance CLI to download the image. I believe it also uses a different download library. With this client I will usually get 2 successful image downloads (sometimes 3), but the issue still exists.

With all the scripts, and after a lot of testing, I have found that this issue is 100% reproducible when trying to download 3 images at the same time. But I have also noticed in production that this issue happens when only a single image is being downloaded on a compute node.

Kris Lindgren (klindgren) wrote :

I should add some more detail about our setup. SSL is not being offloaded in any environment and is being handled by the glance-api and glance-registry services. We increased the number of workers to 40 to better handle multiple downloads and the SSL overhead. In production we are using F5s or A10s for load balancing; in our dev/test/stage environments we are using haproxy. The issue exists in all environments. Also, in testing, the number of glance-api servers we had in rotation did not matter. To simplify troubleshooting, I disabled glance-api on all but one server, so most of the testing was done from a single compute node using multiple clients against a single glance-api instance (with 40 workers). To add some additional detail, I am running on CentOS 6.5, and I have already tried upgrading eventlet, greenlet, pyOpenSSL, and pycryptography to their latest versions on both the client and the server, and it did not help.

If we turn off SSL in glance-api and the client, then the 3 simultaneous downloads work without issue.

Kris Lindgren (klindgren) wrote :

I offloaded SSL onto haproxy for glance-api and disabled SSL in glance-api.conf. After doing this I was able to successfully download glance images using the stock glance client without any issue. So the problem appears to be on the glance server side of things. I have attached the output of a pip freeze to make it easy to see which Python module versions are installed. As I mentioned above, I updated to the latest versions of eventlet, greenlet, pyOpenSSL, and cryptography, and the issue still persisted.

Mark Washenberger (markwash) wrote :

Hi Kris,

Thanks for this report. I've been trying to reproduce this issue on devstack. It takes a little work, but it is possible if you modify the glance endpoints in the keystone endpoints table, create a key and cert, and reconfigure glance to use them (cert_file and key_file).

I'm not able to reproduce the problem using latest master, but I've only tested with 1 GB images.

Can you try again with the latest code, and possibly try to reproduce the problem on devstack?

Changed in glance:
status: New → Incomplete
Kris Lindgren (klindgren) wrote :

Mark,

I haven't had time to work on this any more. Since we moved SSL termination to another layer, everything is working fine. We upgraded to Icehouse, and with SSL termination in haproxy it's still working without any issues. Sadly, I won't have any time in the foreseeable future to revert the changes in any of my environments to test whether the issue still exists. The only suggestion I would have is to use a larger image size. The issue was extremely reproducible with 3 concurrent transfers and a number of workers configured (in our case 40).

Barrow Kwan (barrowkwan) wrote :

I am having the same problem on Juno. When I turned on SSL and downloaded an 80G image, the download stopped at around 64G. I tried other images smaller than 5G and they were OK. As soon as I turn off SSL, I can download that 80G image fine.
