Glance + SSL - Image download errors
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Glance | | Undecided | Unassigned |
Bug Description
Hello,
I have the latest stable Havana (2013.2.3) OpenStack setup, and I occasionally see problems when new backing files for VMs are downloaded to compute nodes. I occasionally end up with VMs that are stuck spawning. Upon investigation I can see the backing file under /var/nova/
I have managed to create some scripts that replicate the issue in multiple ways. The image files that I have been testing with are 8.8 GB, 8.6 GB, and a large 60 GB image (however, another large 8 GB image would also reproduce the issue).
The first script: https:/
It takes the image files that you give it and deploys one VM per image file to the compute node that you specify. With SSL enabled, typically only 1 VM ever boots successfully. Errors here range from failed image downloads (md5sum mismatches) to backing files that are only partially downloaded. To narrow down the issue, I switched to using the glance client for image downloads.
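To tell the two failure modes apart (md5sum mismatch versus a partially downloaded file), a check along the following lines may help. This is only a minimal sketch and not part of the scripts linked in this report: the function names `md5_of_file` and `classify_download` are hypothetical, and it assumes the expected size and MD5 checksum have already been read from the image's glance metadata.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify_download(path, expected_size, expected_md5):
    """Classify a downloaded file as 'missing', 'truncated', 'mismatch', or 'ok'.

    expected_size (bytes) and expected_md5 (hex string) are assumed to come
    from the image metadata; this helper does not talk to glance itself.
    """
    if not os.path.exists(path):
        return "missing"
    actual_size = os.path.getsize(path)
    if actual_size < expected_size:
        # Fewer bytes than advertised: the transfer stopped early.
        return "truncated"
    if actual_size != expected_size or md5_of_file(path) != expected_md5.lower():
        # Right length (or longer) but wrong content.
        return "mismatch"
    return "ok"
```

Checking the size first avoids hashing a multi-gigabyte file that is already known to be incomplete.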
The second script: https:/
It takes the images specified on the command line and runs the glance image-download command for each one in a parallel bash subshell. This script removes nova from the mix. However, the errors seen here are the same as those I saw with the first script.
The third script: https:/
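For anyone who wants to repeat the parallel-download test without the linked script, the idea can be approximated as below. This is a Python stand-in rather than the original bash script, and the helper names `glance_download_command` and `download_in_parallel` are hypothetical. It assumes the `glance` client is on the PATH and that the usual OS_* credentials are set in the environment.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def glance_download_command(image_id, dest_path):
    """Build the glance client command line for one image download."""
    return ["glance", "image-download", "--file", dest_path, image_id]


def download_in_parallel(jobs, build_command=glance_download_command):
    """Start one download per (image_id, dest_path) job at the same time.

    Returns a dict mapping each image_id to the exit code of its command.
    """
    if not jobs:
        return {}

    def run(job):
        image_id, dest_path = job
        completed = subprocess.run(build_command(image_id, dest_path))
        return image_id, completed.returncode

    # One worker thread per job so every download is in flight concurrently.
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return dict(pool.map(run, jobs))
```

Calling `download_in_parallel` with three `(image_id, dest_path)` pairs would correspond to the three concurrent transfers described in this report. A zero exit code only means the client did not report an error, so the resulting files still need a size and checksum comparison.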
Uses: https:/
With all the scripts, and after a lot of testing, I have found that this issue is 100% reproducible when downloading 3 images at the same time. I have also noticed in production that it happens when downloading only a single image on a compute node.
Kris Lindgren (klindgren) wrote : | #1 |
Kris Lindgren (klindgren) wrote : | #2 |
I offloaded SSL to haproxy for glance-api and disabled SSL in glance-api.conf. After doing this I was able to download glance images with the stock glance client without any issue, so the problem appears to be on the glance server side. I have attached the output of a pip freeze to make it easy to see which Python module versions are installed. As I mentioned above, I updated to the latest versions of eventlet, greenlet, pyOpenSSL, and cryptography, and the issue still persisted.
Mark Washenberger (markwash) wrote : | #3 |
Hi Kris,
Thanks for this report. I've been trying to reproduce this issue on devstack. It takes a little work, but it can be done if you modify the glance endpoints in the keystone endpoints table, create a key and cert, and reconfigure glance to use them (cert_file and key_file).
I'm not able to reproduce the problem using latest master, but I've only tested with 1 GB images.
Can you try again with the latest code, and possibly try to reproduce the problem on devstack?
Changed in glance: | |
status: | New → Incomplete |
Kris Lindgren (klindgren) wrote : | #4 |
Mark,
I haven't had time to work on this any further. Since we moved SSL termination to another layer, everything has been working fine. We upgraded to Icehouse, and with SSL termination in haproxy it is still working without any issues. Sadly, I won't have any time in the foreseeable future to revert the changes in any of my environments to test whether the issue still exists. The only suggestion I have is to use a larger image size. The issue was extremely reproducible with 3 concurrent transfers and a number of workers configured (in our case 40).
Barrow Kwan (barrowkwan) wrote : | #5 |
I am having the same problem on Juno. When I turned on SSL and downloaded an 80G image, the download stopped at around 64G. I tried other images smaller than 5G and they were fine. As soon as I turn off SSL, I can download the 80G image without problems.
I should add some more detail about our setup. SSL is not being offloaded in any environment and is handled by the glance-api and glance-registry services. We increased the number of workers to 40 to better handle multiple downloads and the SSL overhead. In production we are using F5s or A10s for load balancing; in our dev/test/stage environments we are using haproxy. The issue exists in all environments. Also, in testing, the number of glance-api servers we had in rotation did not matter. To simplify troubleshooting, I disabled glance-api on all but one server, so most of the testing was done from a single compute node using multiple clients against a single glance-api instance (with 40 workers). For additional detail, I am running CentOS 6.5, and I have already tried upgrading eventlet, greenlet, pyOpenSSL, and pycryptography to their latest versions on both the client and the server, and it did not help.
If we turn off SSL in glance-api and the client, the 3 simultaneous downloads work without issue.