OpenStack Image Registry and Delivery Service (Glance)

Backend storage disconnections do not raise exceptions

Reported by Tom Hancock on 2011-10-27
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Glance
High
Jay Pipes

Bug Description

We'd like to suggest an improvement to how errors in a glance API server's backend are detected.

Consider a configuration where a Nova client node makes a request of a Glance API server which
has a Swift backend. The Glance API server will in turn make a request of the Swift proxy, which will
then make a request of a Swift object server (one of three). Once the object server responds,
a http response code of 200 is optimistically returned from the Swift proxy server to the Glance API server.
Then the data starts to be streamed from the object server in ‘chunks’ to the proxy and from
there to the Glance API server and from there to the Nova client.
If, for any reason, the object server ceases to respond (its overloaded, it crashes and so on)
then the proxy server will observe a ChunkReadTimeout from the object server and will raise
an exception which the wsgi service uses to terminate the connection to the Glance API server.
So far so good.
Remembering that all this is implemented using python iterations and generators, what we
observed is that the iterator in glance/api/v1/images.py:get_from_backend falls off the end when
the connection is terminated. We also observed that this leads to a situation whereby the Nova
node kept an open socket to the Glance API server which never went away, and the Nova operation
stalled. Updating the Glance API server side to detect a disconnect and raise an exception
caused the Glance server to disconnect the Nova client.

The following diff illustrates the change that was required. Inserting it at the get_from_backend()
level means that any underlying backend that disconnects is covered.

diff --git a/glance/api/v1/images.py b/glance/api/v1/images.py
index b80a164..3871389 100644
--- a/glance/api/v1/images.py
+++ b/glance/api/v1/images.py
@@ -19,6 +19,7 @@
/images endpoint for Glance v1 API
"""

+import errno
import httplib
import json
import logging
@@ -210,6 +211,24 @@ class Controller(api.BaseController):
         """
         image = self.get_active_image_meta_or_404(req, id)

+
+ def gen_from_store(image_data, image_meta):
+ bytes_transferred = 0
+ try:
+ for chunk in image_data:
+ yield chunk
+ bytes_transferred += len(chunk)
+ except Exception, e:
+ msg = ("Error getting image: %s") % str(e)
+ logger.error(msg)
+ raise
+ if image_meta['size'] != bytes_transferred:
+ logger.error('Corrupt (short) download of image %s' % image_meta['name'])
+ raise IOError(errno.EPIPE,'Corrupt image download')
+ else:
+ logger.debug('Intact download of image %s' % image['name'])
+
+
         def get_from_store(image_meta):
             """Called if caching disabled"""
             try:
@@ -218,7 +237,7 @@ class Controller(api.BaseController):
                 image_meta["size"] = image_size or image_meta["size"]
             except exception.NotFound, e:
                 raise HTTPNotFound(explanation="%s" % e)
- return image_data
+ return gen_from_store(image_data, image_meta)

         def get_from_cache(image, cache):
             """Called if cache hit"""

Note too that at the Nova end when a ‘short’ or ‘corrupt’ image is then received there should be
a check of the length of the received data AND probably the checksum. This could be in glance
client or Nova code or a combination thereof.

Jay Pipes (jaypipes) wrote :

I'm on it, Tom. Should be pushing a fix shortly based on your code above (has to change slightly because of recent updates, but mostly the same).

-jay

Changed in glance:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Jay Pipes (jaypipes)
milestone: none → essex-1
Jay Pipes (jaypipes) wrote :
Changed in glance:
status: Triaged → In Progress
Jay Pipes (jaypipes) wrote :

Ooops, ignore last comment, wrong bug...

Jay Pipes (jaypipes) wrote :

Hi Tom!

Would you mind pulling https://review.openstack.org/1191 and seeing if that fixes the issues in your tests?

Thanks!
-jay

summary: - glance server behaviour when a swift backend disconnects prematurely
+ Backend storage disconnections do not raise exceptions

Reviewed: https://review.openstack.org/1191
Committed: http://github.com/openstack/glance/commit/6970aa883aaeba8f37cec75f3e07499c57ff702a
Submitter: Jenkins
Branch: master

 status fixcommitted
 done

commit 6970aa883aaeba8f37cec75f3e07499c57ff702a
Author: Jay Pipes <email address hidden>
Date: Fri Oct 28 11:45:57 2011 -0400

    Fixes LP Bug#882585 - Backend storage disconnect

    Adds a checking iterator to wrap the main image_iter
    returned via the show() method on the images controller.
    This checked_iter validates that the image is fully
    transferred from the backend storage and raises an
    IOError if the chunked reads do not transfer the expected
    number of bytes.

    Change-Id: Icf9b303aa8c2c9c728581d59029fd2c29ddb6da5

Changed in glance:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2011-11-17
Changed in glance:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-04-05
Changed in glance:
milestone: essex-1 → 2012.1
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers