Uploading and downloading VHDs via Glance XenAPI plugin doesn't always retry when it should

Bug #1380776 reported by Jesse J. Cook
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Jesse J. Cook

Bug Description

Encountered a situation where one glance node could not talk to the registry, which resulted in a high number of upload_vhd errors. The Glance XenAPI plugin doesn't properly differentiate between server-permanent errors (permanent only for the server that returned them) and globally permanent errors (permanent no matter which server is tried); it treats both as non-retryable. This is only reasonable behavior when there is a single glance node. With many glance nodes, retrying on a different server is preferable.

Ideally:

Retry until:
1. A non-retryable error is encountered (e.g. 403)
2. Max retries is reached
3. No servers left to retry (i.e. every server was dropped from the retry list due to a permanent error)

If the glance nodes sit behind a load balancer (proxy), this approach could result in the LB being treated as a single glance endpoint, so server errors would not be retried. Conversely, retrying on server errors without dropping the failing servers from the list could result in unnecessary retries, especially when there is only a single glance node.

Additionally, if multiple errors are encountered, only the last error is logged as an instance error. Every error should be recorded.
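The ideal behavior described above can be sketched roughly as follows. This is only an illustration, not the plugin's actual code: the exception classes, `upload_with_retries`, `attempt`, and `record_fault` are hypothetical names, and it assumes that `max_retries` counts retries after the first attempt.

```python
from collections import deque


class GloballyPermanentError(Exception):
    """Hypothetical: no server can satisfy the request (e.g. 403)."""


class ServerPermanentError(Exception):
    """Hypothetical: this server is failing, another may work (e.g. 500)."""


class EphemeralError(Exception):
    """Hypothetical: transient failure that may clear on a later attempt."""


class UploadFailed(Exception):
    """Hypothetical: raised when the retry policy gives up."""


def upload_with_retries(servers, attempt, max_retries, record_fault):
    """Call attempt(server) until it succeeds or a stop condition is hit.

    record_fault(server, exc) is called for every error, not only the last.
    Assumption: max_retries is the number of retries after the first try.
    """
    candidates = deque(servers)
    attempts = 0
    while candidates and attempts <= max_retries:
        server = candidates[0]
        attempts += 1
        try:
            return attempt(server)
        except GloballyPermanentError as exc:
            # Stop condition 1: retrying anywhere cannot help.
            record_fault(server, exc)
            raise
        except ServerPermanentError as exc:
            # Drop the failing server so it is never tried again.
            record_fault(server, exc)
            candidates.popleft()
        except EphemeralError as exc:
            # Keep the server, but move on to the next one first.
            record_fault(server, exc)
            candidates.rotate(-1)
    # Stop condition 3 (no servers left) or 2 (max retries reached).
    reason = "no servers left" if not candidates else "max retries reached"
    raise UploadFailed(reason)
```

With three servers where the first raises `EphemeralError`, the second raises `ServerPermanentError`, and the third succeeds, this sketch records two faults and returns the third server's result. With a single server (or a load balancer presented as one endpoint), a `ServerPermanentError` empties the list immediately, which is the trade-off noted above.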

Examples:

Current:

* The plugin tries to upload using 1 of n glance nodes (n > 1)
* An ephemeral (retryable) error is encountered
* The plugin retries using a different glance node
* An error related to a server fault (e.g. 500) is encountered
* The plugin does not retry
* Instance fault

Expected:

* The plugin tries to upload using 1 of n glance nodes (n > 1)
* An ephemeral (retryable) error is encountered
* Instance fault
* The plugin retries using a different glance node
* An error related to a server fault (e.g. 500) is encountered
* The plugin retries using a different glance node
* Success

Changed in nova:
assignee: nobody → Jesse J. Cook (jesse-j-cook)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/128090

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/129327

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/129327
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=632a034c9a12951eecea10346cc52b3135c5ce1f
Submitter: Jenkins
Branch: master

commit 632a034c9a12951eecea10346cc52b3135c5ce1f
Author: Jesse J. Cook <email address hidden>
Date: Fri Oct 17 11:44:58 2014 -0500

    xenapi: upload/download params consistency change

    This patch updates the params for upload_image and download_image to be
    consistent.

    Change-Id: Ie6dbdb096624fb06a4fc29a461c3569e48df3ec0
    Partial-Bug: 1380776

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/128090
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cc2a34ce059d235e8da066049f72b6b427a83978
Submitter: Jenkins
Branch: master

commit cc2a34ce059d235e8da066049f72b6b427a83978
Author: Jesse J. Cook <email address hidden>
Date: Mon Oct 13 14:08:52 2014 -0500

    update retryable errors & instance fault on retry

    HTTP errors can be split into a few categories: client ephemeral, server
    ephemeral, server permanent, and globally permanent. You could argue
    there are even more permutations. However, for simplicity, these errors
    can be viewed as ephemeral and permanent.

    The Glance XenAPI plugin has been updated to raise a RetryableError for
    ephemeral and unexpected (i.e. errors that are categorized as neither
    permanent nor ephemeral) errors.

    Additionally, an instance fault will be logged every time an error
    occurs. This will serve as transaction history where every attempt is a
    transaction on the state. If an ephemeral error occurs, there is a
    retry, then a permanent error, the history of each error will be in the
    instance_faults.

    Deployers should configure the num_retries relative to the number of
    api_servers. Right now, servers are put in random order and then cycled.
    Trying the same server multiple times could cause unnecessary load and
    delays. However, trying multiple servers is ideal when a single server
    is behaving badly or cannot reach another server it needs to communicate
    with to fulfill a request.

    Closes-Bug: 1380776

    Change-Id: I267a5b524c3ff8a28edf1a2285b77bb09049773c
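
The advice in the commit message above about sizing num_retries relative to api_servers can be illustrated with a small sketch. It is not nova's implementation: `attempt_order` is a hypothetical helper, and it assumes that the total number of attempts is num_retries + 1 and that servers are shuffled once and then cycled, as the message describes.

```python
import itertools
import random


def attempt_order(api_servers, num_retries, rng=random):
    """Return the servers that would be tried, in order.

    Assumptions (not confirmed by the source): one initial attempt plus
    num_retries retries, over a list shuffled once and then cycled.
    """
    shuffled = list(api_servers)
    rng.shuffle(shuffled)
    # cycle() repeats the shuffled list; islice() caps the attempt count.
    return list(itertools.islice(itertools.cycle(shuffled), num_retries + 1))
```

Under these assumptions, three servers with num_retries = 2 yields each server exactly once, while num_retries = 4 makes the first two servers in the shuffled order get tried a second time, which is the repeated load the commit message warns about.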

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-1 → 2015.1.0