can not get nova version info on some public clouds, causing nova client hangs

Bug #1491579 reported by Monty Taylor on 2015-09-02
This bug affects 5 people
Affects: python-novaclient
Importance: Critical
Assigned to: Daniel Wallace

Bug Description

(Note: heavily edited now that fixes are posted to give complete picture)

novaclient 2.27.0 introduces version negotiation between the CLI and the server, so that the client can show users only the help and options for features that the cloud they are running against actually supports, based on microversions. This requires that clouds support the version-info GET call.

Much to the surprise of everyone on the development team, sometimes these API calls are blocked or hang. When testing against an upstream DevStack, the following GETs all work (once you have a valid keystone token):

  GET / - returns all versions
  GET /v2 - returns a 300 redirecting to /v2/
  GET /v2/ - returns the version info for /v2/

Many clouds have apparently deployed in such a way that GET /v2 just hangs, for reasons that remain opaque. The novaclient code was written to try that bare URL (which was probably a bug). Adding the trailing slash reduces round trips by one and works on more clouds. Exactly why a number of clouds hang here is unknown.
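The trailing-slash workaround can be sketched as a tiny URL normalizer. This is illustrative only (the helper name is hypothetical, not novaclient's actual code): by requesting /v2/ directly, a spec-compliant cloud serves the version document without ever issuing the 300 redirect that some front ends block or hang on.

```python
# Hypothetical sketch of the trailing-slash workaround described above.
# Requesting /v2/ directly avoids the 300 redirect from /v2, which some
# cloud front ends block or hang on. Not the real novaclient code.
def version_url(endpoint):
    """Return the version endpoint with a guaranteed trailing slash."""
    return endpoint if endpoint.endswith('/') else endpoint + '/'
```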

As of this writing, Rackspace and RunAbove return a 401 when trying to fetch /v2/ or /. That is coming out of the layer in front of Nova, and it is definitely wrong behavior. This has now been reported directly to both organizations; hopefully it will be addressed in the near future.

The fixes applied are as follows:

* https://review.openstack.org/#/c/220111 - this adds the extra '/' that works on more clouds.
* https://review.openstack.org/#/c/220192 - this adds 401 catching

We should and will ensure there are Tempest tests that express the expected behavior here, and will get those into defcore to ensure that we're communicating that this is something we believe clouds must expose to be OpenStack. This whole process has been a series of surprises that things were not working in the field.

Some initial data about the state of various clouds is here -
http://paste.openstack.org/show/442384/ - though it turns out that Monty has misconfigured Dreamhost credentials, so the Dreamhost data there is not valid.

(Note: This bug also exposed the fact that Nova developers don't have credentials to many, if any, public clouds. HP employees have access to their cloud, and RAX employees to theirs (though RAX free cloud access for devs started expiring 6 months ago). There aren't many actively engaged developers from other public clouds. So getting information about how things like clients operate on deployed clouds depends on proxying through individuals like Monty, who have personally opened up accounts on many OpenStack clouds with their own credit cards to attempt to test compatibility.)

Also note, infra's use of HP and RAX as providers meant that this was discovered on those clouds very soon after the novaclient release (which is consumed by nodepool), and that making this code work on both of those clouds was considered a critical bug to get infra working again. If other clouds contributed resources to infra, theirs would also be considered "must work in novaclient" clouds, and compat issues would be found and addressed early.

Matt Riedemann (mriedem) wrote :

The novaclient release version in question is 2.27.0.

Sean Dague (sdague) on 2015-09-02
Changed in nova:
importance: Undecided → Critical
status: New → Confirmed
milestone: none → liberty-rc1
Matt Riedemann (mriedem) on 2015-09-02
Changed in nova:
importance: Critical → High
Changed in python-novaclient:
status: New → Confirmed
no longer affects: nova
Changed in python-novaclient:
importance: Undecided → High
Sean Dague (sdague) on 2015-09-02
Changed in python-novaclient:
importance: High → Critical
Matt Riedemann (mriedem) on 2015-09-02
tags: added: liberty-rc-blocker
Matt Riedemann (mriedem) wrote :

(9:09:00 PM) SlickNik: hi folks — did something recently merge that disallows spaces in nova instance names?
(9:09:14 PM) openstackgerrit: Matt Riedemann proposed openstack/python-novaclient: Update path to subunit2html in post_test_hook https://review.openstack.org/219835
(9:10:04 PM) SlickNik: We're now getting this error when trying to do so:
(9:10:05 PM) mriedem: SlickNik: might be another symptom of https://github.com/openstack/nova/commit/4a18f7d3bafcdbede48500aac389e0a770b8e6a8
(9:10:11 PM) SlickNik: Returning 400 to user: Invalid input for field/attribute name. Value: TEST_2015-09-02 21:39:34.721645_config. u'TEST_2015-09-02 21:39:34.721645_config' does not match '^[a-zA-Z0-9-._]*$'

Matt Riedemann (mriedem) wrote :

Oops, comment 2 is for the wrong bug.

Fix proposed to branch: master
Review: https://review.openstack.org/220111

Changed in python-novaclient:
assignee: nobody → Sean Dague (sdague)
status: Confirmed → In Progress

The fact that multiple deployers are encountering the same issue tells me that it probably wasn't some intentional, malevolent decision on their part. Rather than publicly shaming people, can we work *constructively* to figure out the proper way for operators to resolve this?

Changed in python-novaclient:
assignee: Sean Dague (sdague) → Matt Riedemann (mriedem)

It's hard to know how to counsel operators in solving this, because it's the result of them filtering/blocking access to arbitrary methods in the Nova API. The easy answer is "stop trying to decide which parts of the Nova API are safe to expose" but I have a feeling that wouldn't be well-received. The ultimate answer is probably to get more methods added to the JSON files in http://git.openstack.org/cgit/openstack/defcore/tree/ and stop allowing providers to apply the OpenStack trademarks without complying.

Jonathan LaCour (cleverdevil) wrote :

Our team has been attempting to investigate this report, since we've been publicly called out, but we are having a lot of trouble reproducing it: rather than a concise, clear bug report, we've got about 500 words of snark and name-calling. I find this wholly inappropriate, especially from a member of the TC and a long-standing member of the community. Monty, you should know better, and I don't think we as a community should tolerate this kind of behavior. I fully expect that the hardworking members of the community from RAX, HP, RunAbove, and Auro feel the same.

Would someone who is able to reproduce this issue please create an appropriate ticket with background, steps to reproduce, expected behavior, etc.? We've had no complaints from any customers, so it's difficult to know how widespread an issue this is. But if it is indeed an issue, we'd like to fix it ASAP.

Thanks.

Jonathan LaCour (cleverdevil) wrote :

Jeremy, with all due respect, threatening operators/providers with trademark enforcement is probably *not* the ultimate solution. The ultimate solution is to make OpenStack easy to deploy and manage, and a welcoming community for developers and operators alike. Oh, and to write appropriate, well-tested, repeatable bug reports to help each other resolve issues in a timely manner.

Changed in python-novaclient:
assignee: Matt Riedemann (mriedem) → Sean Dague (sdague)
Jeremy Stanley (fungi) wrote :

DefCore doesn't threaten anyone; it merely adds compliance tests and a future date by which they're expected to pass in order to qualify to continue using the related trademarks. It's also open to discussion and participation from all parts of the community, including deployers/operators/providers, to provide feedback on whether compliance with new requirements is reasonable or even feasible. It's the OpenStack community's answer to ensuring increasing levels of interoperability between independent deployments over time, and as an interoperability bug this is a good fit for that process.

Curtis Collicutt (6-curtis) wrote :

We (auro) didn't specifically decide to block this. We can fix it with a bit more information, but at any rate we'll certainly look into it.

Monty Taylor (mordred) wrote :

Hi Jonathan!

Steps are as follows:

source cloud-credentials.sh

virtualenv oldnovaclient
oldnovaclient/bin/pip install python-novaclient==2.26.0
oldnovaclient/bin/nova list

# You should see a list of the servers in your nova account. Win!

virtualenv newnovaclient
newnovaclient/bin/pip install python-novaclient==2.27.0
newnovaclient/bin/nova list

# If you are broken, you will either experience a 401 or an indefinite hang

HOWEVER

I made a mistake in my testing of all of the cloud providers (this is what you get for running tests by hand) and so my tests of dreamhost were not actually testing dreamhost.

Please accept my most humble apology. I will also be apologizing on the twitters, and anywhere else you'd like.

Sean Dague (sdague) wrote :

We've been working on various workarounds for this one; sorry for not getting back to the bug sooner with details.

The reproduce is as follows:

pip install python-novaclient==2.27.0
nova list
nova version-list

If both commands return quickly and without error, you are probably fine. If "nova list" hangs, there is probably something unexpected in your load balancer.

There will be a 2.27.1 released shortly with some additional workarounds, as well as some Tempest tests inbound that should define pretty narrowly the expected behavior around API version fetching (hopefully within a week). We'll take those to defcore in the next round of capability selection, to ensure that everyone is aware that these bits are expected to be accessible in the API, and that it will be easy to verify with refstack.

If you did get caught by the "nova list" hang, it would also be good to figure out where and why your environment is configured that way, and whether there are incorrect instructions upstream somewhere that might have caused it.

Sean Dague (sdague) on 2015-09-03
summary: - against all sanity, nova needs to work around broken public clouds
+ can not get nova version info on some public clouds, causing nova client
+ hangs
Sean Dague (sdague) on 2015-09-03
description: updated
Jonathan LaCour (cleverdevil) wrote :

Thanks, Sean, for cleaning up this bug report and making it useful! I'm glad to see that DreamHost isn't actually affected.

For what it's worth, DreamHost is happy to provide a free account for validation and testing purposes, if needed, for this or any other issues in the future, provided that any issues are reported in an appropriate fashion.

Mathieu Gagné (mgagne) wrote :

We managed to reproduce the problem in a manner that could explain why it's hanging.

Our public APIs are hosted behind HAProxy (and a firewall) which performs SSL termination.

With GET /v2 (on an SSL URL), the request gets redirected to a non-SSL URL (with the trailing slash added). However, since we do not expose a non-SSL endpoint in any way, our firewall just drops any connection on port tcp/80, leaving the client hanging indefinitely.

Is there a way for nova-api to detect an upstream SSL termination and change the redirect URL accordingly?

Jesse Keating (jesse-keating) wrote :

This sounds like exactly the same thing Cinder ran into https://bugs.launchpad.net/python-cinderclient/+bug/1464160

We hit this with cinder, and now nova, due to our use of haproxy to terminate SSL. Nova thinks it isn't running with SSL so it hands out the wrong redirect.

To fix this, Cinder grew an option to define its own public URL, public_endpoint. http://docs.openstack.org/kilo/config-reference/content/cinder-conf-changes-kilo.html

Mathieu Gagné (mgagne) wrote :

I would much prefer a solution similar to Django:
https://docs.djangoproject.com/en/1.8/ref/settings/#secure-proxy-ssl-header

A special HTTP header (X-Forwarded-Proto) is used to tell the downstream application that upstream SSL termination is performed and that redirections should take that into account.

It offers much more flexibility when multiple DNS names point to the same Nova installation, unlike the proposed public_endpoint config.
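The header-based approach can be sketched in WSGI terms (names here are illustrative, not Nova's actual implementation): when deciding which scheme to put in redirect URLs, trust an X-Forwarded-Proto header set by the SSL-terminating proxy over what WSGI saw directly.

```python
# Sketch (assumed helper, not Nova's actual code) of honoring the
# X-Forwarded-Proto header set by an SSL-terminating proxy, in the
# style of Django's SECURE_PROXY_SSL_HEADER setting mentioned above.
def effective_scheme(environ):
    """Pick the scheme for building redirect URLs.

    If the proxy declared the original protocol via X-Forwarded-Proto,
    trust it; otherwise fall back to what the WSGI server saw directly.
    """
    forwarded = environ.get('HTTP_X_FORWARDED_PROTO')
    if forwarded in ('http', 'https'):
        return forwarded
    return environ.get('wsgi.url_scheme', 'http')
```

Only a header you explicitly strip and re-set at the proxy should be trusted this way, otherwise clients could spoof it.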

Jesse Keating (jesse-keating) wrote :

I would prefer that too. We're already sending that header from the proxy, it's up to Nova to do something appropriate with it.

Mathieu Gagné (mgagne) wrote :

What would it take for this approach to be used across the board in ALL OpenStack services?
Cinder, Glance, and Keystone fixed it in a similar manner (as far as I can see); why are other services not updated as well?

This is the kind of problem I would expect to trigger a cross-project effort to fix it in a consistent manner. Now we are left with no solution (to a known problem) because Nova got left behind due to (what I perceive as) a lack of communication and cross-project collaboration.

Reviewed: https://review.openstack.org/220111
Committed: https://git.openstack.org/cgit/openstack/python-novaclient/commit/?id=c915757ed5e7ce3714eb9f0682d24530112f735e
Submitter: Jenkins
Branch: master

commit c915757ed5e7ce3714eb9f0682d24530112f735e
Author: Sean Dague <email address hidden>
Date: Thu Sep 3 08:19:10 2015 -0400

    Don't assume oscomputeversions is correctly deployed

    The paste pipeline that's "above" the /v2/ and /v2.1/ urls is the
    oscomputeversions dispatcher, which should provide 300 redirects to
    versions in question. Previously the code that was trying to discover
    information about the current version did so via a GET of something
    that looked like /v2. On an upstream compatible OpenStack this
    actually triggers a 300 and redirect to /v2/ for the real answer.

    On deployed public clouds, oscomputeversions is sometimes not deployed
    at all, or front end blocking by the load balancer does
    weirdness. Sometimes with a 401, sometimes a hang.

    This fixes the way that first request is made that should avoid the
    300 redirect entirely. The code probably should have done this
    originally, but it was tested with upstream configuration that worked
    fine in this case.

    This fix is tested against HP Cloud (the only pub cloud I have creds
    with). Before this fix "nova list" hangs trying to get the supported
    versions, after the fix it works as expected.

    Change-Id: I1692380fe8d340e5c044f46dd0b103c7550d2c7d
    Closes-Bug: #1491579

Changed in python-novaclient:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/220192
Committed: https://git.openstack.org/cgit/openstack/python-novaclient/commit/?id=e4b0d46c4b5b99973a7f65c294a9b73c8adfefb7
Submitter: Jenkins
Branch: master

commit e4b0d46c4b5b99973a7f65c294a9b73c8adfefb7
Author: Sean Dague <email address hidden>
Date: Thu Sep 3 11:13:21 2015 -0400

    workaround for RAX repose configuration

    The RAX Repose environment is blocking access to the API version
    information by local policy, returning a 401 before we even get to
    Nova itself. While this in clearly incorrect behavior, it is behavior
    in the field, and we should not break all our users on that cloud.

    This catches the 401 and translates that so that it's the equivalent
    of only supporting v2.0. We will be taking these API calls to defcore,
    and intend to remove this work around once that is done.

    Change-Id: I2072095c24b41efcfd58d6f25205bcc94f1174da
    Related-Bug: #1491579
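The behavior that commit describes can be sketched roughly as follows (the names here are hypothetical, not the actual novaclient code): if the version-document GET is rejected with a 401 before it ever reaches Nova, fall back to treating the cloud as v2.0-only instead of breaking the user.

```python
# Rough sketch of the 401 workaround described in the commit above.
# Names (Unauthorized, discover_version) are illustrative only.
class Unauthorized(Exception):
    """Raised when the front end rejects the request before Nova sees it."""

def discover_version(fetch_version_doc):
    """Try the version-document GET; on 401, assume a v2.0-only cloud."""
    try:
        return fetch_version_doc()
    except Unauthorized:
        # A proxy (e.g. RAX Repose) blocked the call by local policy;
        # degrade gracefully rather than failing every command.
        return '2.0'
```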

Changed in python-novaclient:
milestone: none → 2.28.0
status: Fix Committed → Fix Released
Sean Dague (sdague) wrote :

@mathieu, @jesse - Nova has the ability to set that manually - https://github.com/openstack/nova/blob/c4b2cd90f84c0fb4d2f0dcbf82fc9a2225e8ce56/nova/api/openstack/common.py#L47-L49 which does need to be done if you are SSL terminating, otherwise the redirection fails. That at least explains the /v2 bounce problem.

Long term, I think the right thing to do is have all the services reflect their entries from the service catalog in their redirection documents. This requires standardization efforts around service catalog content. It was already on my agenda for next cycle, but this bug demonstrates that it's important to be done for additional reasons.

Mathieu Gagné (mgagne) wrote :

@sdague The suggested configuration does not affect URLs in redirections.

melanie witt (melwitt) wrote :

Re-opening because the patch for the RAX workaround [1] only handles the Unauthorized response for the keystone session client and not the plain http client, so those not using keystone are still hitting the problem.

[1] https://review.openstack.org/220192

Changed in python-novaclient:
milestone: 2.28.0 → none
status: Fix Released → Confirmed
Changed in python-novaclient:
assignee: Sean Dague (sdague) → Daniel Wallace (danielwallace)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/221570
Committed: https://git.openstack.org/cgit/openstack/python-novaclient/commit/?id=b2221c8a7020ac66197360f594ba58b27434cf7d
Submitter: Jenkins
Branch: master

commit b2221c8a7020ac66197360f594ba58b27434cf7d
Author: Daniel Wallace <email address hidden>
Date: Tue Sep 8 20:13:39 2015 -0500

    Fix bugs with rackspace

    not all apis have the versions available

    Rackspace api does not open up for querying versions. This causes it to
    still break when using the rackspace-auth-plugin

    add tests for api_version Unauthorized

    make sure that the list can still run even if the api_version check is
    unauthorized.

    Closes-Bug: #1493974
    Closes-Bug: #1491579
    Change-Id: I038b84bad5b747a0688aef989f1337aee835b945

Changed in python-novaclient:
status: In Progress → Fix Committed
Changed in python-novaclient:
milestone: none → 2.29.0
status: Fix Committed → Fix Released
Mathieu Gagné (mgagne) wrote :

For those reading this bug and looking for a way to add support for the "X-Forwarded-Proto" in Nova, the secure_proxy_ssl_header config is what you are looking for. It got merged in https://review.openstack.org/#/c/206479/
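Assuming the option follows the usual convention for this kind of setting (check the nova.conf reference for your release before relying on it), a deployment terminating SSL at the proxy would then set something along these lines:

```ini
[DEFAULT]
# Trust the protocol header injected by the SSL-terminating proxy so that
# nova-api builds redirect URLs with the original (https) scheme.
secure_proxy_ssl_header = HTTP_X_FORWARDED_PROTO
```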
