nodepool fails to startup when a cloud endpoint is offline

Bug #1281319 reported by Robert Collins
This bug affects 1 person
Affects: OpenStack Core Infrastructure
Status: Fix Committed
Importance: High
Assigned to: James E. Blair

Bug Description

This makes it a problem to use cloud endpoints that aren't always up: if nodepool is restarted while an endpoint is offline, it fails to start and stays down until it is reconfigured.

Tags: nodepool
James E. Blair (corvus)
Changed in openstack-ci:
status: New → Triaged
importance: Undecided → High
tags: added: nodepool
Jeremy Stanley (fungi) wrote :

We've plugged at least one of these (an exception thrown by image listing calls to the provider during scheduled rebuilds), but there are definitely still more. Here's an exception seen when reconfigureManagers was failing due to a provider outage:

2014-02-04 00:00:46,884 ERROR nodepool.NodePool: Exception in main loop:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nodepool/nodepool.py", line 995, in run
    self.reconfigureManagers(config)
  File "/usr/local/lib/python2.7/dist-packages/nodepool/nodepool.py", line 758, in reconfigureManagers
    provider_manager.ProviderManager(p)
  File "/usr/local/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 208, in __init__
    self._extensions = self._getExtensions()
  File "/usr/local/lib/python2.7/dist-packages/nodepool/provider_manager.py", line 239, in _getExtensions
    resp, body = self._client.client.get('/extensions')
  File "/usr/local/lib/python2.7/dist-packages/novaclient/client.py", line 229, in get
    return self._cs_request(url, 'GET', **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/novaclient/client.py", line 202, in _cs_request
    self.authenticate()
  File "/usr/local/lib/python2.7/dist-packages/novaclient/client.py", line 329, in authenticate
    auth_url = self._v2_auth(auth_url)
  File "/usr/local/lib/python2.7/dist-packages/novaclient/client.py", line 411, in _v2_auth
    return self._authenticate(url, body)
  File "/usr/local/lib/python2.7/dist-packages/novaclient/client.py", line 423, in _authenticate
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/novaclient/client.py", line 195, in _time_request
    resp, body = self.request(url, method, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/novaclient/client.py", line 166, in request
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 335, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 438, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 327, in send
    raise ConnectionError(e)
ConnectionError: HTTPSConnectionPool(host='ci-overcloud.tripleo.org', port=13000): Max retries exceeded with url: /v2.0/tokens (Caused by <class 'socket.error'>: [Errno 110] Connection timed out)

Derek Higgins (derekh)
Changed in openstack-ci:
assignee: nobody → Derek Higgins (derekh)
Changed in openstack-ci:
assignee: Derek Higgins (derekh) → James E. Blair (corvus)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nodepool (master)

Reviewed: https://review.openstack.org/74521
Committed: https://git.openstack.org/cgit/openstack-infra/nodepool/commit/?id=93689532297efe792ddd2e79930b46c5aaa82fb1
Submitter: Jenkins
Branch: master

commit 93689532297efe792ddd2e79930b46c5aaa82fb1
Author: Robert Collins <email address hidden>
Date: Wed Feb 19 11:32:28 2014 +1300

    Make nodepool more robust to offline clouds.

    When a cloud is offline we cannot query its flavors or extensions,
    and without those we cannot use a provider manager. Making these
    attributes properties that lazy-initialize fixes the problem (we
    may make multiple queries, but the lookup is idempotent so locking
    is not needed).

    Callers that trigger flavor or extension lookups have to be able to
    cope with a failure propagating up - I've manually found all the
    places I think.

    The catchall in _getFlavors would mask the problem and lead to
    the manager being incorrectly initialized, so I have removed that.

    Startup will no longer trigger cloud connections in the main thread,
    it will all be deferred to worker threads such as ImageUpdate,
    periodic check etc.

    Additionally I've added some belts-and-braces catches to the two
    key methods, launchImage and updateImage. While they don't
    directly interact with a provider manager, they do access the
    provider definition, which I think can lead to occasional skew
    between the DB and the configuration - I'm not /sure/ they are
    needed, but I'd rather be safe.

    Change-Id: I7e8e16d5d4266c9424e4c27ebcc36ed7738bc86f
    Fixes-Bug: #1281319

Changed in openstack-ci:
status: In Progress → Fix Committed
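
For readers who want to see the shape of the fix, here is a minimal sketch of the lazy-initialized property approach the commit message describes. It is not the actual nodepool code: FakeClient, ProviderUnavailable, LazyProviderManager and has_extension are hypothetical names invented for illustration, and the extension names are placeholders. The point is only that the constructor makes no network call, the lookup is retried on each access until it succeeds, and the caller copes with the failure propagating up.

```python
class ProviderUnavailable(Exception):
    """Hypothetical stand-in for the ConnectionError seen in the traceback."""


class FakeClient(object):
    """Hypothetical client used only for this sketch; not novaclient."""

    def __init__(self, online=False):
        self.online = online
        self.calls = 0

    def get_extensions(self):
        self.calls += 1
        if not self.online:
            raise ProviderUnavailable("cloud endpoint is offline")
        return ["os-keypairs", "os-floating-ips"]


class LazyProviderManager(object):
    """Sketch of a manager whose constructor never contacts the cloud."""

    def __init__(self, client):
        # No network call here, so construction succeeds while offline.
        self._client = client
        self._extensions = None

    @property
    def extensions(self):
        # Query on first access. A failed query leaves the cache empty,
        # so the next access retries. Concurrent callers may each query,
        # but the result is the same, so no lock is taken.
        if self._extensions is None:
            self._extensions = self._client.get_extensions()
        return self._extensions


def has_extension(manager, name):
    """Caller that copes with a failed lookup; None means 'unknown'."""
    try:
        return name in manager.extensions
    except ProviderUnavailable:
        return None
```

With this sketch, LazyProviderManager(FakeClient(online=False)) constructs without raising, has_extension() returns None while the endpoint is down, and the same manager starts answering once the client comes back online, with no restart or reconfiguration.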