Autoscaling doesn't detect failed instance creation

Bug #1192125 reported by Steven Hardy
This bug affects 1 person
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Thomas Herve

Bug Description

When nova fails to create an instance, autoscaling doesn't propagate the error. It adds the instance to the instance list in the DB anyway, so you have no chance to retry creating the instance (e.g. by triggering another scaling event/alarm).

So our instance list in the DB is wrong, and inconsistent with what actually exists in nova.
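
To illustrate the behaviour being asked for, here is a minimal sketch. It is not Heat's code: `ScalingGroup`, `ScalingError` and the `create_instance` callable are hypothetical names made up for this example. The idea is only that a member is recorded after its creation has succeeded, and that any failure is reported to the caller rather than suppressed.

```python
class ScalingError(Exception):
    """Raised when one or more group members could not be created."""


class ScalingGroup(object):
    """Hypothetical stand-in for an autoscaling group (not Heat's class)."""

    def __init__(self, create_instance):
        # create_instance is an assumed callable(name) that raises on failure.
        self._create_instance = create_instance
        # Names of members that are known to have been created.
        self.instances = []

    def scale_up(self, names):
        failures = []
        for name in names:
            try:
                self._create_instance(name)
            except Exception as exc:
                # Remember the failure, but do not record the member.
                failures.append("%s: %s" % (name, exc))
            else:
                # Only members whose creation succeeded end up in the list.
                self.instances.append(name)
        if failures:
            # Propagate the error so the caller knows the event failed.
            raise ScalingError("; ".join(failures))
```

With something along these lines, a member that failed is absent from the list, so a later scaling event can try to create it again.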

This also creates bad entries in the haproxy.cfg on the loadbalancer:

        backend servers
            balance roundrobin
            option http-server-close
            option forwardfor
            option httpchk
            timeout check 5s
            server server1 0.0.0.0:80 check inter 30s fall 5 rise 3
            server server2 0.0.0.0:80 check inter 30s fall 5 rise 3
            server server3 0.0.0.0:80 check inter 30s fall 5 rise 3

which obviously isn't going to work (we actually create a bad IP for the first server too, but I'll raise a separate bug for that)
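
As a rough sketch of how the bad `server` entries could be avoided on the config side, the function below builds the backend lines only for members that have a usable address. This is an assumption for illustration, not how Heat's LoadBalancer resource generates haproxy.cfg; `backend_server_lines` and its parameters are hypothetical names.

```python
def backend_server_lines(addresses, port=80):
    """Build haproxy 'server' lines, skipping members with no usable IP.

    addresses is an assumed list of IP strings; None, '' and '0.0.0.0'
    are treated as "instance was never created".
    """
    lines = []
    for addr in addresses:
        if not addr or addr == "0.0.0.0":
            # No real address, so this member must not be balanced to.
            continue
        lines.append(
            "server server%d %s:%d check inter 30s fall 5 rise 3"
            % (len(lines) + 1, addr, port)
        )
    return lines
```

This only hides the symptom in the generated config. The instance list in the DB would still need to reflect what actually exists in nova.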

Here's an example in engine.log of things going wrong - note the nova error, then we go ahead and update the LoadBalancer anyway...

2013-06-18 11:50:48.158 2992 ERROR heat.engine.resource [-] CREATE : GroupedInstance "WebServerGroup-1"
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource Traceback (most recent call last):
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 333, in _do_action
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource handle_data = handle()
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/instance.py", line 299, in handle_create
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource image_id = self._get_image_id(image_name)
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/instance.py", line 482, in _get_image_id
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource image_list = self.nova().images.list()
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/v1_1/images.py", line 55, in list
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource return self._list('/images%s%s' % (detail, query), 'images')
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 62, in _list
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource _resp, body = self.api.client.get(url)
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 230, in get
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource return self._cs_request(url, 'GET', **kwargs)
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 217, in _cs_request
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource **kwargs)
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 199, in _time_request
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource resp, body = self.request(url, method, **kwargs)
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 193, in request
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource raise exceptions.from_response(resp, body, url, method)
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource ClientException: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-0ea6041d-3f4f-4071-b2c9-d9d84d0eb2fe)
2013-06-18 11:50:48.158 2992 TRACE heat.engine.resource
2013-06-18 11:50:48.206 2992 DEBUG heat.engine.scheduler [-] Task <functools.partial object at 0x3448578> complete step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:163
2013-06-18 11:50:48.217 2992 DEBUG heat.engine.scheduler [-] Task _scale from AutoScalingGroup "WebServerGroup" running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:157
2013-06-18 11:50:48.233 2992 INFO heat.engine.resource [-] updating LoadBalancer "ElasticLoadBalancer"

Steven Hardy (shardy)
Changed in heat:
importance: Undecided → Critical
status: New → Triaged
Steven Hardy (shardy)
Changed in heat:
milestone: none → havana-2
Zane Bitter (zaneb) wrote :

Assuming this is occurring on a scaling event, rather than stack create/update, we have always suppressed these errors.

Steven Hardy (shardy)
Changed in heat:
assignee: nobody → Thomas Herve (therve)
Steven Hardy (shardy)
Changed in heat:
importance: Critical → High
Steven Hardy (shardy) wrote :

>Assuming this is occurring on a scaling event, rather than stack create/update, we have always suppressed these errors.

May well be the case, doesn't make it any less wrong, but reduced severity to high ;)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/33602

Changed in heat:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/33602
Committed: http://github.com/openstack/heat/commit/cb5dfdf786619ea1d04467774e07c0c4b231ed4d
Submitter: Jenkins
Branch: master

commit cb5dfdf786619ea1d04467774e07c0c4b231ed4d
Author: Thomas Herve <email address hidden>
Date: Wed Jun 19 09:17:48 2013 +0200

    Detect failed instance creation in autoscaling

    Wait for instances to be created successfully before adding them to the
    list in the Autoscaling resource.

    Fixes: bug #1192125
    Change-Id: Ie8676c23de5a62d3b8b2b4088b67d249ae90ceef

Changed in heat:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in heat:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in heat:
milestone: havana-2 → 2013.2