Instance groups should intelligently downsize based on reality
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Heat | Triaged | Medium | Anant Patil |
Bug Description
If I have a template making use of instance groups (auto scaled or not), when I reduce the count, I would expect some resources to be deleted.
However, if I have already deleted some of those manually, I would not expect an arbitrary resource to be deleted. Instead, Heat should check the resources and remove its references to the ones that are already gone.
This is related to convergence, but it solves a chicken/egg problem with convergence. In convergence, the stack definition will be compared to reality, and anything missing from reality will be handled. But that only works for resurrecting dead resources. In this case, if I update the stack with a lower count it will delete the newest resources before I've had a chance to run convergence.
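To make the expected behaviour concrete, here is a minimal sketch of the selection logic, not Heat's actual implementation. The function name `plan_downsize` and its arguments are hypothetical: `members` is the group's member IDs ordered oldest first, and `alive` is the set of IDs that still exist in reality. Stale references are dropped first, and live resources (newest first) are only deleted if the group is still too large.

```python
def plan_downsize(members, alive, target_size):
    """Decide what to remove when shrinking a group to target_size.

    members: list of member IDs, oldest first (hypothetical representation).
    alive: set of member IDs that still exist outside of Heat.
    Returns (stale, to_delete): references to drop without deleting
    anything, and live members that really have to be deleted.
    """
    if target_size < 0:
        raise ValueError("target_size must be non-negative")

    # Number of members the group has to lose to reach the target.
    excess = max(len(members) - target_size, 0)

    # Members that were already deleted out-of-band come first.
    stale = [m for m in members if m not in alive][:excess]

    # Only if that is not enough, delete live members, newest first.
    remaining = excess - len(stale)
    live_newest_first = [m for m in reversed(members) if m in alive]
    to_delete = live_newest_first[:remaining]

    return stale, to_delete
```

With five members where `c` was deleted manually, reducing the count to four would only drop the reference to `c`. Reducing it to three would additionally delete the newest live member.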
Changed in heat:
status: New → Triaged
importance: Undecided → Medium
Changed in heat:
assignee: nobody → Anant Patil (ananta)
tags: added: autoscaling fixed-by-convergence
Changed in heat:
milestone: none → no-priority-tag-bugs
The problem is that Heat is not aware of any out-of-band events. Periodically polling nova to retrieve instance status is not a good idea, so I have tried to detect these kinds of events using ceilometer. Instance lifecycle events are reported upwards from the hypervisor to nova, and nova generates notifications that ceilometer can receive. The only missing link is to have ceilometer notify Heat about these events, e.g. compute.instance.delete.end.
There are two barriers to getting this done:
1. Ceilometer support for alarms based on events (rather than meters). There has been work on this, but it was abandoned. I'm trying to resurrect that work.
2. Heat does not expect this kind of alarm, so it won't subscribe to them. When such an event does happen, we need proper authentication to invoke resource_signal, because we have neither an alarm_url nor domain user credentials.
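Assuming both barriers were resolved, the Heat-side handling could look roughly like the sketch below. It is only an illustration and uses no real Heat or ceilometer API: `handle_notification` and the `group_members` mapping are hypothetical, and the `instance_id` payload key is an assumption about what the nova notification carries.

```python
DELETE_EVENT = "compute.instance.delete.end"


def handle_notification(event_type, payload, group_members):
    """Drop a deleted instance from whichever group references it.

    group_members: dict mapping a group name to a set of instance IDs
    (hypothetical stand-in for Heat's own bookkeeping).
    Returns the name of the group that was updated, or None.
    """
    if event_type != DELETE_EVENT:
        return None

    # Assumed payload key; adjust to the real notification format.
    instance_id = payload.get("instance_id")
    if instance_id is None:
        return None

    for group, members in group_members.items():
        if instance_id in members:
            # Only the reference is removed; the instance is already gone.
            members.discard(instance_id)
            return group
    return None
```

The point of the design is that the group's bookkeeping gets corrected as soon as the event arrives, so a later count reduction does not pick an arbitrary live instance.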