conductor hash ring is not updated as conductors arrive and leave
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Ironic | Fix Released | High | aeva black | | |
Bug Description
Upon starting a conductor, a HashRingManager is instantiated to allocate nodes to conductor instances:
https:/
When using multiple conductor instances, the HashRing should be updated over time as each conductor becomes active, so that nodes are allocated evenly. Unfortunately, this does not happen, because HashRing.
https:/
As a result, the n-th conductor to start believes that it is responsible for 1/n of the nodes, and never becomes aware of members that join the HashRing later. In practice, this does not cause direct problems, because locking is used for operations where exclusive control of a node is required, but it does result in extra work being performed.
However, should a conductor fail, each conductor's view of the HashRing is not updated, and therefore it is possible for some nodes to not be owned by any conductor.
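The two effects described above can be reproduced with a toy model. The sketch below is an illustration only: `SimpleHashRing`, the conductor names, and the node names are hypothetical and are not Ironic's actual classes or data. It assumes each conductor builds its ring once at startup from the conductors registered so far and never refreshes it.

```python
import bisect
import hashlib


def _hash(key):
    # Map a string to an integer position on the ring.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class SimpleHashRing:
    """Toy consistent hash ring (hypothetical, not Ironic's HashRing)."""

    def __init__(self, hosts, replicas=64):
        # Place several points per host on the ring to even out the split.
        self._points = sorted(
            (_hash("%s-%d" % (host, i)), host)
            for host in hosts
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._points]

    def get_host(self, node_id):
        # The first point clockwise from the node's hash owns the node.
        idx = bisect.bisect(self._keys, _hash(node_id)) % len(self._keys)
        return self._points[idx][1]


nodes = ["node-%d" % i for i in range(1000)]
started = ["cond-a", "cond-b", "cond-c"]  # startup order

# Each conductor's private, never-refreshed view of the ring.
views = {
    host: SimpleHashRing(started[: i + 1]) for i, host in enumerate(started)
}

# Nodes each conductor believes it owns, according to its own stale view.
claimed = {
    host: [n for n in nodes if ring.get_host(n) == host]
    for host, ring in views.items()
}

# Simulate cond-a failing: a node is orphaned if no surviving conductor
# claims it in its own view.
alive = ["cond-b", "cond-c"]
orphaned = [
    n for n in nodes if all(views[host].get_host(n) != host for host in alive)
]

print({host: len(owned) for host, owned in claimed.items()})
print("orphaned after cond-a fails:", len(orphaned))
```

In this model the first conductor claims every node and later conductors claim progressively smaller shares, so the claims overlap (the extra work). Once the first conductor fails, the nodes that the survivors' stale rings still map to it are claimed by nobody.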
I believe that adding time-based invalidation to HashRingManager
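A minimal sketch of what time-based invalidation could look like follows. It is not the fix that was merged for this bug. `ExpiringRingManager`, `load_hosts`, `build_ring`, `ttl`, and `clock` are all hypothetical names chosen for illustration, and `tuple` stands in for a real ring constructor so the example stays self-contained.

```python
import time


class ExpiringRingManager:
    """Caches a ring and rebuilds it once it is older than `ttl` seconds.

    Hypothetical sketch; not Ironic's HashRingManager.
    """

    def __init__(self, load_hosts, build_ring, ttl=10.0, clock=time.monotonic):
        self._load_hosts = load_hosts    # returns the currently active hosts
        self._build_ring = build_ring    # builds a ring from a list of hosts
        self._ttl = ttl
        self._clock = clock              # injectable so tests can fake time
        self._ring = None
        self._built_at = None

    @property
    def ring(self):
        now = self._clock()
        expired = self._ring is None or now - self._built_at >= self._ttl
        if expired:
            # Re-read membership so arrivals and departures are picked up.
            self._ring = self._build_ring(self._load_hosts())
            self._built_at = now
        return self._ring


# Usage with a fake clock and an in-memory membership list.
active = ["cond-a"]
fake_now = [0.0]
manager = ExpiringRingManager(
    load_hosts=lambda: list(active),
    build_ring=tuple,
    ttl=10.0,
    clock=lambda: fake_now[0],
)

first = manager.ring          # built from ["cond-a"]
active.append("cond-b")       # a second conductor registers
fake_now[0] = 5.0
cached = manager.ring         # still within ttl, so the old view is reused
fake_now[0] = 10.0
refreshed = manager.ring      # ttl elapsed, so membership is re-read
```

The trade-off in this design is staleness versus load: a shorter `ttl` narrows the window in which a conductor acts on an outdated ring, at the cost of more frequent membership lookups. The sketch is not thread-safe; a real implementation would likely need a lock around the rebuild.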
description: | updated |
description: | updated |
Changed in ironic: | |
assignee: | nobody → Gregory Haynes (greghaynes) |
status: | Triaged → In Progress |
Changed in ironic: | |
milestone: | next → juno-rc1 |
summary: |
- conductor hash ring is not updated as nodes arrive and leave
+ conductor hash ring is not updated as conductors arrive and leave |
Changed in ironic: | |
assignee: | Gregory Haynes (greghaynes) → Devananda van der Veen (devananda) |
Changed in ironic: | |
status: | In Progress → Fix Committed |
Changed in ironic: | |
status: | Fix Committed → Fix Released |
Changed in ironic: | |
milestone: | juno-rc1 → 2014.2 |
Requests are routed by the API service to conductors based on the HashRing at that point in time: API services do not currently cache the HashRing at all, and regenerate it on demand. As a result, all actions (power or provision state changes) go to the appropriate conductor.
Where this bug manifests is in automatic recovery from conductor failure. Any environmental configuration, such as TFTP configuration or a serial console session, needs to be re-created during fail-over. This is not triggered today.