conductor hash ring is not updated as conductors arrive and leave

Bug #1355510 reported by Alex Weeks
This bug affects 1 person
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: aeva black

Bug Description

Upon starting a conductor, a HashRingManager is instantiated to allocate nodes to conductor instances:

https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L199

When using multiple conductor instances, the HashRing should be updated over time as each conductor becomes active, so that nodes are allocated evenly. Unfortunately, this does not happen, because HashRingManager._ensure_rings_fresh() only updates the HashRing once, when first called:

https://github.com/openstack/ironic/blob/master/ironic/common/hash_ring.py#L129

As a result, the n-th conductor that is started in turn believes that it is responsible for 1/n of the nodes, and never becomes aware of other members of the HashRing. In practice, this does not cause direct problems as locking is used for operations where exclusive control of a node is required, but it does result in extra work being performed.

However, should a conductor fail, each conductor's view of the HashRing is not updated, and therefore it is possible for some nodes to not be owned by any conductor.
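The two effects described above (overlapping claims while all conductors are up, and unowned nodes after a failure) can be reproduced with a toy model. This is only a sketch: `ToyRing`, `claimed_nodes`, the conductor names `c1` to `c3`, and the node names are all hypothetical and are not Ironic's implementation. It assumes each conductor snapshots the membership once, at its own startup, and never refreshes it.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """Minimal consistent hash ring (hypothetical, not Ironic's HashRing)."""

    def __init__(self, hosts, replicas=32):
        # Each host contributes several points so the load spreads out.
        self._points = sorted(
            (_hash(f"{host}-{i}"), host) for host in hosts for i in range(replicas)
        )
        self._keys = [point for point, _ in self._points]

    def owner(self, node_id):
        # The first point clockwise from the node's hash owns the node.
        idx = bisect.bisect(self._keys, _hash(node_id)) % len(self._keys)
        return self._points[idx][1]


def claimed_nodes(views, alive, nodes):
    """Map each live conductor to the nodes it believes it owns."""
    return {c: {n for n in nodes if views[c].owner(n) == c} for c in alive}


nodes = [f"node-{i}" for i in range(100)]

# Each conductor snapshots the membership once, at its own startup.
stale_views = {
    "c1": ToyRing(["c1"]),
    "c2": ToyRing(["c1", "c2"]),
    "c3": ToyRing(["c1", "c2", "c3"]),
}

# All three running: c1 still believes it owns every node.
before = claimed_nodes(stale_views, ["c1", "c2", "c3"], nodes)

# c1 fails; c2 and c3 keep their startup snapshots.
after = claimed_nodes(stale_views, ["c2", "c3"], nodes)
orphans = set(nodes) - set().union(*after.values())

# Rebuilding the ring from the live membership removes the orphans.
fresh = ToyRing(["c2", "c3"])
recovered = claimed_nodes({"c2": fresh, "c3": fresh}, ["c2", "c3"], nodes)
```

In this model, `orphans` is exactly the set of nodes that `c3`'s stale view still maps to the failed `c1`; neither survivor claims them until the ring is rebuilt.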

I believe that adding time-based invalidation to HashRingManager._ensure_rings_fresh() will mitigate the problem (see https://github.com/openstack/ironic/blob/master/ironic/common/hash_ring.py#L131)
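A minimal sketch of what time-based invalidation could look like, assuming a manager that caches the conductor membership and reloads it once the cache is older than a configurable age. `RefreshingRingManager`, `load_hosts`, `max_age`, and the injectable `clock` are hypothetical names for illustration; this is not the code in ironic/common/hash_ring.py, and the actual ring construction is left out.

```python
import time


class RefreshingRingManager:
    """Hypothetical stand-in for HashRingManager with time-based refresh."""

    def __init__(self, load_hosts, max_age=10.0, clock=time.monotonic):
        self._load_hosts = load_hosts  # callable returning the live conductors
        self._max_age = max_age        # seconds before the cache is stale
        self._clock = clock            # injectable so tests can control time
        self._hosts = None
        self._loaded_at = None

    def _ensure_rings_fresh(self):
        now = self._clock()
        # Reload on first use, and again whenever the cached copy is too old.
        stale = self._loaded_at is None or now - self._loaded_at >= self._max_age
        if stale:
            self._hosts = sorted(self._load_hosts())
            self._loaded_at = now

    @property
    def hosts(self):
        self._ensure_rings_fresh()
        return list(self._hosts)
```

The trade-off in this approach is the refresh interval: a short `max_age` notices membership changes sooner at the cost of more frequent lookups, while a long one leaves a wider window in which conductors disagree about ownership.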

Alex Weeks (alex-weeks)
description: updated
description: updated
aeva black (tenbrae) wrote :

Requests are routed by the API service to conductors based on the HashRing at that point in time -- API services do not currently cache the HashRing at all, and regenerate it on demand. This results in all actions (power or provision state changes) going to the appropriate conductor.

Where this bug manifests is in automatic recovery from conductor failure. Any environmental configuration, such as TFTP configuration or a serial console session, needs to be re-created during fail-over. This is not triggered today.
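One way to picture the missing fail-over trigger: once a conductor sees an updated ring, it could compare the old and new node-to-conductor mappings and re-create local state (TFTP configuration, console sessions) only for the nodes it has just inherited. The helper below is a hypothetical sketch of that comparison; the function name and the dictionary-based mappings are assumptions, not Ironic's API.

```python
def nodes_to_take_over(me, previous_owner, current_owner):
    """Return nodes that now map to conductor `me` but did not before."""
    return sorted(
        node
        for node, owner in current_owner.items()
        if owner == me and previous_owner.get(node) != me
    )


# Example: c1 has failed and its nodes were redistributed to c2 and c3.
previous = {"node-a": "c1", "node-b": "c2", "node-c": "c1"}
current = {"node-a": "c2", "node-b": "c2", "node-c": "c3"}
takeover = nodes_to_take_over("c2", previous, current)
```

Here `c2` would rebuild state only for `node-a`; `node-b` was already its own, and `node-c` went to `c3`.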

Changed in ironic:
status: New → Triaged
importance: Undecided → High
milestone: none → next
Jay Faulkner (jason-oldos) wrote :

Also, any looping jobs that work on a Conductor are impacted by this job.

Jay Faulkner (jason-oldos) wrote :

er, are impacted by this bug

Changed in ironic:
assignee: nobody → Gregory Haynes (greghaynes)
status: Triaged → In Progress
aeva black (tenbrae)
Changed in ironic:
milestone: next → juno-rc1
aeva black (tenbrae)
summary: - conductor hash ring is not updated as nodes arrive and leave
+ conductor hash ring is not updated as conductors arrive and leave
aeva black (tenbrae)
Changed in ironic:
assignee: Gregory Haynes (greghaynes) → Devananda van der Veen (devananda)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by greghaynes (<email address hidden>) on branch: master
Review: https://review.openstack.org/109688

aeva black (tenbrae)
Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: juno-rc1 → 2014.2