Ironic

conductor hash ring is not updated as conductors arrive and leave

Bug #1355510 reported by Alex Weeks on 2014-08-11

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Ironic	Fix Released	High	aeva black	Ironic 2014.2 "juno"

Bug Description

Upon starting a conductor, a HashRingManager is instantiated to allocate nodes to conductor instances:

https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L199

When using multiple conductor instances, the HashRing should be updated over time as each conductor becomes active, in order to allocate nodes evenly. Unfortunately, this does not happen, because HashRingManager._ensure_rings_fresh() only updates the HashRing once, when first called:

https://github.com/openstack/ironic/blob/master/ironic/common/hash_ring.py#L129

As a result, the n-th conductor to start believes that it is responsible for 1/n of the nodes, and never becomes aware of conductors that join the HashRing after it. In practice, this does not cause direct problems, as locking is used for operations where exclusive control of a node is required, but it does result in extra work being performed.

However, should a conductor fail, each conductor's view of the HashRing is not updated, and therefore it is possible for some nodes to not be owned by any conductor.
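To illustrate the failure case, here is a minimal sketch of a consistent hash ring. It is a toy, not Ironic's HashRing class: ToyHashRing, the cond-* hostnames, the node names, and the choice of md5 with 16 replicas per host are all assumptions made for the example. It shows that a ring built before a conductor failed keeps mapping nodes to the dead conductor, while a rebuilt ring does not.

```python
import bisect
import hashlib


def _hash(key):
    # Stable integer hash; md5 is chosen only for illustration.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ToyHashRing:
    """Simplified consistent hash ring (hypothetical, not Ironic's code)."""

    def __init__(self, hosts, replicas=16):
        # Place several points per host on the ring to spread the load.
        self._ring = sorted(
            (_hash("%s-%d" % (host, i)), host)
            for host in hosts
            for i in range(replicas)
        )
        self._keys = [key for key, _ in self._ring]

    def get_host(self, node_uuid):
        # The owner is the first ring point clockwise from the node's hash.
        idx = bisect.bisect(self._keys, _hash(node_uuid)) % len(self._ring)
        return self._ring[idx][1]


nodes = ["node-%d" % i for i in range(100)]

# Ring built while three conductors were alive, and never refreshed.
stale_ring = ToyHashRing(["cond-a", "cond-b", "cond-c"])
alive = {"cond-a", "cond-b"}  # cond-c has since failed

# Nodes the stale view still assigns to the dead conductor.
orphaned = [n for n in nodes if stale_ring.get_host(n) not in alive]

# Rebuilding the ring from the live conductors leaves no node unowned.
fresh_ring = ToyHashRing(sorted(alive))
still_orphaned = [n for n in nodes if fresh_ring.get_host(n) not in alive]
```

Every node in `orphaned` is one that no live conductor believes it owns until the ring is rebuilt.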

I believe that adding time-based invalidation to HashRingManager._ensure_rings_fresh() will mitigate the problem (see https://github.com/openstack/ironic/blob/master/ironic/common/hash_ring.py#L131)
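A rough sketch of what time-based invalidation could look like is below. It is not the fix that was merged into Ironic: ExpiringRingCache, load_hosts, max_age, and the injectable clock are hypothetical names introduced for illustration. The cached membership is reloaded whenever it is older than max_age seconds, rather than only on first use.

```python
import time


class ExpiringRingCache:
    """Caches conductor membership and reloads it once it is too old.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, load_hosts, max_age=60.0, clock=time.monotonic):
        self._load_hosts = load_hosts  # callable returning live conductor hostnames
        self._max_age = max_age        # seconds before the cached view expires
        self._clock = clock            # injectable so the expiry can be tested
        self._hosts = None
        self._loaded_at = None

    def _ensure_fresh(self):
        now = self._clock()
        expired = (
            self._loaded_at is None
            or now - self._loaded_at >= self._max_age
        )
        if expired:
            # Reload membership instead of keeping the first view forever.
            self._hosts = frozenset(self._load_hosts())
            self._loaded_at = now

    @property
    def hosts(self):
        self._ensure_fresh()
        return self._hosts
```

With this approach a conductor's view can be stale for at most max_age seconds, so the interval trades how often membership is reloaded against how quickly a joined or failed conductor is noticed.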

Alex Weeks (alex-weeks) on 2014-08-11

description:	updated
description:	updated

aeva black (tenbrae) wrote on 2014-08-12:

Requests are routed by the API service to conductors based on the HashRing at that point in time -- API services do not currently cache the HashRing at all, and regenerate it on demand. This results in all actions (power or provision state changes) going to the appropriate conductor.

Where this bug manifests is in automatic recovery from conductor failure. Any environmental configuration, such as TFTP configuration or a serial console session, needs to be re-created during fail-over. This is not triggered today.

Changed in ironic:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → next

Jay Faulkner (jason-oldos) wrote on 2014-08-20:

Also, any looping jobs that work on a Conductor are impacted by this job.

Jay Faulkner (jason-oldos) wrote on 2014-08-20:

er, are impacted by this bug

OpenStack Infra (hudson-openstack) on 2014-08-21

Changed in ironic:
assignee:	nobody → Gregory Haynes (greghaynes)
status:	Triaged → In Progress

aeva black (tenbrae) on 2014-09-08

Changed in ironic:
milestone:	next → juno-rc1

aeva black (tenbrae) on 2014-09-23

summary:

- conductor hash ring is not updated as nodes arrive and leave
+ conductor hash ring is not updated as conductors arrive and leave

aeva black (tenbrae) on 2014-09-26

Changed in ironic:
assignee:	Gregory Haynes (greghaynes) → Devananda van der Veen (devananda)

OpenStack Infra (hudson-openstack) wrote on 2014-09-27: Change abandoned on ironic (master)

Change abandoned by greghaynes (<email address hidden>) on branch: master
Review: https://review.openstack.org/109688

aeva black (tenbrae) on 2014-10-03

Changed in ironic:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2014-10-03

Changed in ironic:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2014-10-16

Changed in ironic:
milestone:	juno-rc1 → 2014.2
