[RFE] Take a node out of service if no active conductor supports the node's driver

Bug #1526735 reported by Vladyslav Drok
This bug affects 1 person
Affects: Ironic
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned

Bug Description

First, the API won't allow clients to register a node with an invalid driver (non-existent or not present on any of the active conductors). However, conductors can go offline at any point, which leaves already-registered nodes with a driver that is no longer valid. The intention of this blueprint is to make sure that every node registered with an invalid driver gets marked as out-of-service.

Marking a node as out-of-service should also remove the node from the scheduler immediately, to avoid a retry-fail loop[1].

Here are two ideas for marking the node as out-of-service:

1 (Simpler) - Have a periodic task that gets the list of active drivers and iterates through the list of registered nodes, checking whether each node's driver is still valid.

2 - The consistent hashing algorithm[2] maps conductors to nodes, taking into account each node's driver and the list of drivers that each active conductor has. The algorithm is also responsible for maintaining a list of dead conductors. Every time a conductor goes offline, it should trigger a task that first checks whether the drivers the dead conductor had are still present on any other active conductor. If a driver is no longer present anywhere, the task should fetch the list of nodes that need that driver and mark them as out-of-service.
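
The two ideas above can be sketched roughly as follows. This is only an illustration, not Ironic code: the function names, the plain dictionaries standing in for the conductor and node tables, and the driver names (driver_a, driver_b, driver_c) are all assumptions made for the example.

```python
from typing import Dict, List, Set


def active_drivers(conductors: Dict[str, Set[str]]) -> Set[str]:
    """Union of the drivers offered by the conductors that are still online."""
    drivers: Set[str] = set()
    for supported in conductors.values():
        drivers |= supported
    return drivers


def nodes_without_driver(nodes: Dict[str, str],
                         conductors: Dict[str, Set[str]]) -> List[str]:
    """Idea 1: body of a periodic check over every registered node.

    `nodes` maps node id -> driver name (hypothetical layout).
    Returns the nodes whose driver no active conductor supports.
    """
    available = active_drivers(conductors)
    return sorted(n for n, driver in nodes.items() if driver not in available)


def on_conductor_offline(dead_drivers: Set[str],
                         nodes: Dict[str, str],
                         remaining: Dict[str, Set[str]]) -> List[str]:
    """Idea 2: react to one conductor going offline.

    Only the drivers the dead conductor carried are examined; a node is
    returned if its driver was lost and no remaining conductor has it.
    """
    orphaned = set(dead_drivers) - active_drivers(remaining)
    return sorted(n for n, driver in nodes.items() if driver in orphaned)


# Hypothetical data: conductor c1 carries two drivers, c2 only one.
conductors = {"c1": {"driver_a", "driver_b"}, "c2": {"driver_a"}}
nodes = {"n1": "driver_a", "n2": "driver_b", "n3": "driver_c"}

periodic_result = nodes_without_driver(nodes, conductors)
offline_result = on_conductor_offline(conductors["c1"], nodes,
                                      {"c2": conductors["c2"]})
```

With this data the periodic check flags n3, whose driver no conductor offers, while the offline handler flags n2 once c1 is gone. Neither function decides which conductor performs the marking, and that question remains open in both proposals.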

[1] https://bugs.launchpad.net/ironic/+bug/1260099
[2] https://blueprints.launchpad.net/ironic/+spec/instance-mapping-by-consistent-hash

Vladyslav Drok (vdrok)
Changed in ironic:
status: New → Confirmed
importance: Undecided → Wishlist
tags: added: rfe
Vladyslav Drok (vdrok) wrote :

Copy of whiteboard:

"the API won't allow clients to register a node with an invalid driver"
-- I tested this today, and the API still allowed it. So I have filed this review to fix it:
   https://review.openstack.org/68018

I see a problem with both your proposed solutions.
[1] where does this periodic_task run? If it runs on all conductors, which one decides what nodes to mark offline?
[2] again, which surviving conductor is responsible for marking the nodes-now-owned-by-no-one as dead?

Take the extreme case -- what if all conductors are offline? Then all nodes are unavailable, since the hash won't map any node anywhere (there will be no drivers in the ring, right?).

I do not think this should be "conductor marks a node inactive in the database". Instead, I think we need to:
1) ensure that the nova driver only gets a list of actually-available nodes, and will remove no-longer-available nodes from its list, during each cycle where it refreshes the view of available resources
2) gracefully handle requests to the API to manage nodes which no longer have any active conductor.

I think that patch https://review.openstack.org/68018 goes some way toward handling (2), but it may need more work. I suspect we already have the means (or most of them) to do (1) as well, but I'm not sure whether that's in the Nova driver or not.
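
To illustrate point (1), here is a minimal sketch of a refresh cycle that derives availability on each pass instead of storing an "inactive" flag in the database. It is not taken from the Nova driver or from Ironic; the function name, the dictionary layout and the driver names are assumptions for the example.

```python
from typing import Dict, Set, Tuple


def refresh_available(cached: Set[str],
                      nodes: Dict[str, str],
                      conductors: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Recompute which nodes are usable right now.

    `cached` is the set of node ids from the previous cycle, `nodes` maps
    node id -> driver name, and `conductors` maps each online conductor to
    the drivers it supports (all hypothetical structures).
    Returns (available node ids, node ids to drop from the cached view).
    """
    # With no conductors online this is the empty set, so every node drops out.
    live_drivers: Set[str] = set().union(*conductors.values())
    available = {n for n, driver in nodes.items() if driver in live_drivers}
    dropped = cached - available
    return available, dropped


nodes = {"n1": "driver_a", "n2": "driver_b"}
available, dropped = refresh_available({"n1", "n2"}, nodes, {"c2": {"driver_a"}})
none_available, all_dropped = refresh_available({"n1", "n2"}, nodes, {})
```

Because nothing is written back, the extreme case where every conductor is offline needs no surviving conductor to act: the next refresh simply returns an empty list.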

Just my thoughts,
Devananda 2014-01-20

Jim Rollenhagen (jim-rollenhagen) wrote :

Sounds like this might need a spec.

tags: added: needs-spec