[RFE] Add Health information to node.

Bug #1729705 reported by Chris Krelle
This bug affects 1 person
Affects: Ironic
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none)

Bug Description

I would like to propose adding health information to the Node object.

IPMITOOL already supports reporting of sensor and health information. By adding health information to the node object, external monitoring and scheduling systems would be able to detect whether the node is healthy before deploying.

Collection of health data can be accomplished with a Periodic task.

I see health information as being read-only and set only via the Periodic task.
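
To make the idea concrete, here is a minimal sketch of what one pass of such a periodic task might compute. It is not Ironic code: the function names `parse_sdr` and `collect_health`, the `overall`/`sensors` layout, and the severity mapping are all assumptions made for illustration. It also assumes the sensor text has the three-column `name | reading | status` shape that `ipmitool sdr` typically prints, which can vary by vendor. The result is wrapped in a read-only mapping to mirror the "set only via the periodic task" intent.

```python
from types import MappingProxyType

# Assumed severity ranking for the status column of `ipmitool sdr` output
# (ok, ns = no sensor, nc = non-critical, cr = critical, nr = non-recoverable).
_SEVERITY = {"ok": 0, "ns": 0, "nc": 1, "cr": 2, "nr": 2}
_LABELS = ("ok", "warning", "critical")


def parse_sdr(text):
    """Parse 'name | reading | status' lines into a dict keyed by sensor name."""
    sensors = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, reading, status = parts
        sensors[name] = MappingProxyType({"reading": reading, "status": status})
    return sensors


def collect_health(sdr_text):
    """Build a read-only health record (hypothetical layout) from sensor text."""
    sensors = parse_sdr(sdr_text)
    if not sensors:
        overall = "unknown"  # no data is not the same as healthy
    else:
        # Unrecognized status codes are treated as a warning (an assumption).
        worst = max(_SEVERITY.get(s["status"], 1) for s in sensors.values())
        overall = _LABELS[worst]
    return MappingProxyType(
        {"overall": overall, "sensors": MappingProxyType(sensors)}
    )
```

A periodic task would call something like `collect_health()` with the text returned by the BMC and store the result on the node; API consumers would only ever read it.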

Future revisions could automatically place nodes with detected failures in maintenance mode, but this is not planned for the initial implementation of this feature.

Tags: needs-spec rfe
Julia Kreger (juliaashleykreger) wrote :

I really like the high level idea.

I would actually advocate two modes of operation: support for self-set periodic tasks, and an externally writable field. Perhaps this could be two fields instead? I guess a config parameter could allow an operator to externally supply information into the field, and something like the "management" interface could check it when node.validate is called. If the overall health is bad, then reject action upon the node. Actually, it might be better to place this in the boot interface, now that I think about it further. Worth consideration at least. For a more stand-alone-centric operator, an interface-level periodic task could be better, but which items would be checked, and how that could be adapted or changed, might be worth considering.
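
A rough sketch of the two-field idea, purely for discussion: `NodeUnhealthy`, `validate_health`, the two field names, and the `use_external` switch (standing in for the config parameter) are all hypothetical and do not exist in ironic.

```python
from typing import Optional


class NodeUnhealthy(Exception):
    """Hypothetical error raised when validation rejects a node."""


def validate_health(
    conductor_health: Optional[str],
    external_health: Optional[str],
    use_external: bool = True,
) -> None:
    """Reject the node if any enabled health source reports a non-ok state.

    conductor_health: value set by the periodic task (read-only via the API).
    external_health: value written by an external monitoring tool.
    use_external: stands in for an operator-controlled config option.
    A value of None means "nothing reported" and is not treated as a failure.
    """
    sources = {"conductor": conductor_health}
    if use_external:
        sources["external"] = external_health
    bad = {name: value for name, value in sources.items()
           if value not in (None, "ok")}
    if bad:
        raise NodeUnhealthy(f"node rejected, unhealthy sources: {bad}")
```

Whether a check like this belongs in the management interface, the boot interface, or somewhere else is exactly the open question; the sketch only shows the rejection logic.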

Part of this thinking, at least for myself, is why re-invent the wheel if there is an external monitoring tool that could just report into ironic, and then we could represent the health in many cases. Granted, we do want ironic to be the source of truth, but power state monitoring is already quite a bit of conductor overhead with larger deployments of nodes.

With IPMITOOL-based systems, my additional worry would be uniformity across hardware, so I wonder what that data would really be and what it would represent. I think the lowest-cost thing is to provide the ability for loose external integration, and the ability for scheduling onto a node to be failed in the nova virt driver at some point via the validation interface. Periodic tasks, especially differing ones that may be IPMI but slightly different due to vendor differences, could be problematic to land in a consensus-driven manner as well. That worry also causes me to think that loose integration, with the ability to drive in other ways, may be viable for users without impacting scalability.

So information wise, would the node object field just be a high level "OK!" or would it be a dictionary with lots of information that could change? Would data over time be worth considering, or enough data to determine that there has been a delta? I guess it goes without saying this will definitely require a specification detailing what the MVP for health status monitoring and use patterns would be, perhaps also providing insight into what could be the future in varying style deployments.
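
To illustrate the "enough data to determine that there has been a delta" option, here is a small sketch that compares two per-sensor status snapshots. The snapshot layout (sensor name mapped to a status string) and the name `health_delta` are assumptions for illustration only; a spec would have to define the real shape.

```python
def health_delta(previous, current):
    """Return {sensor: (old_status, new_status)} for every sensor that changed.

    Sensors that appear or disappear between snapshots are reported with
    None on the missing side.
    """
    changed = {}
    for name in sorted(previous.keys() | current.keys()):
        old, new = previous.get(name), current.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed
```

With only the latest snapshot and the previous one stored, a consumer can tell what changed without ironic keeping a full history.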

Vladyslav Drok (vdrok) wrote :

If this is just about adding a periodic task to query the IPMI controller, like the sensors we already have, I'm OK with this. If you want to do what Julia describes, I think this would need a spec.

Changed in ironic:
importance: Undecided → Wishlist
status: New → Confirmed
tags: added: needs-spec