Bug #1729705 “[RFE] Add Health information to node.” : Bugs : Ironic

Julia Kreger (juliaashleykreger) wrote on 2017-11-02:

#1

I really like the high level idea.

I would actually advocate two modes of operation: support for self-set periodic tasks, and an externally writable field. Perhaps this could be two fields instead? I guess a config parameter could allow an operator to externally supply information into the field, and something like the "management" interface could check it when node.validate is called. If the overall health is bad, then reject action upon the node. Actually, it might be better to place this in the boot interface, now that I think about it further. Worth consideration at least. For a more stand-alone-centric operator, an interface-level periodic task could be better, but which items would be checked, and how that could be adapted or changed, might be worth considering.
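As a rough illustration of the validate-time check described above, here is a minimal sketch. It assumes two separate fields (one written by a periodic task, one by an outside tool) and a config-style switch for honoring the external one. `Node`, `NodeHealthError`, `validate_health`, and `allow_external` are hypothetical names for this example only, not ironic's actual objects or options.

```python
from dataclasses import dataclass
from typing import Optional


class NodeHealthError(Exception):
    """Raised when validation rejects a node because of its reported health."""


@dataclass
class Node:
    # Hypothetical stand-in for a node record; not ironic's Node object.
    uuid: str
    internal_health: Optional[str] = None  # set by a self-run periodic task
    external_health: Optional[str] = None  # set by an outside monitoring tool


def validate_health(node: Node, allow_external: bool = True) -> None:
    """Reject the node if any health report that we honor is not "ok".

    A report of None means "nothing reported" and is not treated as bad.
    """
    reports = {"internal": node.internal_health}
    if allow_external:
        # Only consult the externally supplied field when the operator opts in.
        reports["external"] = node.external_health

    bad = {src: val for src, val in reports.items()
           if val is not None and val != "ok"}
    if bad:
        raise NodeHealthError(
            f"node {node.uuid} rejected, unhealthy reports: {bad}")
```

Whether such a check belongs in the management interface, the boot interface, or somewhere else is exactly the open question; the sketch only shows the accept/reject logic.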

Part of this thinking, at least for myself, is why re-invent the wheel if there is an external monitoring tool that could just report into ironic, and then we could represent the health in many cases. Granted, we do want ironic to be the source of truth, but power state monitoring is already quite a bit of conductor overhead with larger deployments of nodes.

With IPMITOOL-based systems, my additional worry would be uniformity across hardware as well, so then I wonder what that data would really be and what it would represent. I think the lowest-cost thing is to provide the ability for loose external integration, and the ability for scheduling onto a node to be failed in the nova virt driver at some point via the validation interface. Periodic tasks, especially differing ones that may be IPMI-based but slightly different due to vendor differences, could be problematic to land in a consensus-driven manner as well. That worry also causes me to think that loose integration with the ability to drive in other ways may be viable for users, without impacting scalability.

So information-wise, would the node object field just be a high-level "OK!" or would it be a dictionary with lots of information that could change? Would data over time be worth considering, or enough data to determine that there has been a delta? I guess it goes without saying this will definitely require a specification detailing what the MVP for health status monitoring and use patterns would be, perhaps also providing insight into what the future could be in varying styles of deployment.
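To make the "single status versus dictionary" question a bit more concrete, here is a small sketch of the dictionary option, with a summary value derived from it and a delta computed between two reports. The report keys ("fan", "psu", "temp"), the status strings, and the function names are invented for illustration; nothing here reflects an agreed format.

```python
def overall_status(report):
    """Collapse a detailed report into one high-level value."""
    if not report:
        return "unknown"  # nothing has been reported yet
    return "ok" if all(v == "ok" for v in report.values()) else "degraded"


def health_delta(previous, current):
    """Return {key: (old, new)} for every entry that changed.

    A key missing from one side shows up as None on that side.
    """
    keys = set(previous) | set(current)
    return {
        k: (previous.get(k), current.get(k))
        for k in keys
        if previous.get(k) != current.get(k)
    }


# Two hypothetical reports taken at different times.
earlier = {"fan": "ok", "psu": "ok"}
later = {"fan": "critical", "psu": "ok", "temp": "ok"}

summary = overall_status(later)
changes = health_delta(earlier, later)
```

Keeping only the previous report is enough to detect a delta; storing data over time would need more than a single field.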

Vladyslav Drok (vdrok) wrote on 2017-11-03:

#2

If this is just about adding a periodic task to query the IPMI controller, just like the sensors we have, I'm OK with this. If you want to do what Julia describes, I think this would need a spec.

Changed in ironic:
importance:	Undecided → Wishlist
status:	New → Confirmed

Julia Kreger (juliaashleykreger) on 2017-11-05

tags:	added: needs-spec
