Opinion: o-s-c update defaults for nova service checks
Bug #1876106 reported by Peter Sabaini
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
charm-openstack-service-checks | Won't Fix | Wishlist | Unassigned |
Bug Description
I'm growing more and more wary about the defaults for the nova service checks in o-s-c.
Checking the number of units in a host aggregate (the nova_warn and nova_crit options) strikes me as non-actionable. The number of hosts in an aggregate is up to the customer, and there are valid use cases for having 0 hosts in an aggregate, e.g. using a host aggregate as a spare pool of machines.
Similarly, having hosts disabled in the compute service list is a valid use case, e.g. for having hosts in hardware maintenance, and alerting about this is of little value.
Therefore I'd propose to change the defaults:
nova_warn, nova_crit = -1 # disable
skip-disabled = True # skip the disabled compute check by default
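To illustrate what these defaults would mean in practice, here is a minimal sketch, not the charm's actual check code. The function names and the service dict layout are hypothetical, and it assumes the check alerts when fewer hosts than the threshold are up, so that a negative threshold never matches.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin return codes


def aggregate_status(hosts_up, warn, crit):
    """Return a Nagios-style status for one host aggregate.

    Assumption: an alert fires when fewer hosts than the threshold are up.
    A negative threshold (e.g. -1) disables that level.
    """
    if crit >= 0 and hosts_up < crit:
        return CRITICAL
    if warn >= 0 and hosts_up < warn:
        return WARNING
    return OK


def disabled_hosts_status(services, skip_disabled):
    """Warn about disabled compute services unless skip_disabled is set.

    `services` is a hypothetical list of dicts with a "status" key.
    """
    if skip_disabled:
        return OK
    disabled = [s for s in services if s["status"] == "disabled"]
    return WARNING if disabled else OK
```

With nova_warn and nova_crit at -1, an empty spare-pool aggregate no longer alerts, and with skip-disabled set to True a host disabled for hardware maintenance is ignored.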
Changed in charm-openstack-service-checks:
status: New → Triaged
importance: Undecided → Wishlist
Hi! There is another Juju config parameter (skipped_host_aggregates) which is empty by default but can hold a comma-separated list of the host aggregates you don't want checked against the nova_warn and nova_crit thresholds.
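As a rough sketch of how such a comma-separated skip list could be applied (the helper names and the aggregate mapping are hypothetical, not taken from the charm):

```python
def parse_skipped(value):
    """Split a comma-separated option value into a set of aggregate names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def aggregates_to_check(aggregates, skipped_host_aggregates=""):
    """Drop aggregates listed in the skip option before applying thresholds.

    `aggregates` maps an aggregate name to its list of hosts.
    """
    skipped = parse_skipped(skipped_host_aggregates)
    return {name: hosts for name, hosts in aggregates.items()
            if name not in skipped}
```

An empty option value leaves every aggregate in scope, which matches the empty default described above.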
The rationale behind monitoring host aggregates is to avoid running out of resources because of hardware issues. Defaulting to -1 would effectively mean those options are ignored. When a managed service is not in sync with customer operations, it does make sense to disable them: as you said, there is no action for the undercloud operators to take, and false positives may occur because HA may not be implemented for the overcloud services.
Similarly, skip-disabled (the check raises a warning when this flag is off) was implemented so that hardware left out of service for too long does not go unnoticed. Defaulting it to True would amount to not having the check at all. That said, tracking the list of out-of-service nodes by other means may be a better option, and in that case I agree the skip-disabled default should be True.