Opinion: o-s-c update defaults for nova service checks

Bug #1876106 reported by Peter Sabaini
This bug affects 1 person
Affects: charm-openstack-service-checks
Status: Won't Fix
Importance: Wishlist
Assigned to: Unassigned

Bug Description

I'm growing more and more wary of the defaults for the nova service checks in o-s-c.

Checking the number of units in a host aggregate (the nova_warn and nova_crit options) strikes me as non-actionable. The number of hosts in an aggregate is up to the customer, and there are valid use cases for having 0 hosts in an aggregate, e.g. using a hostagg as a spare pool of machines.

Similarly, having hosts disabled in the compute service list is a valid use case, e.g. for hosts undergoing hardware maintenance, and alerting on this is of little value.

Therefore I'd propose to change the defaults:

nova_warn, nova_crit = -1 # disable
skip-disabled = True # skip the disabled compute check by default
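
To illustrate what the proposed -1 default would mean in practice, here is a minimal Python sketch of a threshold evaluation. It is not the charm's actual check code: the function name `aggregate_status`, the "at or below the threshold" comparison, and the treatment of any negative value as "disabled" are assumptions made for illustration only.

```python
# Nagios-style status codes.
OK, WARNING, CRITICAL = 0, 1, 2


def aggregate_status(hosts_up, warn, crit):
    """Return a Nagios-style status for one host aggregate.

    Assumptions (hypothetical, not taken from the charm):
    - a level fires when hosts_up is at or below its threshold;
    - a negative threshold (e.g. -1) disables that level.
    """
    if crit >= 0 and hosts_up <= crit:
        return CRITICAL
    if warn >= 0 and hosts_up <= warn:
        return WARNING
    return OK
```

Under these assumptions, `aggregate_status(0, -1, -1)` stays OK even for an empty aggregate, which is the behaviour the proposal asks for as the default.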

Alvaro Uria (aluria)
Changed in charm-openstack-service-checks:
status: New → Triaged
importance: Undecided → Wishlist
Alvaro Uria (aluria) wrote :

Hi! There is another Juju config parameter (skipped_host_aggregates), which is empty by default but can hold a comma-separated list of the host aggregates you don't want monitored against the nova_warn and nova_crit thresholds.

The rationale behind monitoring host aggregates is to avoid running out of resources due to hardware issues. Setting -1 by default would essentially mean those options are ignored. When a managed service is not in sync with customer operations, it makes sense to disable them: as you said, there's no action for the undercloud operators to take. Besides, false positives may occur, as HA overcloud services may not be implemented.

Similarly, skip-disabled (a warning is triggered when the flag is disabled) was implemented to avoid missing hardware that has been out of service for too long. Making it the default would be equivalent to not having the option implemented at all. Alternative means of tracking the list of out-of-service nodes may be a better option, though. In this case, I agree the skip-disabled default value should be True.
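
As a rough illustration of the two options discussed in this comment, the following Python sketch filters host aggregates through a comma-separated skip list and reports disabled compute hosts only when skip-disabled is off. The helper names and the shape of the `services` records (dicts with `host` and `status` keys) are hypothetical and are not taken from the charm.

```python
def parse_skipped(value):
    """Split a comma-separated option value into a set of names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def aggregates_to_check(aggregates, skipped_host_aggregates=""):
    """Drop aggregates named in the skip list; empty means check all."""
    skipped = parse_skipped(skipped_host_aggregates)
    return [name for name in aggregates if name not in skipped]


def disabled_hosts_to_warn(services, skip_disabled=False):
    """Return hosts to warn about; none when skip_disabled is True."""
    if skip_disabled:
        return []
    return [s["host"] for s in services if s.get("status") == "disabled"]
```

With an empty skipped_host_aggregates value every aggregate is checked, and with skip-disabled set to True the disabled-host warning never fires, which matches the trade-off described above.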

Peter Sabaini (peter-sabaini) wrote :

Hey Alvaro,

I get why those checks are there; it's just that I'm finding them less useful than I had hoped :-)

Wrt monitoring the hostagg node count, IMHO one of the problems is also that resource tracking at this level is rather coarse, and Grafana/Prom do a better job of gauging capacity.

Wrt the skip-disabled warning, I feel that since it's non-actionable as well (on most clouds there is some hardware in maintenance some of the time), it soon falls prey to alert fatigue and helps build it, too.

cheers,
peter.

Eric Chen (eric-chen) wrote :

This charm is no longer being actively maintained. Please consider using the new Canonical Observability Stack instead.
(https://charmhub.io/topics/canonical-observability-stack)
I will close this feature request.

Changed in charm-openstack-service-checks:
status: Triaged → Won't Fix