Openstack Integrator should have nrpe checks that monitor status of openstack components supporting k8s workloads

Bug #1853886 reported by Drew Freiberger
This bug affects 2 people
Affects                         Status        Importance  Assigned to     Milestone
Openstack Integrator Charm      Won't Fix     Wishlist    Unassigned
charm-openstack-service-checks  Fix Released  Medium      Robert Gildein

Bug Description

As mentioned in lp#1853668, issues on the backend of the OpenStack underlay can cause odd or failing service access for Kubernetes workloads.

The openstack integrator charm should have monitoring hooks added for nrpe-external-master that provide checks of load balancers, FIPs, networks, etc. (anything that is managed by the openstack-integrator on behalf of Kubernetes), so that any OpenStack components configured by the integrator are monitored and their status is reported via nagios.

For instance, if there is a loadbalancer running in support of a service endpoint, its status and the statuses of its pool members should be monitored and reported up to Kubernetes and/or nagios in a way that can be exposed to operators of multi-tiered clouds.

George Kraft (cynerva)
Changed in charm-openstack-integrator:
importance: Undecided → Wishlist
status: New → Triaged
Changed in charm-openstack-integrator:
assignee: nobody → Robert Gildein (rgildein)
status: Triaged → In Progress
Robert Gildein (rgildein) wrote :

I think the best approach to fix this bug would be to add `layer:nagios` and create python scripts in the templates folder, with each py file containing one of the checks.

Here is my idea of what an nrpe check for all OpenStack networks should look like.
When the `nrpe-external-master.available` flag is set, the `check_openstack_networks.py` file will be installed as a nagios plugin. This file checks all OpenStack networks to see if they are in the ACTIVE state. If a network is in the DOWN state, it raises a warning, and if another problem occurs (a problem with parsing networks from the OpenStack output, etc.), it raises a critical error.
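
The status-to-alert mapping described above could be sketched roughly as follows. This is only an illustration, not the code from the charm: the function name `evaluate_networks`, the `down_severity` parameter, and the assumed input (a JSON list of objects with `name` and `status` keys, however it is obtained from OpenStack) are all hypothetical. Treating any status other than ACTIVE or DOWN as critical is also an assumption.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_networks(raw_json, down_severity=WARNING):
    """Return (exit_code, message) for a JSON list of networks.

    Hypothetical helper: assumes each entry has "name" and "status" keys.
    """
    try:
        statuses = {net["name"]: net["status"] for net in json.loads(raw_json)}
    except (ValueError, KeyError, TypeError) as error:
        # Any parsing problem is reported as a critical error.
        return CRITICAL, "CRITICAL: cannot parse networks ({!r})".format(error)

    down = sorted(n for n, s in statuses.items() if s == "DOWN")
    # Assumption: statuses other than ACTIVE/DOWN count as "another problem".
    unexpected = sorted(
        n for n, s in statuses.items() if s not in ("ACTIVE", "DOWN")
    )

    if unexpected:
        return CRITICAL, "CRITICAL: unexpected status for: " + ", ".join(unexpected)
    if down:
        label = "CRITICAL" if down_severity == CRITICAL else "WARNING"
        return down_severity, "{}: networks DOWN: {}".format(label, ", ".join(down))
    return OK, "OK: {} networks ACTIVE".format(len(statuses))
```

A plugin wrapper would print the returned message and exit with the returned code, which is how nagios reads the result. Keeping the severity for DOWN as a parameter makes it easy to raise it to CRITICAL without touching the rest of the logic.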

After verifying the correctness of my approach, I will provide more information about other checks.

Robert Gildein (rgildein) wrote :

WIP PR at https://github.com/juju-solutions/charm-openstack-integrator/pull/43

In my design, I changed only one thing, and that was creating individual py files for checks to reuse functions.

Alvaro Uria (aluria) wrote :

The approach described in #1 has been slightly modified. When a Neutron port reports "DOWN", the nagios alert raised is CRITICAL, not warning.

The PR mentioned in #2 is ready for review. As mentioned in my last comment [1], I think the use of python-openstackclient, installed via layer.yaml, will need to be reviewed. The nrpe script(s) are able to use native python libs, but that would break the approach taken until now (use a snap from the snapstore, or one deployed via Juju resources in case no Internet access exists).

1. https://github.com/juju-solutions/charm-openstack-integrator/pull/43#issuecomment-778240754

tags: added: review-needed
Alvaro Uria (aluria) wrote :

Even if the previous PR is approved, this bug will not be complete, because an LB nrpe check is still missing (WIP).

Robert Gildein (rgildein) wrote :

After the discussion, we decided that NRPE checks should not be a
part of charm-openstack-integrator. Instead,
charm-openstack-service-checks should be used. Since it already has
a check for Octavia LB, I will only add a check for OpenStack resources.

Changed in charm-openstack-integrator:
assignee: Robert Gildein (rgildein) → nobody
Changed in charm-openstack-service-checks:
assignee: nobody → Robert Gildein (rgildein)
status: New → In Progress
Changed in charm-openstack-service-checks:
importance: Undecided → Medium
Changed in charm-openstack-service-checks:
status: In Progress → Fix Committed
George Kraft (cynerva)
Changed in charm-openstack-integrator:
status: In Progress → Won't Fix
Changed in charm-openstack-service-checks:
milestone: none → 22.10
Changed in charm-openstack-service-checks:
status: Fix Committed → Fix Released
Changed in charm-openstack-service-checks:
status: Fix Released → Fix Committed
Changed in charm-openstack-service-checks:
status: Fix Committed → Fix Released