upgrade from rocky to bionic results in "nova-os-api-compute.service is not running" in nagios (used to say "to-stein", but actually happens on upgrade to rocky)

Bug #1849897 reported by Drew Freiberger on 2019-10-25
This bug affects 1 person
Affects: OpenStack nova-cloud-controller charm
Status: In Progress
Importance: High
Assigned to: Alex Kavanagh

Bug Description

After performing upgrades from xenial-queens through to bionic-stein, a stale nagios check is left defined:

nova-api-os-compute - CRITICAL: nova-api-os-compute.service is not running

When I investigate, I find that the nova-api-os-compute service is masked on the system.
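
A quick way to confirm the masked state, assuming a systemd-based host (exact output may vary):

$ systemctl is-enabled nova-api-os-compute
masked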

It appears this service has moved to an apache2 WSGI vhost, configured via:
/etc/apache2/sites-enabled/wsgi-api-os-compute.conf

I would like to suggest changing this nagios check to validate content availability of this WSGI service rather than checking the status of a systemd service that is no longer valid.
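
For illustration only, such a check could be built on check_http against the API's listen port; the port (8774), file path, and command name below are assumptions about a typical deployment, not the charm's actual implementation:

# hypothetical NRPE command definition, e.g. /etc/nagios/nrpe.d/check_nova_api_os_compute.cfg
command[check_nova_api_os_compute]=/usr/lib/nagios/plugins/check_http -I 127.0.0.1 -p 8774 -u / -e "200,300"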

Changed in charm-nova-cloud-controller:
status: New → Triaged
importance: Undecided → High
tags: added: openstack-upgrade
summary: - upgrade from rocky to stein results in "nova-os-api-compute.service is
- not running" in nagios
+ upgrade from rocky to bionic results in "nova-os-api-compute.service is
+ not running" in nagios (used to say "to-stein", but actually happens on
+ upgrade to rocky)
Alex Kavanagh (ajkavanagh) wrote:

So, I've reproduced it, and it actually happens on the upgrade from bionic-queens (distro) -> bionic-rocky. The service does move from its "own" API executable to being run under WSGI in apache2. The fix (as Drew reported) is to migrate the check to validating the WSGI service. An interim measure is simply to rely on the fact that the apache2 check "means" the API is running at bionic, and to remove the defunct nova-api-os-compute check altogether.

I'll investigate the difficulty of the former, but will put a patch in to clean up the defunct check at bionic+.
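
As a sanity check that the API really is being served by apache2 after the upgrade, querying the version document should succeed; the port (8774) and the expected 200 response are assumptions about a typical deployment:

$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8774/
200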

Changed in charm-nova-cloud-controller:
assignee: nobody → Alex Kavanagh (ajkavanagh)
status: Triaged → In Progress
Andrea Ieri (aieri) wrote:

As a workaround, the check can be removed by injecting updated relation data.

Example:

$ juju run -u nova-cloud-controller/0 -- relation-ids nrpe-external-master
nrpe-external-master:257

$ juju run -u nova-cloud-controller/0 -- relation-list -r257
nrpe-container/38

$ juju run -u nrpe-container/38 -- relation-get -r257 - nova-cloud-controller/0

[...checks are here...]

Save the monitors to a file and remove the nova-api-os-compute check:

$ cat monitors.lp1849897.out
monitors:
  remote:
    nrpe:
      apache2: {command: check_apache2}
      haproxy: {command: check_haproxy}
      haproxy_queue: {command: check_haproxy_queue}
      haproxy_servers: {command: check_haproxy_servers}
      memcached: {command: check_memcached}
      nova-conductor: {command: check_nova-conductor}
      nova-consoleauth: {command: check_nova-consoleauth}
      nova-novncproxy: {command: check_nova-novncproxy}
      nova-scheduler: {command: check_nova-scheduler}

Now set the amended relation data:

$ juju run -u nova-cloud-controller/0 -- relation-set -r257 monitors="$(cat monitors.lp1849897.out)"
