[17.11][ocata] Host power off event does not get passed from nova-os-api to ceilometer and, in turn, to aodh | nova-compute <-> designate relation breaks event delivery to ceilometer

Bug #1738100 reported by Dmitrii Shcherbakov
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| OpenStack Nova Compute Charm | Fix Released | Undecided | Dmitrii Shcherbakov | |

Bug Description

The bundle is identical to this one:

https://git.io/vbaYH

nova-cc -> ceilometer-agent-notification -> aodh

After repeatedly powering an instance on and off, no event was passed to ceilometer-agent-notification. I straced it for recvfrom and never saw anything like the output from a successful event delivery (http://paste.ubuntu.com/26179704/).

I tried this on two clouds: a downsized non-HA cloud with 5 machines and HTTPS enabled, and a full HA cloud without HTTPS.

Alarm definition:

openstack alarm create --type event --name instance_off --description 'Instance powered OFF' --event-type "compute.instance.power_off.*" --enable True --query "traits.instance_id=string::`openstack server show testsrv -f value -c id`" --alarm-action 'log://' --ok-action 'log://' --insufficient-data-action 'log://'

Because of that, the alarm never went from the "insufficient data" state to the "alarm" state (https://git.io/vbaO0).

+--------------------------------------+-------+--------------+-------------------+----------+---------+
| alarm_id | type | name | state | severity | enabled |
+--------------------------------------+-------+--------------+-------------------+----------+---------+
| 7db14634-4b35-4a47-bdb6-1e20810316a3 | event | instance_off | insufficient data | low | True |
+--------------------------------------+-------+--------------+-------------------+----------+---------+
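To make the expected behaviour concrete, here is a minimal Python sketch of how an event alarm like the one above could be evaluated. This is a simplified model and not aodh's actual evaluator: the function names, the event dictionaries and the instance ID are all assumptions made for illustration. The point is that if no matching event is ever delivered, there is nothing to evaluate and the alarm stays in "insufficient data".

```python
from fnmatch import fnmatch

# Hypothetical instance ID, standing in for the output of
# `openstack server show testsrv -f value -c id`.
INSTANCE_ID = "00000000-0000-0000-0000-000000000001"


def event_matches(event, event_type_pattern, required_traits):
    """Return True if the event matches the alarm rule in this simplified model."""
    if not fnmatch(event["event_type"], event_type_pattern):
        return False
    traits = event.get("traits", {})
    return all(traits.get(key) == value for key, value in required_traits.items())


def alarm_state(events, event_type_pattern, required_traits):
    """An event alarm fires only when a matching event actually arrives."""
    for event in events:
        if event_matches(event, event_type_pattern, required_traits):
            return "alarm"
    return "insufficient data"


pattern = "compute.instance.power_off.*"
query = {"instance_id": INSTANCE_ID}

power_off = {"event_type": "compute.instance.power_off.end",
             "traits": {"instance_id": INSTANCE_ID}}
power_on = {"event_type": "compute.instance.power_on.end",
            "traits": {"instance_id": INSTANCE_ID}}

print(alarm_state([], pattern, query))                     # nothing delivered
print(alarm_state([power_on], pattern, query))             # wrong event type
print(alarm_state([power_on, power_off], pattern, query))  # matching event arrived
```

In the failing case described here, the situation corresponds to the first call: the event list seen by the alarm evaluator is empty.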

After doing

juju config nova-cloud-controller verbose=true debug=true && juju config ceilometer debug=true verbose=true && juju config aodh debug=true

and retrying I successfully got an alarm to the proper state:

openstack alarm list
+--------------------------------------+-------+--------------+-------+----------+---------+
| alarm_id | type | name | state | severity | enabled |
+--------------------------------------+-------+--------------+-------+----------+---------+
| 7db14634-4b35-4a47-bdb6-1e20810316a3 | event | instance_off | alarm | low | True |
+--------------------------------------+-------+--------------+-------+----------+---------+

Seems like something is not reloaded or not set up properly and events do not get to ceilometer at all.

messagingv2 is present in nova.conf

I am deploying a dummy environment to reproduce it for the third time. I verified this functionality with the previous charm release and a bundle in the "spell" below so it might be a regression.

Steps:

sudo add-apt-repository -y cloud-archive:ocata
sudo apt update && sudo apt install -yqq python-openstackclient python-aodhclient python-gnocchiclient

conjure-up dshcherb/spell-ocata-telemetry

#!/usr/bin/env bash
export OS_AUTH_URL=http://`juju run --unit keystone/0 "unit-get private-address"`:5000/v3
export OS_REGION_NAME=RegionOne
export OS_PROJECT_NAME=admin
export OS_PROJECT_DOMAIN_NAME=admin_domain
export OS_USER_DOMAIN_NAME=admin_domain
export OS_USERNAME=admin
export OS_PASSWORD=openstack
export OS_INTERFACE=public
export OS_IDENTITY_API_VERSION=3
export OS_AUTH_TYPE=password

openstack flavor create --public small --id auto --ram 512 --disk 1 --vcpus 2

openstack server create --image xenial-lxd --nic net-id=ubuntu-net --flavor small testsrv --key-name ubuntu-keypair

openstack alarm create --type event --name instance_off --description 'Instance powered OFF' --event-type "compute.instance.power_off.*" --enable True --query "traits.instance_id=string::`openstack server show testsrv -f value -c id`" --alarm-action 'log://' --ok-action 'log://' --insufficient-data-action 'log://'

openstack server stop testsrv

# wait

openstack alarm list

Tags: cpe-onsite
Dmitrii Shcherbakov (dmitriis) wrote :

If I enable the management plugin in RabbitMQ and look at the queues, the cloud where the problem exists shows only samples, not notifications. This seems to support the theory that the nova services are not emitting them (or are not able to submit them).

# rabbitmq-plugins enable rabbitmq_management

# # good cloud
# python `updatedb && locate rabbitmqadmin` -V openstack list queues name message_stats.publish message_stats.deliver -u $u -p $p | grep notifications
| notifications.audit | | |
| notifications.critical | | |
| notifications.debug | | |
| notifications.error | 1 | 1 |
| notifications.info | 52 | 52 |
| notifications.sample | 906 | 906 |
| notifications.warn | | |
| notifications_designate.info | 330 | |
| versioned_notifications.error | 1 | |
| versioned_notifications.info | 17

I can also confirm that those stats increase on the "good" cloud after I start and stop an instance (2 times for start and 2 times for power off: .start and .end events respectively).

# # bad cloud
# python `updatedb && locate rabbitmqadmin` -V openstack list queues name message_stats.publish message_stats.deliver -u $u -p $p | grep notifications
| notifications.audit | | |
| notifications.critical | | |
| notifications.debug | | |
| notifications.error | | |
| notifications.info | | |
| notifications.sample | 36 | 36 |
| notifications.warn | | |
| not...
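The comparison between the two listings can be sketched in Python as follows. The rows are copied from the outputs above; rows that are truncated in this report are skipped rather than guessed. The helper names are hypothetical and the parsing assumes the three-column layout (name, publish count, deliver count) requested on the command line.

```python
GOOD = """
| notifications.audit | | |
| notifications.critical | | |
| notifications.debug | | |
| notifications.error | 1 | 1 |
| notifications.info | 52 | 52 |
| notifications.sample | 906 | 906 |
| notifications.warn | | |
| notifications_designate.info | 330 | |
| versioned_notifications.error | 1 | |
| versioned_notifications.info | 17
"""

BAD = """
| notifications.audit | | |
| notifications.critical | | |
| notifications.debug | | |
| notifications.error | | |
| notifications.info | | |
| notifications.sample | 36 | 36 |
| notifications.warn | | |
| not...
"""


def to_int(cell):
    """Empty cells mean no messages were counted."""
    return int(cell) if cell.isdigit() else 0


def parse_queue_stats(table_text):
    """Map queue name -> (published, delivered); incomplete rows are ignored."""
    stats = {}
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 3 or not cells[0]:
            continue
        name, published, delivered = cells
        stats[name] = (to_int(published), to_int(delivered))
    return stats


def went_silent(good, bad):
    """Queues present in both listings that only receive messages on the good cloud."""
    common = good.keys() & bad.keys()
    return sorted(name for name in common if good[name][0] > 0 and bad[name][0] == 0)


silent = went_silent(parse_queue_stats(GOOD), parse_queue_stats(BAD))
print(silent)
```

With the rows above, the queues that are populated only on the good cloud are notifications.error and notifications.info, while notifications.sample is populated on both.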

Dmitrii Shcherbakov (dmitriis) wrote :

Interestingly, while the instance can be started and stopped on the bad cloud, if I tail nova-api-os-compute.log while starting an instance via `openstack server start test`, I get this:

http://paste.ubuntu.com/26185989/

The compute node itself should send a notification about the instance power-on event, so I looked at the logs there.

/var/log/nova/nova-compute.log
http://paste.ubuntu.com/26185995/

There are a lot of neutron-related errors there:

2017-12-14 23:48:29.443 237576 ERROR nova.compute.manager [req-5e2c3c1c-c7b7-419d-a2b9-1a3f9cbe2e5a - - - - -] [instance: b325bf7d-d794-4fd4-8f63-e585694f01f2] An error occurred while refreshing the network cache.

while the instance is running and I can connect to it via ssh if I create the right security groups or from a vrouter namespace:

root@converged4:/var/log/nova# virsh list --all
 Id Name State
----------------------------------------------------
 12 instance-00000004 running

Dmitrii Shcherbakov (dmitriis) wrote :

Turns out the errors above were due to the fact that I had created an HA router on a non-HA neutron.

Anyway, after solving that problem I encountered something else: the counters for versioned nova notifications increased (the messages were not consumed, though, due to https://bugs.launchpad.net/ceilometer/+bug/1665449), but the counters for unversioned notifications did not.

After staring at this for a while I saw a pattern that did not make any sense:

https://github.com/openstack/nova/blob/stable/pike/nova/conf/notifications.py#L88-L105
    cfg.StrOpt(
        'notification_format',
        choices=['unversioned', 'versioned', 'both'],
        default='both', ...

As we use the default "both" option, both topics should get messages, and the queues must receive them too (consumed or not).

While staring I also noticed that notifications_designate had increasing counters.

I checked whether the "good" cloud had a relation between charm-designate and charm-nova-compute-kvm, and it did not.

And this is why it all worked as expected there: a code path in charm-nova-compute overrides the "topics" option to use notifications_designate instead of notifications:
 https://review.openstack.org/#/c/521072 (my patch to remove that)

Here is a live demo of how it works and then breaks after a designate <-> nova-compute relation is added.

https://asciinema.org/a/zrIYQZhYWkhjjDXBiRVArttxK
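A rough Python model of why the override starves ceilometer is given below. It is only a sketch under assumptions: queues are taken to be named `<topic>.<priority>`, as in the queue listings earlier in this report, ceilometer is assumed to consume from the "notifications" topic, and the function names are hypothetical. It is not code from the charm or from oslo.messaging.

```python
def queues_for(topics, priority="info"):
    """Queue names that receive an unversioned notification at this priority."""
    return [f"{topic}.{priority}" for topic in topics]


def reaches_ceilometer(topics, priority="info"):
    """True if the notification lands on the queue ceilometer is assumed to read."""
    return f"notifications.{priority}" in queues_for(topics, priority)


# Without the designate relation nova publishes on the default topic.
without_relation = ["notifications"]
# With the relation the charm replaces the topic, as described above.
with_relation = ["notifications_designate"]

for label, topics in (("no relation", without_relation),
                      ("designate relation", with_relation)):
    print(label, queues_for(topics),
          "ceilometer gets it:", reaches_ceilometer(topics))
```

Because the topic is replaced rather than extended, nothing is published to notifications.info once the relation exists, which matches the empty counters on the bad cloud.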

Added field-high to facilitate the inclusion of https://review.openstack.org/#/c/521072 and backporting of this as my current project depends on this functionality.

summary: [17.11][ocata] Host power off event does not get passed from nova-os-api
- to ceilometer and, in turn, to aodh
+ to ceilometer and, in turn, to aodh | nova-compute <-> designate
+ relation breaks event delivery to ceilometer
no longer affects: charm-nova-cloud-controller
Changed in charm-nova-compute:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (master)

Reviewed: https://review.openstack.org/521072
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=b89de21d47dc9114d20456452cd867041bcd83c5
Submitter: Zuul
Branch: master

commit b89de21d47dc9114d20456452cd867041bcd83c5
Author: Dmitrii Shcherbakov <email address hidden>
Date: Fri Nov 17 15:13:22 2017 +0300

    drop driver and topic settings in designate ctx

    A similar change will land in designate to drop this extra topic as
    well.

    Also, driver and topic overrides in a service-specific context are not a
    good idea.

    Change-Id: I804a34fb044090010ecfd2560594cc1f55e9bd21
    Closes-Bug: #1710831
    Closes-Bug: #1738100

Changed in charm-nova-compute:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (stable/17.11)

Fix proposed to branch: stable/17.11
Review: https://review.openstack.org/528304

Changed in charm-nova-compute:
assignee: nobody → Dmitrii Shcherbakov (dmitriis)
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/17.11
Review: https://review.openstack.org/531334

OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-nova-compute (stable/17.11)

Change abandoned by James Page (<email address hidden>) on branch: stable/17.11
Review: https://review.openstack.org/528304

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (stable/17.11)

Reviewed: https://review.openstack.org/531334
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=4adfc46f842e1aec20e739a7b6d94ab14e7f4cfe
Submitter: Zuul
Branch: stable/17.11

commit 4adfc46f842e1aec20e739a7b6d94ab14e7f4cfe
Author: Dmitrii Shcherbakov <email address hidden>
Date: Fri Nov 17 15:13:22 2017 +0300

    drop driver and topic settings in designate ctx

    A similar change will land in designate to drop this extra topic as
    well.

    Also, driver and topic overrides in a service-specific context are not a
    good idea.

    Closes-Bug: #1710831
    Closes-Bug: #1738100
    (cherry picked from commit b89de21d47dc9114d20456452cd867041bcd83c5)

    Change-Id: Ie2d50b1ccd31ac430ce09cb96aceace3a31b2b92

OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (stable/17.11)

Fix proposed to branch: stable/17.11
Review: https://review.openstack.org/534506

Ante Karamatić (ivoks) wrote :

For stable you should just adjust the template to include the default topic:

templates/liberty/nova.conf:notification_topics = notifications,{{ notification_topics }}

For master, the handling of notification_topics should get a deeper review of how to approach this problem.
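As an illustration of what the suggested template line would render to, here is a small Python sketch. It mimics the substitution with plain `str.replace` instead of the real template engine, and the helper names are hypothetical. How nova treats a trailing comma in the rendered value is not covered here; the parser below simply drops empty entries.

```python
TEMPLATE_LINE = "notification_topics = notifications,{{ notification_topics }}"


def render(extra_topics):
    """Fill in the placeholder the way the template would (simplified)."""
    return TEMPLATE_LINE.replace("{{ notification_topics }}", extra_topics)


def parse_topics(conf_line):
    """Split the rendered value into a list of topics, dropping empty entries."""
    _, _, value = conf_line.partition("=")
    return [topic.strip() for topic in value.split(",") if topic.strip()]


rendered = render("notifications_designate")
print(rendered)
print(parse_topics(rendered))

# If the context supplies no extra topic, the default topic is still present.
print(parse_topics(render("")))
```

With this shape the default "notifications" topic is always kept, so designate and ceilometer would each get their own copy of the messages.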

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-nova-compute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/540834

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-nova-compute (master)

Reviewed: https://review.openstack.org/540834
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=6b9b02c11201d1afd960af0196eff56c6f81ffef
Submitter: Zuul
Branch: master

commit 6b9b02c11201d1afd960af0196eff56c6f81ffef
Author: Dmitrii Shcherbakov <email address hidden>
Date: Thu Feb 8 17:06:50 2018 +0300

    emit notifications on notifications_designate

    Change I804a34fb044090010ecfd2560594cc1f55e9bd21, commit hash
    b89de21d47dc9114d20456452cd867041bcd83c5 dropped notifications_designate
    completely to solve a problem described in pad.lv/1738100, however, to
    make this change backwards compatible notifications_designate should be
    used in addition to "notifications" topic used by default.

    This way designate will continue to receive notifications on
    notifications_designate from nova, and ceilometer will get notifications
    using the "notifications" topic.

    Change-Id: I245f5b263994c204a5e521dad542ed83952f54b8
    Related-Bug: #1710831
    Related-Bug: #1738100

James Page (james-page)
Changed in charm-nova-compute:
milestone: none → 18.02
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-nova-compute (stable/17.11)

Change abandoned by Dmitrii Shcherbakov (<email address hidden>) on branch: stable/17.11
Review: https://review.openstack.org/534506
Reason: Old code.

Dmitrii Shcherbakov (dmitriis) wrote :

The bug was worked around via a local patch.

A proper fix is now in master and this can wait until the Queens charm release (at least for my use-case).

Ryan Beisner (1chb1n)
Changed in charm-nova-compute:
status: Fix Committed → Fix Released
Tytus Kurek (tkurek) wrote :

Designate <-> Nova-compute relation is obsolete starting from Queens:

https://github.com/openstack/charm-specs/blob/master/specs/queens/implemented/designate-neutron.rst

The "notifications_designate" should be removed as it's blocking bug 1659943.
