ceilometer reports ceilometer-agent-central not running

Bug #1664898 reported by Liam Young
This bug affects 1 person
Affects                              Status        Importance  Assigned to  Milestone
OpenStack AODH Charm                 Invalid       Critical    Unassigned
OpenStack Ceilometer Charm           Fix Released  Critical    David Ames
ceilometer (Juju Charms Collection)  Invalid       Critical    David Ames

Bug Description

On HA deploys with liberty+ I sometimes see:

Services not running that should be: ceilometer-agent-central

Revision history for this message
Liam Young (gnuoy) wrote :

I *think* the issue is that the ceilometer-agent-central service dies if the token it gets from keystone fails validation. Maybe this happens because keystone is split-brained while it comes up?

description: updated
Revision history for this message
Liam Young (gnuoy) wrote :

The error in ceilometer-agent-central.log is:

2017-02-15 09:16:43.724 14475 WARNING oslo_config.cfg [req-9066e130-3307-4c5f-bc11-f209497e6b22 admin - - - -] Option "api_workers" from group "DEFAULT" is deprecated. Use option "workers" from group "api".
2017-02-15 09:16:46.200 14475 ERROR ceilometer.agent.manager [-] Prevent pollster rgw.containers.objects for polling source meter_source anymore!
2017-02-15 09:16:56.324 14475 WARNING ceilometer.agent.discovery.endpoint [-] No endpoints found for service energy
2017-02-15 09:16:58.826 14475 ERROR ceilometer.agent.manager [-] Prevent pollster rgw.usage for polling source meter_source anymore!
2017-02-15 09:16:58.830 14475 ERROR ceilometer.agent.manager [-] Prevent pollster rgw.containers.objects.size for polling source meter_source anymore!
2017-02-15 09:17:01.731 14475 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 10.5.38.228:5672
2017-02-15 09:17:01.749 14475 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.5.38.228:5672
2017-02-15 09:17:02.381 14475 ERROR ceilometer.agent.manager [-] Prevent pollster rgw.objects.size for polling source meter_source anymore!
2017-02-15 09:17:02.382 14475 ERROR ceilometer.agent.manager [-] Prevent pollster rgw.objects for polling source meter_source anymore!
2017-02-15 09:17:02.397 14475 ERROR ceilometer.agent.manager [-] Prevent pollster rgw.objects.containers for polling source meter_source anymore!
2017-02-15 09:17:23.988 15942 WARNING oslo_config.cfg [req-06c9e15b-c02a-47cd-a76b-c1b454b8b879 admin - - - -] Option "metering_secret" from group "publisher_rpc" is deprecated. Use option "telemetry_secret" from group "publisher".
2017-02-15 09:17:23.992 15942 WARNING oslo_config.cfg [req-06c9e15b-c02a-47cd-a76b-c1b454b8b879 admin - - - -] Option "rabbit_hosts" from group "DEFAULT" is deprecated. Use option "rabbit_hosts" from group "oslo_messaging_rabbit".
2017-02-15 09:17:23.993 15942 WARNING oslo_config.cfg [req-06c9e15b-c02a-47cd-a76b-c1b454b8b879 admin - - - -] Option "rabbit_password" from group "DEFAULT" is deprecated. Use option "rabbit_password" from group "oslo_messaging_rabbit".
2017-02-15 09:17:23.993 15942 WARNING oslo_config.cfg [req-06c9e15b-c02a-47cd-a76b-c1b454b8b879 admin - - - -] Option "rabbit_userid" from group "DEFAULT" is deprecated. Use option "rabbit_userid" from group "oslo_messaging_rabbit".
2017-02-15 09:17:23.994 15942 WARNING oslo_config.cfg [req-06c9e15b-c02a-47cd-a76b-c1b454b8b879 admin - - - -] Option "rabbit_virtual_host" from group "DEFAULT" is deprecated. Use option "rabbit_virtual_host" from group "oslo_messaging_rabbit".
2017-02-15 09:17:23.994 15942 WARNING oslo_config.cfg [req-06c9e15b-c02a-47cd-a76b-c1b454b8b879 admin - - - -] Option "api_workers" from group "DEFAULT" is deprecated. Use option "workers" from group "api".
2017-02-15 09:17:24.364 15942 ERROR ceilometer.agent.manager [-] Prevent pollster rgw.objects.containers for polling source meter_source anymore!
2017-02-15 09:17:24.856 15942 WARNING ceilometer.agent.discovery.endpoint [-] No endpoints found for service energy
2017-02-15 09:17:25.365 15942 ERROR ceilometer.agent.manager [-] Prevent pollster rgw.usag...


Revision history for this message
David Ames (thedac) wrote :

Testing with the keystone change that waits on the VIP, I now see that ceilometer-agent-central gets stopped gracefully.

root@juju-thedac-machine-1:/home/ubuntu# systemctl status ceilometer-agent-central
● ceilometer-agent-central.service - Ceilometer Agent Central
   Loaded: loaded (/lib/systemd/system/ceilometer-agent-central.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2017-02-16 18:40:44 UTC; 46min ago
 Main PID: 15130 (code=exited, status=0/SUCCESS)

Feb 16 18:40:19 juju-thedac-machine-1 ceilometer-agent-central[15130]: 2017-02-16 18:40:19.724 15130 WARNING ceilometer.agent.discovery.endpoint [req-ab1a62
Feb 16 18:40:20 juju-thedac-machine-1 ceilometer-agent-central[15130]: 2017-02-16 18:40:20.305 15130 ERROR ceilometer.agent.manager [req-ab1a62aa-d4f0-4c9e-
Feb 16 18:40:20 juju-thedac-machine-1 ceilometer-agent-central[15130]: 2017-02-16 18:40:20.306 15130 ERROR ceilometer.agent.manager [req-ab1a62aa-d4f0-4c9e-
Feb 16 18:40:22 juju-thedac-machine-1 ceilometer-agent-central[15130]: 2017-02-16 18:40:22.007 15130 ERROR ceilometer.agent.manager [req-ab1a62aa-d4f0-4c9e-
Feb 16 18:40:22 juju-thedac-machine-1 ceilometer-agent-central[15130]: 2017-02-16 18:40:22.707 15130 WARNING ceilometer.neutron_client [req-ab1a62aa-d4f0-4c
Feb 16 18:40:22 juju-thedac-machine-1 ceilometer-agent-central[15130]: 2017-02-16 18:40:22.709 15130 ERROR ceilometer.agent.manager [req-ab1a62aa-d4f0-4c9e-
Feb 16 18:40:22 juju-thedac-machine-1 ceilometer-agent-central[15130]: 2017-02-16 18:40:22.743 15130 ERROR ceilometer.agent.manager [req-ab1a62aa-d4f0-4c9e-
Feb 16 18:40:22 juju-thedac-machine-1 ceilometer-agent-central[15130]: 2017-02-16 18:40:22.745 15130 ERROR ceilometer.agent.manager [req-ab1a62aa-d4f0-4c9e-
Feb 16 18:40:44 juju-thedac-machine-1 systemd[1]: Stopping Ceilometer Agent Central...
Feb 16 18:40:44 juju-thedac-machine-1 systemd[1]: Stopped Ceilometer Agent Central.

root@juju-thedac-machine-1:/home/ubuntu# date
Thu Feb 16 19:27:05 UTC 2017

This suggests corosync could be to blame. And lo and behold:

root@juju-thedac-machine-1:/home/ubuntu# crm status
Last updated: Thu Feb 16 19:28:13 2017 Last change: Thu Feb 16 18:41:18 2017 by hacluster via crmd on juju-thedac-machine-1
Stack: corosync
Current DC: juju-thedac-machine-2 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 5 resources configured

Online: [ juju-thedac-machine-1 juju-thedac-machine-2 juju-thedac-machine-3 ]

Full list of resources:

 Resource Group: grp_ceilometer_vips
     res_ceilometer_ens2_vip (ocf::heartbeat:IPaddr2): Started juju-thedac-machine-2
 Clone Set: cl_ceilometer_haproxy [res_ceilometer_haproxy]
     Started: [ juju-thedac-machine-1 juju-thedac-machine-2 juju-thedac-machine-3 ]
 res_ceilometer_polling (ocf::openstack:ceilometer-polling): Started juju-thedac-machine-1

Failed Actions:
* res_ceilometer_polling_monitor_0 on juju-thedac-machine-2 'not installed' (5): call=73, status=Not installed, exitreason='none',
    last-rc-change='Thu Feb 16 18:40:44 2017', queued=0ms, exec=2ms
* res_ceilometer_polling_monitor_0 on juju-thedac-machine-3 'not installed' (5): call=73, status=Not installed, exitreason='none',
   ...


Revision history for this message
David Ames (thedac) wrote :

One more point. This seems to only happen on the leader node.

Changed in ceilometer (Juju Charms Collection):
status: New → Triaged
importance: Undecided → Critical
milestone: none → 17.01
assignee: nobody → David Ames (thedac)
status: Triaged → In Progress
Revision history for this message
David Ames (thedac) wrote :

Investigating ceilometer-agent-central -> ceilometer-polling change:

https://github.com/openstack/charm-ceilometer/commit/3816946e2f769c37ca2ed15aa6a49e9583ad8900

Revision history for this message
David Ames (thedac) wrote :

The ceilometer-agent-central service still exists, but it runs /usr/bin/ceilometer-polling.
The service is still called ceilometer-agent-central and has a systemd init file.

This change attempted to address this issue:
https://github.com/openstack/charm-ceilometer/commit/3816946e2f769c37ca2ed15aa6a49e9583ad8900
https://bugs.launchpad.net/charms/+source/ceilometer/+bug/1606787

However it is failing on a simple Mitaka HA deploy:
http://pastebin.ubuntu.com/24009855/
http://pastebin.ubuntu.com/24009878/

And it also fails during charm upgrade as in comment #3

< wolsen> right, but I see what happened
< wolsen> afaict, the systemd init file was provided for ceilometer-agent-central and pointed at the ceilometer-polling
< wolsen> so you could use the same service

< thedac> Interesting. So does that make the PR unnecessary?
< wolsen> hmm I'm not sure its necessary no

It seems it is at least to some degree. I tested with the above commit reverted and this also fails with:
couldn't find command: ceilometer-agent-central

This was not discovered earlier because the amulet tests do not run HA tests. Also, the ceilometer-agent-central service is not stopped every time, so juju status can look good while crm status shows breakage.
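
Since juju status can look healthy while crm status does not, one way to surface the breakage is to scan the crm status output for its "Failed Actions:" section. The following is a minimal sketch, not part of the charm: the function name failed_actions and the regular expression are assumptions based only on the output format shown in the comment above, and other pacemaker versions may format these lines differently.

```python
import re

# Matches the first line of a failed action as shown in the crm status
# output quoted in this bug, for example:
# * res_ceilometer_polling_monitor_0 on juju-thedac-machine-2 'not installed' (5): call=73, ...
FAILED_ACTION = re.compile(
    r"^\*\s+(?P<action>\S+)\s+on\s+(?P<node>\S+)\s+'(?P<reason>[^']*)'\s+\((?P<rc>\d+)\)"
)


def failed_actions(crm_status_text):
    """Return (action, node, reason, rc) tuples found after 'Failed Actions:'."""
    results = []
    in_section = False
    for line in crm_status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Failed Actions:"):
            in_section = True
            continue
        if not in_section:
            continue
        match = FAILED_ACTION.match(stripped)
        if match:
            results.append(
                (match["action"], match["node"], match["reason"], int(match["rc"]))
            )
    return results


# Sample text copied from the crm status output in this bug report.
SAMPLE = """\
 res_ceilometer_polling (ocf::openstack:ceilometer-polling): Started juju-thedac-machine-1

Failed Actions:
* res_ceilometer_polling_monitor_0 on juju-thedac-machine-2 'not installed' (5): call=73, status=Not installed, exitreason='none',
    last-rc-change='Thu Feb 16 18:40:44 2017', queued=0ms, exec=2ms
"""

for action, node, reason, rc in failed_actions(SAMPLE):
    print(f"{node}: {action} failed ({reason}, rc={rc})")
```

In practice the text would come from running crm status on a cluster node; the sample string here only stands in for that output.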

So we need to address 3 issues.
1) Determine if the rename is necessary (res_ceilometer_agent_central->res_ceilometer_polling). If so possibly change the required services.

2) Fix the hand off to hacluster to correctly monitor the ceilometer-polling regardless of what it is called. This could be a disparity between the service name "ceilometer_agent_central" and the command that needs to be run: /usr/bin/ceilometer-polling.

3) Handle upgrades of the charm when the previous version uses the old resource. This needs to include removal of previously set up res_ceilometer_agent_central resources.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @thedac,

The upstream OCF file [1] (>= Liberty uses [2]) starts the process in its own way:

su ${OCF_RESKEY_user} -s /bin/sh -c "${OCF_RESKEY_binary} --config-file=$OCF_RESKEY_config \
       $OCF_RESKEY_additional_parameters"' >> /dev/null 2>&1 & echo $!' > $OCF_RESKEY_pid

So the service that the package installation starts automatically should be stopped in the OCF file once the OCF resource has taken over:

service ceilometer-agent-central stop >> /dev/null 2>&1

But the charm's service_running() [3] is what determines whether a process is running, and it does not account for the fact that many upstream OCF files manage processes in their own way rather than through SystemV/Upstart/Systemd. That is why the reported problem occurs.

So I think this problem is not caused by the two patches [4][5] that handle the ceilometer-agent-central -> ceilometer-polling change. We can fix it either by invoking the SystemV/Upstart/Systemd scripts directly from the OCF files to manage processes, or by adding code to the charm's service_running() to handle the OCF way. I think the latter is the better option. What's your view? Thanks.

[1] https://review.openstack.org/#/c/351038/18/ocf/openstack/ceilometer-agent-central
[2] https://review.openstack.org/#/c/351038/18/ocf/openstack/ceilometer-polling
[3] http://pastebin.ubuntu.com/24011489/
[4] https://review.openstack.org/#/c/351038/
[5] https://review.openstack.org/#/c/384438
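
To illustrate the second option, here is a minimal sketch of a check that also accepts a process started by the OCF agent, which records its PID in the file named by OCF_RESKEY_pid. This is not the charm's service_running(): the helper names pid_is_alive, ocf_pid_running and service_considered_running are hypothetical, and the PID file location is an assumption that would have to match whatever the OCF resource is configured with.

```python
import os
import tempfile


def pid_is_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def ocf_pid_running(pid_file):
    """Return True if pid_file names a live process, False otherwise."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable or malformed PID file: treat as not running.
        return False
    return pid > 0 and pid_is_alive(pid)


def service_considered_running(init_says_running, pid_file):
    """Accept either the init system's answer or a live OCF-recorded PID."""
    return init_says_running or ocf_pid_running(pid_file)


# Demonstration with a throwaway PID file holding this interpreter's own PID.
with tempfile.NamedTemporaryFile("w", suffix=".pid", delete=False) as handle:
    handle.write(str(os.getpid()))
    demo_pid_file = handle.name

try:
    print(service_considered_running(False, demo_pid_file))
finally:
    os.unlink(demo_pid_file)
```

A PID file can go stale and the PID can be reused by an unrelated process, so a real implementation would probably also want to compare the process command line against the expected binary.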

Revision history for this message
James Page (james-page) wrote :

Bearing in mind where we are in the cycle, I think we need to make a minimal change to move things forward for release next week.

The intent of running the central polling agent under corosync/pacemaker is to ensure it only runs in a single location in the cluster (otherwise polls are duplicated). So we should ensure that, as soon as the charm is clustered, the service checks for ceilometer-agent-central are dropped in the principal charm. There is probably also an impact on restarts of the ceilometer-agent-central service (these will need to be delegated to crm calls, I think).

Revision history for this message
James Page (james-page) wrote :

I'm not a huge fan of the OCF resources for OpenStack; we might want to consider moving to managing this process using the standard init mechanisms, rather than bespoke OCF resources.

David Ames (thedac)
Changed in charm-aodh:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/435540

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceilometer (master)

Reviewed: https://review.openstack.org/435540
Committed: https://git.openstack.org/cgit/openstack/charm-ceilometer/commit/?id=5951cbb7636dc1195c1c2e415c7647922336f2c5
Submitter: Jenkins
Branch: master

commit 5951cbb7636dc1195c1c2e415c7647922336f2c5
Author: David Ames <email address hidden>
Date: Fri Feb 17 07:42:57 2017 -0800

    Stop checking for ceilometer-agent-central service

    Since Liberty ceilometer-agent-central has been replaced with
    ceilometer-polling. There is some confusion there as the package
    is still named ceilometer-agent-central.

    For OpenStack releases >= Liberty stop checking for the
    ceilometer-agent-central service to be running.

    TODO for post release: remove the OCF management of the service(s).

    Change-Id: I5064ce130da1ec302245aaff5dbe93d9dab63b38
    Partial-bug: #1664898
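
The behaviour described in the commit message can be sketched as a release-gated service list. This is only an illustration under stated assumptions, not the merged change: the function services_to_check, the constant names and the contents of BASE_SERVICES are hypothetical, and only the rule itself (skip ceilometer-agent-central for releases >= Liberty) comes from the commit message.

```python
# OpenStack release names in chronological order, up to the time of this bug.
OPENSTACK_RELEASES = [
    "icehouse", "juno", "kilo", "liberty", "mitaka", "newton", "ocata",
]

# Illustrative list only; the charm's real service list may differ.
BASE_SERVICES = [
    "ceilometer-agent-central",
    "ceilometer-collector",
    "ceilometer-api",
]


def services_to_check(release, services=BASE_SERVICES):
    """Return the services whose running state should be verified.

    For releases >= liberty the ceilometer-agent-central check is dropped,
    because the process is started outside the init system by the OCF agent.
    """
    position = OPENSTACK_RELEASES.index(release)
    if position >= OPENSTACK_RELEASES.index("liberty"):
        return [name for name in services if name != "ceilometer-agent-central"]
    return list(services)


for release in ("kilo", "liberty", "mitaka"):
    print(release, services_to_check(release))
```

An unknown release name raises ValueError from list.index, which seems preferable here to silently checking the wrong set of services.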

James Page (james-page)
Changed in charm-aodh:
status: Triaged → Invalid
James Page (james-page)
Changed in charm-ceilometer:
assignee: nobody → David Ames (thedac)
importance: Undecided → Critical
status: New → In Progress
Changed in ceilometer (Juju Charms Collection):
status: In Progress → Invalid
James Page (james-page)
Changed in charm-ceilometer:
status: In Progress → Fix Released
milestone: none → 17.02