Ceilometer-agent-compute service not running after Scale out of nova-cloud-controller application

Bug #1958126 reported by Gábor Mészáros
This bug affects 1 person
Affects: OpenStack Ceilometer Agent Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

On one deployment, all but 2 of the 114 ceilometer-agent units are blocked with the status "Services not running that should be: ceilometer-agent-compute".

This is mostly because nova-compute.service, which is a dependency of ceilometer-agent.service, is either not running or is being restarted frequently.

The workaround was simply to run a systemctl restart ceilometer-agent-compute.service on the affected units.
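
The workaround above could be scripted roughly as follows, so that the restart only happens on units where the service is actually down. This is a minimal sketch and not part of the charm: the helper name restart_if_inactive and its injectable run parameter are assumptions made for illustration. It relies only on systemctl is-active and systemctl restart, and has to be run as root on each affected unit.

```python
import subprocess

UNIT = "ceilometer-agent-compute.service"


def restart_if_inactive(unit=UNIT, run=subprocess.run):
    """Restart `unit` only when systemd reports it as not active.

    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without touching a real systemd.
    Returns True if a restart was issued, False otherwise.
    """
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    check = run(["systemctl", "is-active", "--quiet", unit])
    if check.returncode == 0:
        return False  # already running, nothing to do
    # check=True raises CalledProcessError if the restart itself fails.
    run(["systemctl", "restart", unit], check=True)
    return True


if __name__ == "__main__":
    print("restarted" if restart_if_inactive() else "already active")
```

Checking is-active first keeps the script safe to run repeatedly across all units, since healthy agents are left alone.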

System is ussuri/focal, charm revs.:
 cs:ceilometer-282, cs:ceilometer-agent-271
cs:~openstack-charmers-next/nova-cloud-controller-549
cs:nova-compute-334
(FCE template 21.10)

Event flow on one machine:
 - ceilometer-agent-compute.service was restarted 40 times within 5 minutes of running juju add-unit nova-cloud-controller --to 24:lxd
 - the last entry in the journalctl output for ceilometer-agent-compute.service is "Dependency failed for Ceilometer Agent Compute" (at 17:26:58)
 - the nova-compute.service journal is full of "Failed with result 'exit-code'" and "Scheduled restart job, restart counter is at 35" entries (the counter starts from 1). This series of restart retries spans the period during which ceilometer-agent.service was being restarted. At 17:26:58 a "Failed to start Openstack Compute." event is logged, and the next event, at 17:30:13, is "Started Openstack Compute." By that time ceilometer-agent is no longer being started, so it remains stuck in a failed/stopped state.
 - between 17:30:04 and 17:30:16 the nova-compute Juju unit's cloud-compute:483 relation is executed and finishes. The unit is ready.

However, there is no communication between the nova-compute unit and the ceilometer-agent Juju units, so ceilometer-agent is never started again and remains stopped.
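
The stuck state described above can be recognized from the unit's journal: the most recent start-related entry is "Dependency failed for Ceilometer Agent Compute" with no later "Started" entry. A minimal sketch of that check, assuming the journal lines have already been collected as strings (for example from journalctl -u ceilometer-agent-compute.service); the function name is hypothetical:

```python
def is_stuck_on_dependency(journal_lines, description="Ceilometer Agent Compute"):
    """Return True if the last start-related journal entry is a dependency failure.

    `journal_lines` is an iterable of journal lines in chronological order.
    """
    failed = f"Dependency failed for {description}"
    started = f"Started {description}"
    stuck = False
    for line in journal_lines:
        if failed in line:
            stuck = True   # systemd gave up starting the unit
        elif started in line:
            stuck = False  # a later successful start clears the condition
    return stuck
```

A unit for which this returns True is a candidate for the manual restart, provided nova-compute.service has come back up in the meantime.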

Getting logs out of these systems is hard: it is a secured environment that is only reachable through locked-down jump hosts.

Tags: aubergine
Aurelien Lourot (aurelien-lourot) wrote :

I believe this is a duplicate of lp:1947585

Bartosz Woronicz (mastier1) wrote :

Something similar happened to me. This was on Juju 3.1.5, if that matters.

○ ceilometer-agent-compute.service - Ceilometer Agent Compute
     Loaded: loaded (/lib/systemd/system/ceilometer-agent-compute.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2023-08-03 10:44:05 UTC; 39min ago
   Main PID: 277555 (code=exited, status=0/SUCCESS)
        CPU: 752ms

Aug 03 10:44:04 murus-compute-1 systemd[1]: Started Ceilometer Agent Compute.
Aug 03 10:44:04 murus-compute-1 ceilometer-agent-compute[277555]: Deprecated: Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Aug 03 10:44:05 murus-compute-1 systemd[1]: Stopping Ceilometer Agent Compute...
Aug 03 10:44:05 murus-compute-1 systemd[1]: ceilometer-agent-compute.service: Deactivated successfully.
Aug 03 10:44:05 murus-compute-1 systemd[1]: Stopped Ceilometer Agent Compute.
Aug 03 10:44:05 murus-compute-1 systemd[1]: Dependency failed for Ceilometer Agent Compute.
Aug 03 10:44:05 murus-compute-1 systemd[1]: ceilometer-agent-compute.service: Job ceilometer-agent-compute.service/start failed with result 'dependency'.
root@xxx-compute-1:~# ^C
