add monitoring for AMQP MessagingTimeout

Bug #1936680 reported by Adam Dyess
This bug affects 1 person
Affects: OpenStack heat charm (Importance: Undecided; Assigned to: Unassigned)
Affects: OpenStack neutron-gateway charm (Importance: Undecided; Assigned to: Unassigned)
Affects: OpenStack nova-compute charm (Importance: Undecided; Assigned to: Unassigned)

Bug Description

I've found services that cannot connect to RabbitMQ and never seem to recover without a service restart.

Log entries like these appear for the affected services:

-nova-compute-
/var/log/nova/nova-compute.log.3.gz:2021-07-13 17:48:21.611 3204 ERROR oslo_service.periodic_task [req-6271da70-fd9b-49d3-8364-88b1eed4c43a - - - - -] Error during ComputeManager._instance_usage_audit: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID c2dd1131c3aa46aea7eb2fad90001e13

-heat-
/var/log/heat/heat-engine.log.2.gz:2021-07-14 17:15:59.913 29545 ERROR heat.engine.resource oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID bb7ee72411f94efd90974b7f63318ed1

Without monitoring for this issue, the only indication of this behavior is that the services can't be managed through the OpenStack API.

Please consider adding monitoring of connectivity to RabbitMQ from each of the services.
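As a rough illustration of what such a check could look like, here is a minimal Python sketch that scans a service log for recent MessagingTimeout errors and maps the count to a Nagios-style status code. It is a passive log check, not a live connectivity probe against RabbitMQ, and it is not taken from any charm. The function names, the 15-minute window, and the warn/crit thresholds are all assumptions chosen for the example.

```python
import gzip
import re
from datetime import datetime, timedelta

# Timestamp as it appears in the log lines quoted above,
# e.g. "2021-07-13 17:48:21.611" (fractional seconds are ignored).
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+")
MARKER = "oslo_messaging.exceptions.MessagingTimeout"


def read_log_lines(path):
    """Yield text lines from a plain or gzip-compressed log file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        yield from handle


def count_recent_timeouts(lines, now, window=timedelta(minutes=15)):
    """Count MessagingTimeout lines stamped within `window` before `now`.

    The 15-minute default is an assumption, not a recommended value.
    """
    count = 0
    for line in lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # e.g. a traceback continuation line without a timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if timedelta(0) <= now - stamp <= window:
            count += 1
    return count


def nagios_status(count, warn=1, crit=5):
    """Map a count to (exit_code, message); thresholds are hypothetical."""
    if count >= crit:
        return 2, f"CRITICAL: {count} MessagingTimeout errors in window"
    if count >= warn:
        return 1, f"WARNING: {count} MessagingTimeout errors in window"
    return 0, "OK: no recent MessagingTimeout errors"


def check_log(path, now=None):
    """Run the check against one log file and return (exit_code, message)."""
    now = now or datetime.now()
    return nagios_status(count_recent_timeouts(read_log_lines(path), now))
```

A wrapper script could call `check_log("/var/log/nova/nova-compute.log")`, print the message, and exit with the returned code. A check like this only catches the symptom after timeouts are logged, so an active probe of the AMQP connection would still be a useful complement.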

Adam Dyess (addyess) wrote :

This obviously could affect other AMQP services.

Edin S (exsdev) wrote :

I can confirm this also affects neutron-gateway.

Adam Dyess (addyess) wrote :

I've added "field medium" to this bug because the same symptoms keep recurring on a customer cloud. The only resolution is to restart n-c-c and heat-engine.
