Cinder backup appears as down

Bug #2026877 reported by Gorka Eguileor
This bug affects 1 person
Affects   Status   Importance   Assigned to   Milestone
Cinder    New      High         Unassigned

Bug Description

When running concurrent backup operations, the backup service may appear to be down, and its connection to the RabbitMQ broker may be lost.

This is problematic because any monitoring service (Pacemaker, Kubernetes/OpenShift probes) will detect that the service is down and take action.

That action is usually to restart the service, or to stop it and run it somewhere else. In both cases all ongoing operations are interrupted.

Increasing service_down_time is not a good workaround either, because the option also affects cinder-volume, and 60 seconds is not a short timeout to begin with.
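
To make the trade-off concrete, here is a minimal sketch of a heartbeat-based liveness check of the kind discussed above. It assumes a service counts as up while the time since its last recorded heartbeat does not exceed service_down_time. The function name and arguments are hypothetical and are not Cinder's actual code.

```python
from datetime import datetime, timedelta

# Assumption: 60 seconds, the service_down_time value mentioned in this report.
SERVICE_DOWN_TIME = 60


def service_is_up(last_heartbeat: datetime, now: datetime,
                  down_time: int = SERVICE_DOWN_TIME) -> bool:
    """Hypothetical check: up if the last heartbeat is no older than down_time."""
    return (now - last_heartbeat) <= timedelta(seconds=down_time)


start = datetime(2023, 7, 11, 11, 0, 0)
on_time = service_is_up(start, start + timedelta(seconds=59))
too_late = service_is_up(start, start + timedelta(seconds=61))
```

Under this model, raising the threshold only hides a late heartbeat, and it does so for every service that shares the setting.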

Example of the RabbitMQ connection issue:
  2023-07-11 11:02:30.117 136067 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer

If we increase service_down_time, we instead see complaints from the backup service about not being able to report to the DB in time:
  2023-07-11 11:25:29.215 378376 WARNING oslo.service.loopingcall [None req-57cdd23b-77a2-4b92-8075-e7ff971ae80e - - - - - -] Function 'cinder.service.Service.report_state' run outlasted interval by 61.92 sec
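
As a rough illustration of why an overrun like the one in this warning matters, the sketch below estimates the gap between two consecutive state reports and compares it with a down-time threshold. The 10 second report interval is an assumption for the example (it is not stated in this report), and both helper names are hypothetical.

```python
# Assumption: the service tries to report its state every 10 seconds.
REPORT_INTERVAL = 10.0


def heartbeat_gap(report_interval: float, overrun: float) -> float:
    """Approximate time between two reports when one run outlasts its interval."""
    return report_interval + overrun


def appears_down(gap: float, down_time: float) -> bool:
    """Hypothetical check: the service looks down once the gap exceeds down_time."""
    return gap > down_time


# 61.92 is the overrun from the warning above.
gap = heartbeat_gap(REPORT_INTERVAL, 61.92)
down_at_60 = appears_down(gap, 60.0)
down_at_120 = appears_down(gap, 120.0)
```

With these assumed numbers the service would look down at a 60 second threshold but not at 120 seconds, even though the reporting call is still running late.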

Changed in cinder:
importance: Undecided → High