Cinder Backups Fail with cinder.exception.ServiceNotFound when using multiple availability zones

Bug #2035007 reported by Alan Baghumian
This bug affects 2 people
Affects  Status  Importance  Assigned to  Milestone
Cinder   New     Undecided   Unassigned

Bug Description

Hello There!

Cinder Backup fails to create backups when volumes reside in different availability zones.

cinder-scheduler.log:2023-09-09 21:20:05.973 855736 ERROR cinder.scheduler.manager [req-dbf44245-4df3-4c2b-b758-1af3c10a061a e40ba17bc77c462baa57b1f578b9b719 8f8ae25fbbff4025b932a9ef4add5d42 - - -] Service not found for creating backup.: cinder.exception.ServiceNotFound: Service cinder-backup could not be found.

Environment: Juju-deployed Focal/Yoga, HA Cinder:

$ juju status cinder
Model   Controller        Cloud/Region      Version  SLA          Timestamp
vstack  home-lab-default  Home-Lab/default  2.9.37   unsupported  14:35:22-07:00

App                  Version  Status  Scale  Charm          Channel       Rev  Exposed  Message
cinder               20.3.0   active      3  cinder         yoga/stable   650  no       Unit is ready
cinder-backup        20.3.0   active      3  cinder-backup  yoga/stable    63  no       Unit is ready
cinder-ceph-nvme     20.3.0   active      3  cinder-ceph    yoga/stable   510  no       Unit is ready
cinder-ceph-nvme-ec  20.3.0   active      3  cinder-ceph    yoga/stable   510  no       Unit is ready
cinder-ceph-ssd      20.3.0   active      3  cinder-ceph    yoga/stable   510  no       Unit is ready
cinder-hacluster              active      3  hacluster      2.0.3/stable  113  no       Unit is ready and clustered
cinder-mysql-router  8.0.34   active      3  mysql-router   8.0/stable     90  no       Unit is ready

The default nova AZ has been disabled in the environment. az1 is configured to use cinder-ceph-nvme as well as cinder-ceph-nvme-ec. az2 is configured to use cinder-ceph-ssd. Both Ceph clusters are Quincy (17.2.6).

$ openstack availability zone list --volume
+-----------+---------------+
| Zone Name | Zone Status |
+-----------+---------------+
| az1 | available |
| nova | not available |
| az2 | available |
+-----------+---------------+

The cinder-backup service is running and healthy on all cinder nodes:

$ openstack volume service list | grep cinder-backup | grep -w up
| cinder-backup | cinder | nova | enabled | up | 2023-09-09T21:38:15.000000 |
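To check which AZ each running cinder-backup service reports without reading the table by eye, a small helper can filter the JSON form of the same listing. This is a minimal sketch that assumes `openstack volume service list -f json` emits one object per row with keys matching the column headers (Binary, Host, Zone, State); the function name `backup_service_zones` is hypothetical.

```python
import json


def backup_service_zones(service_list_json: str) -> dict:
    """Map each up cinder-backup host to the availability zone it reports.

    Input is assumed to be the output of `openstack volume service list -f json`;
    the key names below are assumed to mirror the CLI column headers.
    """
    services = json.loads(service_list_json)
    return {
        svc["Host"]: svc["Zone"]
        for svc in services
        if svc["Binary"] == "cinder-backup" and svc["State"] == "up"
    }


# Sample built from the single row shown above.
sample = json.dumps([{
    "Binary": "cinder-backup", "Host": "cinder", "Zone": "nova",
    "Status": "enabled", "State": "up",
    "Updated At": "2023-09-09T21:38:15.000000",
}])
print(backup_service_zones(sample))  # {'cinder': 'nova'}
```

Fed the row above, it returns `{'cinder': 'nova'}`: the only running backup service sits in the disabled nova zone.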

I think the issue is that the cinder-backup service is still registered in the default availability zone "nova", while the target volumes are in either az1 or az2.

Looking at the Cinder Git repository, cinder/scheduler/manager.py, line 644:

    def create_backup(self, context, backup):
        availability_zone = backup.availability_zone
        volume_id = backup.volume_id
        volume = self.db.volume_get(context, volume_id)
        try:
            # Bug #1952805: an incremental backup will already have a host set,
            # and we must respect it
            if not backup.host:
                host = self.driver.get_backup_host(volume, availability_zone)
                backup.host = host
                backup.save()
            self.backup_api.create_backup(context, backup)
        except exception.ServiceNotFound:
            self.db.volume_update(context, volume_id,
                                  {'status': volume['previous_status'],
                                   'previous_status': volume['status']})
            msg = "Service not found for creating backup."
            LOG.error(msg)

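The error message above is logged by this except block, so the ServiceNotFound appears to be raised inside the try, most likely by get_backup_host(), presumably because the cinder-backup service is currently registered in the nova AZ. I could not find a configuration option or a Juju action to change this.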
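The suspected mismatch can be illustrated with a simplified model of AZ-filtered backup host selection. This is not Cinder's actual implementation: it assumes the backup inherits the volume's AZ (az1 or az2) when none is requested, and that only up services whose zone equals that AZ are eligible. `pick_backup_host` and the local `ServiceNotFound` class are hypothetical stand-ins.

```python
class ServiceNotFound(Exception):
    """Hypothetical stand-in for cinder.exception.ServiceNotFound."""


def pick_backup_host(backup_services, availability_zone):
    """Return an up cinder-backup host in the requested AZ (simplified model).

    backup_services is a list of (host, zone, is_up) tuples. Assumption:
    only services whose zone equals the backup's AZ are eligible.
    """
    eligible = [host for host, zone, is_up in backup_services
                if is_up and zone == availability_zone]
    if not eligible:
        raise ServiceNotFound("Service cinder-backup could not be found.")
    # Simplification: take the first eligible host (no load balancing).
    return eligible[0]


# The only backup service reported above lives in "nova".
services = [("cinder", "nova", True)]

print(pick_backup_host(services, "nova"))  # cinder
try:
    pick_backup_host(services, "az1")  # the volume lives in az1
except ServiceNotFound as exc:
    print("az1:", exc)
```

Under this model, a backup of an az1 or az2 volume can only be scheduled once some up cinder-backup service reports az1 or az2 as its zone, which is consistent with the scheduler error shown earlier.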

Juju cinder relations:

Relation provider                    Requirer                                     Interface         Type         Message
ceph-mon-nvme:client                 cinder-backup:ceph                           ceph-client       regular
cinder-backup:backup-backend         cinder:backup-backend                        cinder-backup     subordinate
cinder-ceph-nvme-ec:storage-backend  cinder:storage-backend                       cinder-backend    subordinate
cinder-ceph-nvme:storage-backend     cinder:storage-backend                       cinder-backend    subordinate
cinder-ceph-ssd:storage-backend      cinder:storage-backend                       cinder-backend    subordinate
cinder-hacluster:ha                  cinder:ha                                    hacluster         subordinate
cinder-mysql-router:shared-db        cinder:shared-db                             mysql-shared      subordinate
cinder:cinder-volume-service         nova-cloud-controller:cinder-volume-service  cinder            regular
cinder:cluster                       cinder:cluster                               cinder-ha         peer
glance:image-service                 cinder:image-service                         glance            regular
keystone:identity-service            cinder:identity-service                      keystone          regular
rabbitmq-server:amqp                 cinder:amqp                                  rabbitmq          regular
vault:certificates                   cinder:certificates                          tls-certificates  regular

Please let me know if you need anything else or would like me to perform additional testing or debugging; I'd be happy to help.

Thank you,
Alan

Brian Rosmaita (brian-rosmaita) wrote:

@Alan: can you list the Block Storage API calls that you are making, and include a volume-show response for the volume you're backing up? Thanks!

Alan Baghumian (alanbach) wrote:

@Brian Here you go. Please review and let me know if you'd like me to run more tests or scenarios. Happy to help!

Best,
Alan
