cinder-manage db prune race condition

Bug #2060201 reported by Robert Franzke
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
| --- | --- | --- | --- | --- |
| Cinder | New | Undecided | Unassigned | |

Bug Description

Description
===========

The `cinder-manage db prune` command can fail with an IntegrityError:

```
DBError detected when purging from volumes: (pymysql.err.IntegrityError) (1451, 'Cannot delete or update a parent row: a foreign key constraint fails (`cinder`.`volume_attachment`, CONSTRAINT `volume_attachment_ibfk_1` FOREIGN KEY (`volume_id`) REFERENCES `volumes` (`id`))')\n[SQL: DELETE FROM volumes WHERE volumes.deleted IS true AND volumes.deleted_at < %(deleted_at_1)s]\n[parameters: {'deleted_at_1': datetime.datetime(2024, 3, 5, 7, 8, 56, 775630)}]\n(Background on this error at: https://sqlalche.me/e/14/gkpj)."
```
This can be traced back to this line in the code: https://opendev.org/openstack/cinder/src/branch/master/cinder/db/sqlalchemy/api.py#L8070

Because deleted_age is defined inside the loop over the tables, it is recomputed from the current timestamp on every iteration, so each table is purged with a slightly later cutoff than the one before it.
This can lead to a race condition if dependent resources were deleted close to each other in time.
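
The drift can be illustrated with a minimal sketch. This is not the Cinder code: the function names, the `FakeClock` helper, and the one-minute step per table are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta


class FakeClock:
    """Hypothetical clock that advances one minute on every call,
    standing in for the time spent purging each table."""

    def __init__(self, start):
        self.current = start

    def __call__(self):
        value = self.current
        self.current += timedelta(minutes=1)
        return value


def cutoffs_inside_loop(tables, age_in_days, now):
    # Pattern described in the report: the cutoff is recomputed per table.
    result = {}
    for table in tables:
        result[table] = now() - timedelta(days=age_in_days)
    return result


def cutoffs_before_loop(tables, age_in_days, now):
    # Proposed pattern: compute the cutoff once and reuse it for every table.
    cutoff = now() - timedelta(days=age_in_days)
    return {table: cutoff for table in tables}


tables = ["volume_attachment", "volumes"]
start = datetime(2024, 3, 6, 9, 0, 30)

drifting = cutoffs_inside_loop(tables, 1, FakeClock(start))
fixed = cutoffs_before_loop(tables, 1, FakeClock(start))

for table in tables:
    print(table, drifting[table], fixed[table])
```

With the cutoff recomputed per table, the parent table ends up with a later cutoff than its child table, which is the window in which the foreign key violation can occur.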

Steps to reproduce
==================
Currently none, but an example scenario that should explain the problem follows.

Example Scenario
================

Imagine a situation where we have these entries in the volume_attachment table:

| id | volume_id | deleted | deleted_at |
| --- | ------ | ------- | ------------------- |
| 1 | 1 | 1 | 2024-03-05 09:00:00 |
| 2 | 2 | 1 | 2024-03-05 09:01:00 |

And these entries in the volumes table:

| id | deleted | deleted_at |
| --- | ------- | ------------------- |
| 1 | 1 | 2024-03-05 09:00:00 |
| 2 | 1 | 2024-03-05 09:01:00 |

When the `db prune 1` command is triggered at 2024-03-06 09:00:30, it will first clean up the volume_attachment with the id 1, as it is more than one day old.
The volume_attachment with the id 2 won't be touched, as it is less than one day old.

Assuming that the cleanup of the volume_attachment table takes one minute, the cleanup of the volumes table will be triggered with the timestamp 2024-03-06 09:01:30.
So it will try to delete both volumes, as both are now more than one day old.

But as the volume_attachment with the id 2 references the volume with the id 2, the deletion of the volume 2 is not possible due to the foreign key constraint.

This could be easily circumvented by defining the deleted_age before entering the loop over the tables.
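
The scenario above can be replayed with a small, self-contained sketch. It uses an in-memory SQLite database in place of MySQL and a heavily simplified schema; the `make_db` and `purge` helpers are hypothetical and only mimic the order of the deletes, not the actual Cinder implementation.

```python
import sqlite3


def make_db():
    """Build the two example tables with the rows from the scenario."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
    conn.execute(
        "CREATE TABLE volumes ("
        "id INTEGER PRIMARY KEY, deleted INTEGER, deleted_at TEXT)"
    )
    conn.execute(
        "CREATE TABLE volume_attachment ("
        "id INTEGER PRIMARY KEY, "
        "volume_id INTEGER REFERENCES volumes (id), "
        "deleted INTEGER, deleted_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO volumes VALUES (?, 1, ?)",
        [(1, "2024-03-05 09:00:00"), (2, "2024-03-05 09:01:00")],
    )
    conn.executemany(
        "INSERT INTO volume_attachment VALUES (?, ?, 1, ?)",
        [(1, 1, "2024-03-05 09:00:00"), (2, 2, "2024-03-05 09:01:00")],
    )
    conn.commit()
    return conn


def purge(conn, cutoffs):
    """Delete soft-deleted rows older than each table's cutoff, child table first."""
    try:
        for table, cutoff in cutoffs:
            conn.execute(
                f"DELETE FROM {table} WHERE deleted = 1 AND deleted_at < ?",
                (cutoff,),
            )
        conn.commit()
        return "ok"
    except sqlite3.IntegrityError as exc:
        conn.rollback()
        return f"IntegrityError: {exc}"


# Cutoff recomputed per table: volumes gets a cutoff one minute later.
drifting = purge(make_db(), [
    ("volume_attachment", "2024-03-05 09:00:30"),
    ("volumes", "2024-03-05 09:01:30"),
])

# Cutoff computed once before the loop: both tables share it.
fixed = purge(make_db(), [
    ("volume_attachment", "2024-03-05 09:00:30"),
    ("volumes", "2024-03-05 09:00:30"),
])

print(drifting)
print(fixed)
```

In the first run, volume 2 falls under the later cutoff while its attachment row still exists, so the delete on the volumes table is rejected. In the second run, volume 2 and its attachment are both left for a later prune.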

Expected result
===============
* No IntegrityError, even if some dependent entries were deleted close to one another in time

Actual result
=============
* IntegrityError caused by dependent resources that have not been cleaned up yet, because of the race condition

Further comments
================
- I will provide a fix on master by just moving the definition of deleted_age to before entering the loop over the tables.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/915064
